The 'Customer Overview Page' was loading slowly in the Prod-2 cluster. All other critical functions remained unaffected.
The incident occurred when the dashboard failed to retrieve data from the backend database, which was traced back to the CPU utilization of the database exceeding 90%. This critical level of utilization triggered alerts. The surge in CPU usage was primarily due to an increase in load from the application's operations. The simultaneous demands on the database resources led to significant constraints, hindering its ability to process requests efficiently.
To mitigate the issue and restore normal operations, immediate action was taken to terminate long-running queries that were contributing to the high CPU utilization. Additionally, the number of data-consuming services was reduced temporarily. These measures effectively decreased the load on the database, allowing its operations to resume at a normal pace and ensuring the availability of the dashboard data retrieval functionality.
In response to this incident, the following action items have been identified and are being implemented to prevent recurrence and improve system resilience: