The 'Customer Overview Page' was loading slowly in the Prod-2 cluster. All other critical functions remained unaffected.
Time (UTC) | Event |
---|---|
04:30 PM | We got an alert, and the customer also reported the issue. |
04:45 PM | An internal incident was raised, and the team started looking into the issue. |
05:11 PM | Root cause identified |
06:04 PM | Incident resolved |
The high CPU-intensive maintenance task and the long-running queries were terminated to resume normal operations.
The dashboard failed to retrieve data from the backend database as the CPU utilization had reached > 90%. The alert came into the system as a Warning event that got overlooked. We observed the CPU spike due to maintenance tasks, some sub-optimal queries running on the primary node, and several active connections from the application side. We proceeded after validating that the queries and the maintenance task could be terminated without any potential data loss.