Customer Overview Page is slow to load and failing in certain instances
Incident Report for Harness
Postmortem

Summary

The 'Customer Overview Page' was loading slowly, and in some instances failing to load, in the Prod-2 cluster. All other critical functions remained unaffected.

Timeline

Time (UTC)   Event
04:30 PM     An alert fired, and a customer also reported the issue.
04:45 PM     An internal incident was raised, and the team started investigating.
05:11 PM     Root cause identified.
06:04 PM     Incident resolved.

Resolution

The CPU-intensive maintenance task and the long-running queries were terminated to restore normal operations.

RCA

The dashboard failed to retrieve data from the backend database because CPU utilization on the database had exceeded 90%. The corresponding alert arrived as a Warning event and was overlooked. We traced the CPU spike to maintenance tasks, some sub-optimal queries running on the primary node, and several active connections from the application side. We terminated the queries and the maintenance task only after validating that doing so would not cause any data loss.
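As an illustration of the escalation gap described above, the following minimal sketch maps a CPU reading to a severity and a paging decision. Only the 90% figure comes from this report; the 75% warning threshold, the `AlertDecision` type, and the `classify_cpu_alert` function are hypothetical names chosen for the example and do not describe our actual alerting configuration.

```python
from dataclasses import dataclass

CRITICAL_CPU_PERCENT = 90.0  # utilization level cited in the RCA
WARNING_CPU_PERCENT = 75.0   # hypothetical value, assumed for illustration


@dataclass(frozen=True)
class AlertDecision:
    severity: str        # "ok", "warning", or "critical"
    page_on_call: bool   # True when an incident should be opened immediately


def classify_cpu_alert(cpu_percent: float) -> AlertDecision:
    """Decide how a database CPU reading should be routed."""
    if cpu_percent > CRITICAL_CPU_PERCENT:
        # Above the critical threshold the alert pages on-call
        # instead of arriving as a warning that can be overlooked.
        return AlertDecision("critical", True)
    if cpu_percent > WARNING_CPU_PERCENT:
        return AlertDecision("warning", False)
    return AlertDecision("ok", False)
```

For example, `classify_cpu_alert(93.5)` yields a critical decision with `page_on_call` set, whereas a reading of 80 stays a non-paging warning.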

Action Items

  1. We have moved the maintenance tasks to the secondary node.
  2. We are working on addressing the long-running queries coming from the application side.
  3. We are also working on implementing a server-side timeout for long-running queries.
  4. We will ensure that these alerts immediately trigger an incident for the on-call engineer.
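
The query timeout in item 3 can be sketched as follows. This report does not name the backend database, so the example uses Python's built-in `sqlite3` module purely as a stand-in: its progress handler lets the engine abort a statement once a deadline passes. The `run_with_deadline` helper and the timeout values are assumptions made for the example, not our production implementation.

```python
import sqlite3
import time


def run_with_deadline(conn: sqlite3.Connection, sql: str, timeout_s: float):
    """Run sql and abort it once timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s

    def past_deadline() -> int:
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Run the check every 1000 SQLite virtual-machine instructions.
    conn.set_progress_handler(past_deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"query exceeded {timeout_s}s") from exc
        raise  # unrelated database error, propagate unchanged
    finally:
        # Remove the handler so later statements are not affected.
        conn.set_progress_handler(None, 1000)
```

A short query returns normally, while a deliberately expensive one (for instance, a recursive CTE counting to a very large number) raises `TimeoutError` instead of holding CPU indefinitely.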
Posted Jan 23, 2024 - 10:43 PST

Resolved
The incident has been resolved.
Posted Jan 18, 2024 - 10:03 PST
Monitoring
The issue has been resolved and the overview page is back to normal. We are actively monitoring the systems.
Posted Jan 18, 2024 - 09:58 PST
Identified
The issue has been identified. The team is working to mitigate it and provide a solution as soon as possible.
Posted Jan 18, 2024 - 09:53 PST
Update
We are continuing to investigate this issue.
Posted Jan 18, 2024 - 09:20 PST
Investigating
We are currently investigating an issue where customer dashboards are slow to load or failing to load in a specific environment.
This does not impact running pipelines or deployments.
Posted Jan 18, 2024 - 09:14 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).