Customer Overview Page is slow to load and failing in certain instances

Incident Report for Harness

Postmortem

Summary

The 'Customer Overview Page' was loading slowly in the Prod-2 cluster. All other critical functions remained unaffected.

Timeline

Time (UTC)	Event
04:30 PM	We got an alert, and the customer also reported the issue.
04:45 PM	An internal incident was raised, and the team started looking into the issue.
05:11 PM	Root cause identified
06:04 PM	Incident resolved

Resolution

The CPU-intensive maintenance task and the long-running queries were terminated to restore normal operations.

RCA

The dashboard failed to retrieve data from the backend database because the database's CPU utilization had exceeded 90%. The corresponding alert was raised as a Warning event and was overlooked. The CPU spike was caused by maintenance tasks, some sub-optimal queries running on the primary node, and several active connections from the application side. We terminated the queries and the maintenance task after validating that doing so carried no risk of data loss.
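
To illustrate how a Warning-level alert like this one could be kept from being overlooked, here is a minimal sketch of a routing rule that pages on-call whenever CPU exceeds the 90% level mentioned above, regardless of the severity label. The names `CpuAlert`, `should_page_on_call`, and `CPU_PAGE_THRESHOLD` are hypothetical and do not describe the actual alerting configuration.

```python
from dataclasses import dataclass

# Threshold taken from the RCA (CPU utilization > 90%).
CPU_PAGE_THRESHOLD = 90.0


@dataclass
class CpuAlert:
    """Hypothetical shape of a CPU alert emitted by a monitoring system."""
    node: str
    cpu_percent: float
    severity: str  # label assigned upstream, e.g. "warning" or "critical"


def should_page_on_call(alert: CpuAlert) -> bool:
    """Page when CPU is above the threshold, even if the alert is only a warning."""
    return alert.cpu_percent > CPU_PAGE_THRESHOLD or alert.severity == "critical"


if __name__ == "__main__":
    overlooked = CpuAlert(node="primary", cpu_percent=93.0, severity="warning")
    print(should_page_on_call(overlooked))
```

Under this rule, the severity label alone no longer decides whether someone is paged; the measured value does.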

Action Items

We have moved the maintenance tasks to the secondary node.
We are working on addressing the long-running queries coming from the application side.
We are also working on implementing the server-side timeout for long-running queries.
We will ensure that these alerts immediately raise an incident with the on-call engineer.
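
As a rough illustration of the long-running query timeout mentioned above, the sketch below selects the active queries that have run longer than a limit, so they can be reviewed or terminated. It is database-agnostic: `ActiveQuery`, `queries_over_timeout`, and the 30-second `QUERY_TIMEOUT` are assumptions made for the example, not the actual implementation or the value in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

# Assumed limit for illustration only; the real value is not stated in this report.
QUERY_TIMEOUT = timedelta(seconds=30)


@dataclass
class ActiveQuery:
    """Hypothetical record describing a query currently running on the database."""
    query_id: str
    started_at: datetime


def queries_over_timeout(
    active: List[ActiveQuery],
    now: datetime,
    timeout: timedelta = QUERY_TIMEOUT,
) -> List[ActiveQuery]:
    """Return the queries whose running time strictly exceeds the timeout."""
    return [q for q in active if now - q.started_at > timeout]


if __name__ == "__main__":
    now = datetime(2024, 1, 18, 17, 0, 0, tzinfo=timezone.utc)
    active = [
        ActiveQuery("q1", now - timedelta(seconds=5)),
        ActiveQuery("q2", now - timedelta(minutes=10)),
    ]
    for q in queries_over_timeout(active, now):
        print(q.query_id)
```

In practice most databases offer a built-in server-side setting for this; the sketch only shows the selection logic such a limit implies.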

Posted Jan 23, 2024 - 10:43 PST

Resolved

The incident has been resolved.

Posted Jan 18, 2024 - 10:03 PST

Monitoring

The issue has been resolved and the overview page is back to normal. We are actively monitoring the systems.

Posted Jan 18, 2024 - 09:58 PST

Identified

The issue has been identified. The team is working to mitigate it and provide a solution as soon as possible.

Posted Jan 18, 2024 - 09:53 PST

Update

We are continuing to investigate this issue.

Posted Jan 18, 2024 - 09:20 PST

Investigating

We are currently investigating an issue where customer dashboards are slow to load or fail to load in a specific environment.
This does not impact running pipelines or deployments.

Posted Jan 18, 2024 - 09:14 PST

This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).