Overview Page is slow to load and failing in certain instances

Incident Report for Harness

Postmortem

Overview

The 'Customer Overview Page' loaded slowly, and in some instances failed to load, in the Prod-2 cluster. All other critical functions remained unaffected.

Timeline

What was the issue?

The dashboard failed to retrieve data from the backend database because the database's CPU utilization exceeded 90%, a level that triggered alerts. The surge in CPU usage was primarily due to increased load from the application's operations. These simultaneous demands constrained database resources and prevented the database from processing requests efficiently.

Resolution

To mitigate the issue and restore normal operations, we immediately terminated the long-running queries that were contributing to the high CPU utilization. We also temporarily reduced the number of data-consuming services. These measures decreased the load on the database, which allowed it to resume processing at a normal pace and restored data retrieval for the dashboard.
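
As an illustration only, the selection of long-running queries described above might be sketched as follows. This is not the tooling used during the incident: the `ActiveQuery` record, the service names, and the 60-second threshold are all hypothetical, and a real snapshot would come from the database's own activity view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActiveQuery:
    """One row from a hypothetical 'active queries' snapshot."""
    query_id: int
    service: str            # which consuming service issued the query
    runtime_seconds: float  # how long the query has been running


def select_queries_to_terminate(queries, max_runtime_seconds=60.0):
    """Return queries over the runtime threshold, longest-running first."""
    over_limit = [q for q in queries if q.runtime_seconds > max_runtime_seconds]
    return sorted(over_limit, key=lambda q: q.runtime_seconds, reverse=True)


# Hypothetical snapshot; the values are made up for the example.
snapshot = [
    ActiveQuery(1, "overview-dashboard", 4.2),
    ActiveQuery(2, "reporting-job", 310.0),
    ActiveQuery(3, "overview-dashboard", 95.5),
]
to_terminate = select_queries_to_terminate(snapshot)
print([q.query_id for q in to_terminate])
```

Sorting longest-running first means an operator would review the most expensive queries before the rest. The termination itself is left out because it depends on the database in use.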

Action Items

In response to this incident, the following action items have been identified and are being implemented to prevent recurrence and improve system resilience:

Distribute Database Load: To better handle incoming queries, especially during peak times, we will spread the database query load across two database instances.
Annotate Logs for Better Analysis: We are enhancing our logging by annotating logs with details that help identify patterns in query behavior. This will allow more granular analysis of how queries use database resources.
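
The two action items above could look roughly like the following sketch, which rotates queries across two instances and attaches annotations to each log record. It is an assumption-laden illustration, not the planned implementation: the round-robin policy, the `RoundRobinRouter` class, the instance names, and the annotation fields are all hypothetical.

```python
import itertools
import logging

logger = logging.getLogger("query_router")


class RoundRobinRouter:
    """Rotate queries across a fixed set of database instances."""

    def __init__(self, instances):
        if not instances:
            raise ValueError("at least one instance is required")
        self._cycle = itertools.cycle(list(instances))

    def route(self, query_name, service):
        """Pick the next instance and log the choice with annotations."""
        instance = next(self._cycle)
        # 'extra' attaches the annotations to the LogRecord as attributes,
        # so a formatter or log pipeline can filter and group on them.
        logger.info(
            "routing query",
            extra={
                "query_name": query_name,
                "service": service,
                "db_instance": instance,
            },
        )
        return instance


# Hypothetical instance names; two instances, as in the action item.
router = RoundRobinRouter(["db-instance-1", "db-instance-2"])
targets = [router.route("overview_summary", "overview-dashboard") for _ in range(4)]
print(targets)
```

Round-robin is the simplest policy to show here. Routing by instance load or by query type would be equally plausible, and the source does not say which approach will be used.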

Posted Mar 15, 2024 - 12:42 PDT

Resolved

We can confirm normal operation.
We will continue to monitor and ensure stability.

Posted Mar 14, 2024 - 11:03 PDT

Monitoring

The Overview page latency is back within normal limits at this time. We are still monitoring the system for any issues.

Posted Mar 14, 2024 - 10:26 PDT

Identified

Due to additional load, the system is still not back to normal operations. We are actively debugging this incident.

Posted Mar 14, 2024 - 10:01 PDT

Monitoring

We are monitoring the service to ensure normal performance continues.

Posted Mar 14, 2024 - 09:39 PDT

Identified

The resource constraint has been identified and we are working to mitigate the situation.

Posted Mar 14, 2024 - 09:21 PDT

Investigating

We are currently investigating this issue.

Posted Mar 14, 2024 - 09:07 PDT

This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).