Harness Overview dashboard is not loading
Incident Report for Harness
Postmortem

Summary

The 'Harness Overview Page' failed to load intermittently in the Prod-2 cluster while the critical functionality remained unaffected.

Timeline

Time (UTC) Event
01:33 PM Received internal alerts for synthetic monitoring failure.
01:47 PM An internal incident was raised.
01:48 PM The root cause was identified.
02:16 PM Incident was resolved

Resolution

The high CPU-intensive queries were terminated in the backend database to resume normal operations.

RCA

The Overview dashboard failed to retrieve data from the backend database as the CPU utilisation reached critical levels resulting in significant delays in processing regular queries. We have isolated configurations within the database which in combination with the application’s retry mechanism lead to undue load on the database server leading to an unhealthy state.

Action Items

  1. We have modified the database configurations which shows significant promise as per initial observations.
  2. We are in the process of moving the queries to horizontally scaled database nodes.
  3. More database optimizations are being planned for the application calls
Posted Mar 22, 2024 - 10:07 PDT

Resolved
This incident has been resolved now.
Posted Mar 20, 2024 - 07:31 PDT
Monitoring
We have fixed the issue with Overview dashboard/landing dashboard and are actively monitoring it.
Posted Mar 20, 2024 - 07:19 PDT
Investigating
We are currently investigating an issue with Harness overview dashboard.
Posted Mar 20, 2024 - 07:01 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).