PROD2: Delegates got disconnected from Harness

Incident Report for Harness

Postmortem

Summary

A subset of Delegates in the prod2 cluster was disconnected, causing pipeline failures for customers. The cause was increased load on the backend database from an ad-hoc read query.

What was the issue?

Customer delegates were disconnected and pipelines were failing.

Resolution

We cancelled the runaway query and scaled up the database. Overall recovery took ~17 minutes, and the majority of Kubernetes delegates reconnected automatically. A few customers had to restart their non-Kubernetes delegates.

RCA

As part of regular operational work, we ran a read query that spiked CPU usage on the database. Unfortunately, this query was run against the primary replica, which increased query latency and resulted in some delegates being marked as disconnected.

Action Items

Enhanced access control: We provide Just-In-Time read access to our database for operational tasks. We are enhancing our system so that this access is granted only to non-primary replicas.
Enhanced resiliency: We are planning to run chaos experiments that simulate database latency, to improve the resiliency of our delegate management subsystem against such faults.
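
As an illustration of the first action item, the sketch below shows one way a guard could keep ad-hoc operational reads off the primary. It is a minimal sketch under stated assumptions, not a description of our actual tooling: the names Role, Node, NoReadReplicaAvailable, and pick_operational_read_node are hypothetical, and the node list stands in for whatever topology information the database exposes.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


@dataclass(frozen=True)
class Node:
    # Hypothetical description of one database node.
    host: str
    role: Role


class NoReadReplicaAvailable(RuntimeError):
    """Raised when no non-primary node can serve an operational read."""


def pick_operational_read_node(nodes: list[Node]) -> Node:
    """Return a non-primary node for an ad-hoc read, never the primary."""
    secondaries = [n for n in nodes if n.role is Role.SECONDARY]
    if not secondaries:
        # Fail closed: refusing the query is safer than loading the primary.
        raise NoReadReplicaAvailable("no non-primary replica available")
    return secondaries[0]
```

The guard fails closed on purpose: if no non-primary replica is available, the operational query is rejected rather than falling back to the primary.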

Posted Jan 27, 2025 - 14:16 PST

Resolved

All delegate connectivity has been restored. A detailed RCA will follow soon.

Posted Jan 21, 2025 - 03:21 PST

Identified

We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.

Posted Jan 21, 2025 - 03:10 PST

Investigating

A few delegates got disconnected from Harness.

Posted Jan 21, 2025 - 02:45 PST

This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).