A subset of delegates in the prod2 cluster got disconnected, causing pipeline failures for customers. The cause was increased load on the backend database from an ad-hoc read query.
Customer delegates were disconnected and pipelines were failing.
We cancelled the runaway query and scaled up the database. Overall recovery took ~17 minutes, and the majority of Kubernetes delegates reconnected automatically. A few customers had to restart their non-Kubernetes delegates.
As part of regular operational work, we ran a read query against the database that spiked its CPU usage. The query ran against the primary replica, where the extra load increased query latency and caused some delegates to be marked as disconnected.
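
As an illustration of the failure mode described above, here is a minimal sketch of a pre-flight check that could catch an ad-hoc read aimed at the primary before it runs. It is not taken from our tooling: the `AdHocQuery` structure, the `check_ad_hoc_query` function, the "primary"/"replica" role labels, and the 30-second ceiling are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdHocQuery:
    """An ad-hoc read request (hypothetical structure for illustration)."""
    text: str          # the query itself, in whatever language the database uses
    target_role: str   # "primary" or "replica" (assumed role labels)
    timeout_ms: int    # requested time limit; 0 or less means none was set


# Assumed ceiling for ad-hoc work; a real value would depend on the database.
MAX_TIMEOUT_MS = 30_000


def check_ad_hoc_query(query: AdHocQuery) -> list[str]:
    """Return the reasons to reject this query (an empty list means allowed)."""
    problems = []
    # Reads on the primary compete with production traffic for CPU.
    if query.target_role != "replica":
        problems.append("ad-hoc reads must target a read replica, not the primary")
    # A time limit bounds how long a runaway query can hold resources.
    if query.timeout_ms <= 0:
        problems.append("ad-hoc reads must set a time limit")
    elif query.timeout_ms > MAX_TIMEOUT_MS:
        problems.append(f"time limit exceeds {MAX_TIMEOUT_MS} ms")
    return problems
```

In this sketch, a query like the one in this incident (aimed at the primary with no time limit) would be rejected with two reasons, while the same query aimed at a replica with a 5-second limit would pass.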