Deployment Degradation – Failures in Prod1,2,3

Incident Report for Harness

Postmortem

Incident Summary

On May 6 at 11:50 PM PST, we deployed a configuration change to one of our core pipeline
services. This change introduced an unintended interaction with our database layer, causing a
significant increase in write load. The resulting pressure degraded query and command
throughput across the platform.

Root Cause

The configuration change introduced a blocking condition on expression evaluation in the
pipeline service. When executions encountered blocked expressions, they failed and retried
repeatedly, generating a write storm against the database that increased write throughput 4x.
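
The fail-and-retry mechanism described above can be illustrated with a toy model. This is a sketch under stated assumptions, not a reconstruction of the pipeline service: the function name, the one-write-per-attempt rule, and all numbers are hypothetical and are not taken from the incident.

```python
def count_writes(executions: int, blocked: int, max_attempts: int) -> int:
    """Count database writes in a toy retry model (all inputs hypothetical).

    Assumption: every attempt persists one state record. Healthy executions
    succeed on the first attempt; blocked executions fail on every attempt
    and therefore write once per attempt.
    """
    if not 0 <= blocked <= executions:
        raise ValueError("blocked must be between 0 and executions")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    healthy = executions - blocked
    return healthy + blocked * max_attempts


baseline = count_writes(executions=1000, blocked=0, max_attempts=5)
with_blocking = count_writes(executions=1000, blocked=100, max_attempts=5)
print(baseline, with_blocking)
```

In this model a small fraction of blocked executions adds write load in proportion to the retry limit, which is why an unbounded or aggressive retry policy can turn a logic error into database pressure.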

Remediation

● Rolled back the configuration change.
● Applied database-level tuning to reduce write pressure and accelerate backlog drainage.
● Performed a controlled failover to a healthy database node to restore throughput.
● Scaled up database nodes to provide sufficient capacity for full recovery.

Preventive Actions

To prevent such issues from happening again, we are focusing on:

1. Increased database resilience: We are implementing automated load-shedding thresholds
that trigger on leading indicators (replication lag, session depth, op latency) before the database
reaches saturation, preventing retry storms from compounding into full degradation events.

2. We are optimizing databases so that we can increase write throughput by an order of magnitude and enable independent scaling of customer data workloads. This would have allowed us to drain the message backlog nearly instantaneously during this incident.
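
The load-shedding idea in item 1 can be sketched as a threshold check over the three leading indicators named there. This is a minimal sketch, not the actual implementation: the class names, field names, and threshold values are assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DbHealth:
    """Snapshot of leading indicators (field names are hypothetical)."""
    replication_lag_s: float
    session_depth: int
    op_latency_ms: float


@dataclass(frozen=True)
class Thresholds:
    """Illustrative limits only; real values would come from capacity testing."""
    replication_lag_s: float = 5.0
    session_depth: int = 200
    op_latency_ms: float = 50.0


def breached_indicators(health: DbHealth, limits: Thresholds) -> list[str]:
    """Return the names of indicators at or above their limit."""
    checks = {
        "replication_lag_s": health.replication_lag_s >= limits.replication_lag_s,
        "session_depth": health.session_depth >= limits.session_depth,
        "op_latency_ms": health.op_latency_ms >= limits.op_latency_ms,
    }
    return [name for name, breached in checks.items() if breached]


def should_shed(health: DbHealth, limits: Thresholds = Thresholds()) -> bool:
    """Shed load as soon as any leading indicator crosses its limit."""
    return bool(breached_indicators(health, limits))


# Usage: reject or defer new writes while the check reports a breach.
sample = DbHealth(replication_lag_s=7.0, session_depth=20, op_latency_ms=8.0)
print(should_shed(sample), breached_indicators(sample, Thresholds()))
```

Triggering on any single indicator is a deliberately conservative choice in this sketch: the point of leading indicators is to act before saturation, so one early signal is treated as enough to start shedding.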

Posted May 19, 2026 - 17:46 PDT

Resolved

This incident has been resolved.
Posted May 07, 2026 - 04:23 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 07, 2026 - 02:05 PDT

Update

We are continuing to investigate this issue.
Posted May 07, 2026 - 01:43 PDT

Update

We are continuing to investigate this issue.
Posted May 07, 2026 - 01:17 PDT

Investigating

Deployment Degradation – Failures in Prod1,2,3
Posted May 07, 2026 - 00:54 PDT
This incident affected:

● Prod 1: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, AI Test Automation, FME, Release Management.
● Prod 2: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, AI Test Automation, FME.
● Prod 3: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, FME.