Platform is experiencing degraded performance for some organizations.

Incident Report for Harness

Postmortem

Summary

On April 30, 2026, between approximately 15:29 UTC and 17:00 UTC, customers in Prod 3 experienced degradation impacting delegate connectivity, instance synchronization, pipeline executions, and connector operations due to a spike in load on one of our services.

Service stability was restored through service scaling, infrastructure capacity increases, and database resource expansion.

Impact

Customer Impact:

Delegates disconnected intermittently during the incident window
Instance synchronization operations were delayed
Some pipeline executions and connector operations experienced failures or delays

Duration:

Delegate connectivity impact: ~15 minutes
Elevated service degradation: ~90 minutes

Root Cause

The incident was caused by a load spike that led to thread exhaustion and elevated request contention between internal services during a period of increased synchronization and delegate activity.
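
To illustrate the failure mode in general terms, here is a minimal sketch of how a fixed-size worker pool behaves once arrivals exceed its capacity. It is a toy discrete-time model, not a description of the affected service: the function name `simulate_pool` and all numbers are hypothetical and chosen only for illustration.

```python
def simulate_pool(workers, arrivals_per_tick, service_ticks, ticks):
    """Toy model of a fixed-size thread pool (all parameters are illustrative).

    Each request occupies one worker for `service_ticks` ticks.
    Returns the number of requests still waiting at the end of each tick.
    """
    busy = []    # remaining ticks for each in-flight request
    queued = 0   # requests waiting for a free worker
    depths = []
    for _ in range(ticks):
        # Requests with one tick left finish now; the rest move one tick closer.
        busy = [t - 1 for t in busy if t > 1]
        queued += arrivals_per_tick
        # Start as many waiting requests as there are free workers.
        started = min(workers - len(busy), queued)
        busy.extend([service_ticks] * started)
        queued -= started
        depths.append(queued)
    return depths


# 4 workers, 2 ticks per request: sustainable throughput is 2 requests per tick.
normal = simulate_pool(workers=4, arrivals_per_tick=2, service_ticks=2, ticks=6)
spike = simulate_pool(workers=4, arrivals_per_tick=3, service_ticks=2, ticks=6)
print(normal)  # queue stays empty
print(spike)   # queue keeps growing while the spike lasts
```

In this model the pool keeps up as long as arrivals stay at or below `workers / service_ticks`. Once they exceed it, every worker stays busy and the backlog grows for as long as the spike lasts, which is the pattern that thread exhaustion describes.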

Mitigation and Recovery

The following actions were taken to restore service stability:

Scaled management service replicas horizontally
Increased autoscaling thresholds and maximum replica counts
Expanded database compute capacity
Upgraded MongoDB infrastructure components
Stabilized delegate reassignment and reconnection processing
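
As a rough illustration of why the maximum replica count matters during a spike, the sketch below applies a simplified proportional scaling rule of the kind used by Kubernetes-style horizontal autoscalers. This report does not state which autoscaler is in use, so the rule, the function name `desired_replicas`, and all values are assumptions for illustration only.

```python
import math


def desired_replicas(current, metric, target, min_replicas, max_replicas):
    """Simplified proportional autoscaling rule (assumed, not from the report).

    Scales the replica count by the ratio of the observed metric to its target,
    then clamps the result to the configured bounds.
    """
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical numbers: 10 replicas at 90% utilization against a 60% target.
capped = desired_replicas(current=10, metric=90, target=60, min_replicas=2, max_replicas=12)
uncapped = desired_replicas(current=10, metric=90, target=60, min_replicas=2, max_replicas=20)
print(capped, uncapped)
```

With the lower ceiling the rule asks for 15 replicas but is held at 12, so the service stays over its target. Raising the maximum lets the autoscaler reach the count the load calls for.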

Services recovered progressively beginning at approximately 15:47 UTC, with full stability restored by ~17:00 UTC.

Preventive Actions

To prevent similar issues from recurring, we are implementing the following improvements:

Improving our circuit breakers and fail-fast protections between dependent services
Enhancing monitoring and alerting for thread pool saturation and queue buildup
Increasing baseline service headroom and resiliency protections
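
For readers unfamiliar with the first item, the following is a minimal sketch of the circuit-breaker pattern. It is a generic illustration and not the implementation planned here: the class names, thresholds, and injectable `clock` parameter are hypothetical, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without contacting the dependency."""


class CircuitBreaker:
    """Illustrative circuit breaker (hypothetical names and defaults)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up a thread on a struggling dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependent service in `breaker.call(...)`. While the circuit is open, requests are rejected immediately rather than waiting on the dependency, which keeps caller threads free and gives the dependency room to recover.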

Posted May 11, 2026 - 13:51 PDT

Resolved

This incident has been resolved.

Posted Apr 30, 2026 - 10:50 PDT

Identified

The issue has been identified and mitigated.

Posted Apr 30, 2026 - 10:27 PDT

Investigating

We are currently investigating this issue.

Posted Apr 30, 2026 - 09:25 PDT

This incident affected: Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, FME) and Prod 2 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI)).