Platform is experiencing degraded performance for some organizations.

Incident Report for Harness

Postmortem

Summary

On April 30, 2026, between approximately 15:29 UTC and 17:00 UTC, customers in Prod 3 experienced degraded service affecting delegate connectivity, instance synchronization, pipeline executions, and connector operations, due to a spike in load on one of our services.

Service stability was restored through service scaling, infrastructure capacity increases, and database resource expansion.  

Impact

Customer Impact:

  • Delegates disconnected intermittently during the incident window
  • Instance synchronization operations were delayed
  • Some pipeline executions and connector operations experienced failures or delays

Duration:

  • Delegate connectivity impact: ~15 minutes
  • Elevated service degradation: ~90 minutes

Root Cause

The incident was caused by a load spike that led to thread exhaustion and elevated request contention between internal services during a period of increased synchronization and delegate activity.

Mitigation and Recovery

The following actions were taken to restore service stability:

  • Scaled management service replicas horizontally
  • Increased autoscaling thresholds and maximum replica counts
  • Expanded database compute capacity
  • Upgraded MongoDB infrastructure components
  • Stabilized delegate reassignment and reconnection processing

Services recovered progressively beginning at approximately 15:47 UTC, with full stability restored by ~17:00 UTC.  

Preventive Actions

To prevent such issues from happening again, we are implementing the following improvements:

  • Improving our circuit breakers and fail-fast protections between dependent services
  • Enhancing monitoring and alerting for thread pool saturation and queue buildup
  • Increasing baseline service headroom and resiliency protections
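
For readers unfamiliar with the circuit-breaker and fail-fast pattern named in the list above, the following is a minimal, generic Python sketch. It is an illustration only and does not represent the actual implementation in our services; the class name, failure threshold, and reset timeout are hypothetical. After a set number of consecutive failures, callers are rejected immediately instead of tying up threads waiting on a struggling dependency. Once the timeout has elapsed, a trial call is allowed through.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    # Hypothetical defaults; real values depend on the dependency being protected.
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not occupy a thread waiting on the dependency.
                raise CircuitOpenError("dependency unavailable, failing fast")
            # Timeout elapsed: fall through and allow a trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to a dependent service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to return a quick error or fallback rather than queue more work.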
Posted May 11, 2026 - 13:51 PDT

Resolved

This incident has been resolved.
Posted Apr 30, 2026 - 10:50 PDT

Identified

The issue has been identified and mitigated.
Posted Apr 30, 2026 - 10:27 PDT

Investigating

We are currently investigating this issue.
Posted Apr 30, 2026 - 09:25 PDT
This incident affected: Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform, FME) and Prod 2 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI)).