Pipelines and dashboards are impacted in Prod2

Incident Report for Harness

Postmortem

Summary

On March 11, 2026, customers in the Prod2 environment experienced pipeline failures and degraded UI performance (execution status displayed incorrectly), and CCM Dashboards were not accessible. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services.

The incident began around 7:10 AM PST and was fully mitigated by approximately 10:12 AM PST. During this period, pipeline execution throughput was significantly impacted for affected customers.

Root Cause

The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication.

As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.
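
The backlog effect described above can be illustrated with a minimal sketch. All numbers and names here are hypothetical and chosen only for illustration; they are not measurements from this incident. The model assumes a constant arrival rate and a per-minute processing capacity that drops temporarily and then recovers.

```python
def simulate_backlog(arrivals_per_min, capacity_per_min):
    """Return the queue depth at the end of each minute.

    arrivals_per_min: tasks arriving each minute (hypothetical constant)
    capacity_per_min: tasks the service can process in each minute
    """
    backlog = 0
    depths = []
    for capacity in capacity_per_min:
        # New work joins the queue; the service drains up to its capacity.
        backlog = max(0, backlog + arrivals_per_min - capacity)
        depths.append(backlog)
    return depths


# Hypothetical scenario: 100 tasks/min arrive. Capacity is 120 tasks/min,
# drops to 40 for three minutes (saturation), then returns to 120.
capacity = [120, 120, 40, 40, 40, 120, 120, 120]
print(simulate_backlog(100, capacity))
# -> [0, 0, 60, 120, 180, 160, 140, 120]
```

In this toy model the queue keeps draining for some time after capacity is restored, which is consistent with status continuing to lag after processing itself has recovered.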

Impact

Customers experienced the following:

  • Pipeline executions failing or not progressing
  • Increased pipeline execution times
  • UI delays due to processing backlogs

The impact was limited to the Prod2 environment, and no data loss occurred.

Mitigation

Immediate

  • Redirected services to a higher-capacity infrastructure instance to restore normal processing
  • Cleared accumulated processing backlogs to recover system throughput
  • Scaled supporting services to stabilize performance

Permanent

  • Improved monitoring and alerting for early detection of resource saturation
  • Implemented capacity and scaling improvements to handle higher load scenarios
  • Initiated architectural improvements to reduce reliance on shared coordination components

Action Items

To prevent a recurrence, we are taking the following steps:

  • Enhance alerting to detect early signs of infrastructure saturation
  • Review and optimize system behavior under high concurrency scenarios
  • Continue investigation into the triggering conditions and incorporate findings into long-term improvements
Posted Mar 17, 2026 - 16:43 PDT

Resolved

This incident has been resolved.
Posted Mar 11, 2026 - 11:45 PDT

Update

A fix has been implemented and we are monitoring the results.
Posted Mar 11, 2026 - 11:33 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 11, 2026 - 11:15 PDT

Update

All executions are currently on track and will complete. The UI is showing delayed status, and we are expediting the UI recovery.
Posted Mar 11, 2026 - 11:06 PDT

Update

Pipeline executions are running normally, but there is a delay in viewing them in the UI.
Posted Mar 11, 2026 - 10:18 PDT

Update

Dashboards are recovering. We are continuing to work on a fix for the pipeline issue.
Posted Mar 11, 2026 - 09:20 PDT

Update

We are continuing to work on a fix for this issue.
Posted Mar 11, 2026 - 08:55 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 11, 2026 - 08:42 PDT

Investigating

We are currently investigating this issue.
Posted Mar 11, 2026 - 08:38 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Custom Dashboards).