On March 11, 2026, customers in the Prod2 environment experienced pipeline failures and degraded UI performance (states displayed with incorrect status), and CCM Dashboards were inaccessible to the affected customers. The issue was caused by a degradation in an internal shared infrastructure component used for coordination across services.
The incident began around 7:10 AM PST and was fully mitigated by approximately 10:12 AM PST. During this period, pipeline execution throughput was significantly impacted for affected customers.
The issue was caused by resource saturation in a shared infrastructure component used for distributed coordination, which led to increased latency and failures in service-to-service communication.
As a result, pipeline execution services were unable to process workloads efficiently, leading to a buildup of queued tasks and reduced system throughput.
Customers experienced pipeline failures, degraded UI performance with incorrect state statuses, and inaccessible CCM Dashboards.
The impact was limited to specific production environments, and no data loss occurred.
Immediate
Permanent
To prevent a recurrence, we are taking several steps: