Summary
Between 10:07 AM–10:39 AM PST on Tuesday, May 12, 2026, customers using the prod1, prod2, and prod3 Production clusters experienced elevated latency and intermittent service degradation. During this timeframe, customers observed delegate timeouts, login failures, and pipeline execution failures
Root Cause
A recently introduced configuration change to a common infrastructure component caused unexpected resource pressure across nodes in the prod1, prod2, and prod3 production clusters. The peak traffic exacerbated the resource utilization and introduced elevated latency across several critical platform services.
Impact
- Customers in prod1, prod2, and prod3 experienced login and access failures for approximately 20 minutes.
- Delegate connectivity was intermittently impacted during the incident window.
- Pipeline executions and API requests experienced elevated failure rates and latency during the incident window.
Remediation
- Immediately Rolled back to the previous stable release, restoring customer pipeline functionality and alleviating node pressure.
Action Items
- Enhance perf testing to include such workloads so that we can catch issues before we hit production.
- Increase capacity across clusters to make sure we have enough headroom to absorb the traffic surges.