Platform access issues in Prod1/Prod2/Prod3

Incident Report for Harness

Postmortem

Summary

Between 10:07 AM and 10:39 AM PDT on Tuesday, May 12, 2026, customers using the prod1, prod2, and prod3 production clusters experienced elevated latency and intermittent service degradation. During this window, customers observed delegate timeouts, login failures, and pipeline execution failures.

Root Cause

A recently introduced configuration change to a common infrastructure component caused unexpected resource pressure across nodes in the prod1, prod2, and prod3 production clusters. Peak traffic exacerbated this resource pressure and introduced elevated latency across several critical platform services.

Impact

  1. Customers in prod1, prod2, and prod3 experienced login and access failures for approximately 20 minutes.
  2. Delegate connectivity was intermittently impacted during the incident window.
  3. Pipeline executions and API requests experienced elevated failure rates and latency during the incident window.

Remediation

  • Rolled back immediately to the previous stable release, which restored customer pipeline functionality and alleviated node pressure.

Action Items

  1. Enhance performance testing to include such workloads so that we can catch these issues before they reach production.
  2. Increase capacity across clusters so that we have enough headroom to absorb traffic surges.
Posted May 12, 2026 - 17:03 PDT

Resolved

This incident has been resolved.
Posted May 12, 2026 - 10:55 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 12, 2026 - 10:52 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted May 12, 2026 - 10:46 PDT

Update

We are continuing to investigate this issue.
Posted May 12, 2026 - 10:37 PDT

Investigating

We are currently investigating this issue.
Posted May 12, 2026 - 10:33 PDT
This incident affected: Prod 1 (Platform), Prod 3 (Platform), and Prod 2 (Platform).