Summary
On March 6, 2026, customers on Prod4 experienced a service disruption affecting delegate connectivity, pipeline execution, and CI workflows. The disruption was caused by a configuration change introduced during a scheduled platform upgrade that reduced the connection capacity of our internal routing layer. This reduction, combined with a concurrent update to our delegate management service, caused a burst of delegate reconnections that exceeded the new capacity limit. Service was fully restored via rollback within 42 minutes.
Impact
- Delegates on Prod4 were unable to communicate with the Harness platform during the incident window.
- Pipeline executions requiring delegate tasks were blocked and could not progress.
- CI pipelines requiring secret resolution failed for the duration of the incident.
- Platform API and UI operations on Prod4 returned errors.
- Delegates reconnected automatically once service was restored — no manual restart was required.
Root Cause
A scheduled platform upgrade introduced a fixed connection limit in the internal routing layer that handles all delegate-to-platform traffic on Prod4. At the same time, the delegate management service underwent a rolling update that caused active delegates to reconnect simultaneously. The volume of concurrent reconnections exceeded the new fixed limit, blocking delegates from reaching the platform for the duration of the incident.
The previous routing configuration used an unbounded connection queue, which could absorb reconnection bursts of this nature without impact. The new fixed limit, sized for steady-state traffic, had no headroom for the reconnection surge produced by a concurrent rolling update.
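The failure mode above can be sketched in a few lines. The numbers below are illustrative only (the actual connection limit and delegate fleet size are not stated in this report): a fixed limit sized for steady-state traffic absorbs normal load without issue, but rejects most of a fleet-wide reconnection burst.

```python
# Illustrative sketch of the root cause: a fixed connection limit sized
# for steady state cannot absorb a fleet-wide reconnection burst.
# FIXED_LIMIT, STEADY_STATE, and RECONNECT_BURST are hypothetical values.
from threading import BoundedSemaphore

FIXED_LIMIT = 100        # hypothetical routing-layer connection cap
STEADY_STATE = 80        # typical concurrent delegate connections
RECONNECT_BURST = 500    # the entire fleet reconnecting at once

def try_connect(pool: BoundedSemaphore) -> bool:
    """Non-blocking acquire: models a routing layer that turns away
    new connections once the fixed limit is reached."""
    return pool.acquire(blocking=False)

pool = BoundedSemaphore(FIXED_LIMIT)

# Steady-state traffic fits comfortably under the limit.
steady_ok = sum(try_connect(pool) for _ in range(STEADY_STATE))

# Release those connections, then simulate the concurrent rolling
# update: every delegate reconnects at once and most are rejected.
for _ in range(steady_ok):
    pool.release()
burst_ok = sum(try_connect(pool) for _ in range(RECONNECT_BURST))

print(steady_ok)   # 80  -> all steady-state connections succeed
print(burst_ok)    # 100 -> only 100 of the 500 reconnecting delegates get through
```

The previous unbounded queue corresponds to removing the semaphore entirely: every reconnection eventually gets through, at the cost of no backpressure.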
Mitigation
Immediate:
Rolled back the routing component to the previous version, restoring the unbounded connection configuration and allowing all delegates to reconnect.
Short-term:
- Connection capacity was increased and tuned to handle full delegate reconnection bursts with sufficient headroom above steady-state load.
- Connection acquire timeout was extended so that temporary overload conditions resolve naturally rather than cascading into a self-sustaining failure.
- The routing component is being moved to its own independent release pipeline, decoupled from the main service upgrade, with dedicated post-deploy validation before traffic is promoted.
Ongoing:
- The delegate management service rolling update policy is being updated to stagger pod replacement one at a time, limiting the maximum reconnection burst to a fraction of the delegate fleet rather than the entire fleet simultaneously.
- Routing layer autoscaling limits are being raised so the cluster can expand connection capacity in response to load spikes during deployments.
Action Items
To prevent such issues from happening again, we are:
- Increasing the connection pool capacity to handle the full delegate reconnection burst with headroom.
- Extending connection acquire timeout to prevent transient overload from becoming a self-sustaining failure loop.
- Updating the delegate management service rolling update configuration to replace one pod at a time.
- Updating the deployment mechanism so the routing component can be deployed independently of the main service release pipeline, with dedicated validation.
- Raising routing layer autoscaling limits to allow capacity expansion during connection load spikes.