Summary
On February 26, 2026, multiple customers experienced disruptions accessing Harness on Prod1 and Prod2. A transient network connectivity issue disrupted our backend systems, leaving the platform unresponsive. Service was restored within approximately one hour.
Impact
- Customers on Prod2 were unable to log in or access the Harness platform.
- Prod1 experienced login disruptions due to a cross-environment dependency on Prod2.
- Delegates disconnected; Kubernetes-based delegates reconnected automatically, while non-Kubernetes delegates required a manual restart.
Root Cause
A transient network connectivity disruption caused connection timeouts across the platform. The exact infrastructure-side trigger of the initial connectivity disruption is still under investigation.
Remediation
- Immediate: Affected services were manually restarted, clearing stuck connections and restoring platform availability.
- Short-term: Autoscaling limits were adjusted to better handle sudden reconnection load.
- Ongoing: Investigation into timeout configuration and application resilience improvements is in progress.
Action Items
To prevent such issues from happening again:
- Review and update timeout settings to fail fast and limit thread blocking during connectivity issues.
- Improve application resilience: enhance circuit breakers to contain connectivity failures and limit excessive retries.
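For illustration, the fail-fast and circuit-breaker action items could look roughly like the sketch below. This is a minimal, generic example in Python, not Harness's actual implementation; all names, thresholds, and timeouts are hypothetical.

```python
import time

class CircuitBreaker:
    """Fail fast after repeated backend failures instead of letting
    threads block on a degraded connection (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failures = 0
        self.opened_at = None                       # when the circuit opened

    def call(self, fn, *args, **kwargs):
        # While open, reject immediately until the reset timeout elapses.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

After three consecutive failures the breaker opens, and further calls are rejected immediately rather than tying up a thread for the full connection timeout, which is the thread-blocking behavior the action items aim to eliminate.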