Customers experienced pipeline failures due to intermittent errors when submitting delegate tasks. The issue was identified by the error message:
UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age
Timeline
Time (UTC) | Event |
---|---|
06:00 AM | First occurrence of the issue. |
06:06 AM | Alert from our monitoring system received; team started investigating. |
06:42 AM | Service instances scaled up to restore service. |
07:00 AM | Enabled functionality that was already being rolled out behind a feature flag, to prevent recurrence of the issue. |
RCA
Pipeline execution was degraded by exhaustion of the thread pool responsible for resolving secrets from a custom secret manager. The trigger was a pipeline run that referenced a large number of secrets, which overwhelmed that thread pool. This reduced the capacity of the system and caused a build-up of delegate tasks awaiting submission. Eventually, those requests timed out, leading to pipeline failures.
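The failure mode described above can be illustrated with a small simulation. This is only a sketch, not the actual service code: the pool size, latencies, timeout, and the names `resolve_secret`, `submit_delegate_task`, and `simulate` are all hypothetical, chosen to show how one large burst of secret resolutions on a shared, fixed-size pool can make an unrelated task submission miss its deadline.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# All values below are hypothetical and chosen only for illustration.
POOL_SIZE = 2            # workers in the shared secret-resolution pool
RESOLVE_SECONDS = 0.05   # simulated latency of one secret lookup
SUBMIT_TIMEOUT = 0.2     # deadline for one delegate task submission


def resolve_secret(name: str) -> str:
    """Stand-in for a slow call to a custom secret manager."""
    time.sleep(RESOLVE_SECONDS)
    return f"resolved:{name}"


def submit_delegate_task(pool: ThreadPoolExecutor, secret_name: str) -> str:
    """Submit a task that must resolve one secret before its deadline."""
    future = pool.submit(resolve_secret, secret_name)
    try:
        future.result(timeout=SUBMIT_TIMEOUT)
        return "submitted"
    except FutureTimeout:
        # The lookup is still queued behind the backlog; drop it.
        future.cancel()
        return "timed out"


def simulate(secrets_in_large_pipeline: int) -> str:
    """Return the outcome of an unrelated submission made during a burst."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # One pipeline run floods the shared pool with secret lookups.
        backlog = [
            pool.submit(resolve_secret, f"secret-{i}")
            for i in range(secrets_in_large_pipeline)
        ]
        # Another pipeline's submission now queues behind that backlog.
        outcome = submit_delegate_task(pool, "other-pipeline-secret")
        for pending in backlog:
            pending.result()
    return outcome


if __name__ == "__main__":
    print("idle pool:     ", simulate(0))
    print("saturated pool:", simulate(20))
```

With these assumed numbers, the submission succeeds when the pool is idle and times out when 20 lookups are queued ahead of it, because the queued work takes roughly 0.5 seconds to drain on two workers while the deadline is 0.2 seconds.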
Once the issue was identified, we immediately scaled up our service infrastructure to handle the increased load. We then enabled a feature flag that optimizes the secrets resolution flow. (This feature flag was already in the process of being enabled across all Harness environments over the next few days.)
Action Items