Users in Prod-2 cluster facing unexpected pipeline failures

Incident Report for Harness

Postmortem

What was the issue?

Customers experienced pipeline failures due to intermittent errors when submitting delegate tasks. The issue was identified by the error message:

UNAVAILABLE: Connection closed after GOAWAY. HTTP/2 error code: NO_ERROR, debug data: max_age

Timeline

Time (UTC) Event
06:00 AM First occurrence of the issue.
06:06 AM Alert from our monitoring system received; team started investigating.
06:42 AM Service instances scaled up to restore service.
07:00 AM Functionality enabled (which was already being rolled out behind a Feature Flag) to prevent the reoccurrence of the issue.

RCA

Pipeline execution functionality was degraded due to exhaustion of thread pool resources (responsible for secret resolution from custom secret manager). Trigger was pipeline run with a large number of secrets, which overwhelmed the thread pool responsible for resolving secrets. This reduced the capacity of the system, resulting in a build-up of delegate tasks awaiting submission. Eventually, those requests timed out, leading to pipeline failures.

Once issue got identified, we immediately scaled up our service infrastructure to handle the increased load. Subsequently, a feature flag to optimize secrets resolution flow was enabled. (This feature flag was in process to be enabled across all Harness environments in the next few days).

Actions Items

  1. Roll out the feature in all environments. (done)
  2. Enforce limit on number of simultaneous secret resolutions in a pipeline execution.
Posted Feb 16, 2025 - 21:11 PST

Resolved

This incident has been resolved. Please monitor this incident for Postmortem report.
Thanks for your patience.
Posted Feb 05, 2025 - 23:53 PST

Monitoring

The issues is mitigated now , We are actively monitoring it.
Posted Feb 05, 2025 - 22:57 PST

Investigating

We are currently investigating as issue in our Prod-2 cluster with CD Pipelines.
Posted Feb 05, 2025 - 22:29 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).