CI/CD/STO pipelines are experiencing slowness, and some executions are stuck
Incident Report for Harness
Postmortem

Summary

Pipeline executions across CI/CD/STO were not advancing as expected, and some were stuck entirely, affecting various customers in the prod2 cluster.

Timelines

Time (PST)  Event
08:50 am    System instability alert was received and investigation was initiated.
09:20 am    Identified the misconfiguration that led to the increased load on our systems.
10:00 am    Increased resource allocation to handle the increased load.
10:15 am    Corrected the invalid configuration in the system.
11:00 am    Systems returned to normal.

RCA

The Harness pipeline engine functions within a microservice ecosystem, working alongside various framework components to manage expressions. These expressions often involve variables and configuration files, which can be stored in Git repositories. One of these configuration files contained a self-referential expression. This recursive reference repeatedly triggered resolution of the same configuration file, creating a loop that exhausted the service's resources.
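For illustration only, the failure mode resembles the naive resolver sketched below (Java, with hypothetical names and a hypothetical "<+name>" expression syntax, not the actual Harness engine): a value that references itself forces the resolver to resolve the same name again on every pass, so resolution never terminates.

    // Minimal sketch, assuming expressions of the form "<+name>" are expanded
    // by looking "name" up in the same configuration map (hypothetical, simplified).
    import java.util.Map;

    public class NaiveResolver {
        private final Map<String, String> config;

        public NaiveResolver(Map<String, String> config) {
            this.config = config;
        }

        public String resolve(String name) {
            String value = config.get(name);
            int start;
            while ((start = value.indexOf("<+")) != -1) {
                int end = value.indexOf('>', start);
                String ref = value.substring(start + 2, end);
                // A self-referential value (e.g. buildArgs = "<+buildArgs> --verbose")
                // makes this call resolve the same name again, so the recursion
                // never bottoms out and the service eventually exhausts its resources.
                value = value.substring(0, start) + resolve(ref) + value.substring(end + 1);
            }
            return value;
        }
    }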

Resolution

We've refactored the configuration to remove the recursive reference and restarted the service. Additionally, we've deployed hotfixes to prevent the reintroduction of such configurations and implemented mechanisms to auto-detect and halt recursion within the service.
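As a rough sketch of the kind of auto-detection mechanism mentioned above (an assumed design, not the actual hotfix), the resolver can track which names are currently being resolved and halt as soon as a name re-enters its own resolution:

    // Minimal sketch, assuming the same hypothetical "<+name>" expression syntax
    // as above; the guard aborts resolution when it detects a cycle.
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.Map;

    public class GuardedResolver {
        private final Map<String, String> config;
        private final Deque<String> inProgress = new ArrayDeque<>();

        public GuardedResolver(Map<String, String> config) {
            this.config = config;
        }

        public String resolve(String name) {
            if (inProgress.contains(name)) {
                // The name is already being resolved higher up the call stack,
                // so this is a recursive reference: stop instead of looping.
                throw new IllegalStateException("Recursive expression detected: " + name);
            }
            inProgress.push(name);
            try {
                String value = config.get(name);
                int start;
                while ((start = value.indexOf("<+")) != -1) {
                    int end = value.indexOf('>', start);
                    String ref = value.substring(start + 2, end);
                    value = value.substring(0, start) + resolve(ref) + value.substring(end + 1);
                }
                return value;
            } finally {
                inProgress.pop();
            }
        }
    }

With a guard like this in place, a self-referential configuration fails fast with a clear error instead of consuming the service's resources.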

Additional Action Items

To expedite root cause analysis and mitigate future incidents promptly, we are implementing additional logging and alerting mechanisms to detect these specific instabilities, improving our ability to identify and address issues quickly.

Posted Mar 27, 2024 - 18:10 PDT

Resolved
This incident has been resolved. All pipelines should be running with normal latency.
Posted Mar 26, 2024 - 11:07 PDT
Monitoring
Pipelines are running with healthy metrics, and we are currently monitoring the systems.
Posted Mar 26, 2024 - 10:31 PDT
Update
We have identified a possible cause, and are continuing to investigate the source of the issue.
Posted Mar 26, 2024 - 10:15 PDT
Investigating
Our engineering teams have discovered that the pipelines are not running smoothly and are working to identify the issue as quickly as possible.
Posted Mar 26, 2024 - 09:00 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).