Pipeline executions across CI/CD/STO are not advancing as anticipated, with some even getting stuck across various customers in the prod2 cluster.
Time (PST) | Event |
---|---|
08:50 am | System instability alert was received and investigation was initiated. |
09:20 am | Identified the culprit configuration that led to increased load on our systems. |
10:00 am | Increased resource allocation to help with increased load. |
10:15 am | Updated the invalid configuration from the system to make it valid. |
11:00 am | Systems back to normal |
The Harness pipeline engine functions within a microservice ecosystem, working alongside various framework components to manage expressions. These expressions often involve variables and configuration files, which can be stored in Git repositories. One of these configuration file contained a self-referential expression. This recursive reference repeatedly called for the resolution of the same configuration file, triggering a loop that led the service to exhaust its resources.
We've refactored the configuration to remove the recursive reference and restarted the service. Additionally, we've deployed hotfixes to prevent the reintroduction of such configurations and implemented mechanisms to auto-detect and halt recursion within the service.
To expedite the RCA and mitigate incidents promptly, we're implementing additional logging and alerting mechanisms to detect specific instabilities. This will enhance our ability to identify and address issues swiftly.