Pipeline executions across CI/CD were not advancing as expected, and some were getting stuck in the prod2 cluster.
One of the microservices was running low on resources because a single pipeline execution consumed excessive resources while attempting to resolve expressions. As a result, all pipelines ran 80% slower for the duration of the incident, and a few executions became unresponsive.
The pipeline execution that was consuming excessive resources was aborted, and the service pods were restarted to recover the system.
A pipeline contained circular references in its files. Resolving these references at runtime caused an excessive number of threads to enter the Waiting state. Although the issue was detected automatically (and the system protected itself by breaking the circuit), too many threads were still consumed because the threshold in the loop detection logic was set too high.
Since then, we have further reduced the loop threshold and automated the blocking of such runaway pipeline executions.
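
To illustrate the kind of guard described above, the following is a minimal sketch of reference resolution that rejects circular references and caps the length of a reference chain. It is not the platform's actual implementation: the `${name}` placeholder syntax, the `resolve` function, the `ResolutionError` class, and the default `max_depth` value are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical placeholder syntax: "${name}" refers to another definition.
REF = re.compile(r"\$\{(\w+)\}")


class ResolutionError(Exception):
    """Raised when a reference chain is circular or exceeds the depth limit."""


def resolve(name, definitions, max_depth=8, _chain=()):
    """Expand all references in definitions[name], failing fast on loops.

    _chain holds the names currently being resolved, outermost first.
    """
    # A name that is already being resolved means the references form a loop.
    if name in _chain:
        cycle = " -> ".join(_chain + (name,))
        raise ResolutionError(f"circular reference: {cycle}")
    # Independent cap on chain length (the "loop threshold" in this sketch).
    if len(_chain) >= max_depth:
        raise ResolutionError(f"reference chain deeper than {max_depth}")

    chain = _chain + (name,)
    # Replace each placeholder with the fully resolved value of its target.
    return REF.sub(
        lambda m: resolve(m.group(1), definitions, max_depth, chain),
        definitions[name],
    )


if __name__ == "__main__":
    ok = {"a": "x-${b}", "b": "y-${c}", "c": "z"}
    print(resolve("a", ok))

    looping = {"a": "${b}", "b": "${a}"}
    try:
        resolve("a", looping)
    except ResolutionError as err:
        print(err)
```

In this sketch a lower `max_depth` makes a runaway resolution fail sooner, at the cost of rejecting legitimately deep reference chains, which is the trade-off behind tuning such a threshold.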