Pipeline Executions Encountered Issues
Incident Report for Harness
Postmortem

Summary: 

Pipeline executions across CI/CD were not advancing as expected, with some getting stuck in the prod2 cluster.

What was the issue?

One of the microservices was running low on resources because a single pipeline execution consumed excessive resources while attempting to resolve expressions. As a result, all pipelines ran roughly 80% slower for the duration of the incident, and a few executions entered an unresponsive state.

Timeline:

Resolution: 

The pipeline execution that was consuming excessive resources was aborted, and the service pods were restarted to recover the system.

RCA

A pipeline contained circular references in its files. Resolving these references at runtime caused an excessive number of threads to enter the Waiting state. Although the issue was automatically detected and the system protected itself by breaking the circuit, threads were still consumed because the loop-detection threshold was set too high.

Since then, we have further lowered the loop-detection threshold and automated the blocking of such runaway pipeline executions.
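To illustrate the failure mode, the following is a minimal sketch (not Harness's actual implementation; the expression syntax, names, and threshold value are all hypothetical) of how circular expression references can be caught during resolution by tracking the chain of names currently being resolved and enforcing a depth threshold, rather than recursing until threads exhaust resources:

```python
# Hypothetical threshold; lowering it catches runaway loops sooner,
# at the cost of rejecting very deeply nested (but legitimate) expressions.
MAX_RESOLUTION_DEPTH = 32


class CircularReferenceError(Exception):
    """Raised when expression resolution detects a loop or exceeds the depth cap."""


def resolve(name, expressions, _stack=None):
    """Resolve `name` from a dict of templates, e.g. {"a": "x<+b>y", "b": "z"}.

    Expressions reference each other with a simplified <+other> syntax.
    """
    stack = list(_stack or [])
    if name in stack or len(stack) >= MAX_RESOLUTION_DEPTH:
        # Break the circuit immediately instead of letting threads pile up
        # in the Waiting state while the cycle re-resolves forever.
        raise CircularReferenceError(" -> ".join(stack + [name]))
    value = expressions[name]
    while "<+" in value:
        start = value.index("<+")
        end = value.index(">", start)
        ref = value[start + 2:end]
        # Recurse with the current chain so a repeated name is detected.
        value = value[:start] + resolve(ref, expressions, stack + [name]) + value[end + 1:]
    return value
```

For example, `resolve("a", {"a": "x<+b>y", "b": "z"})` returns `"xzy"`, while `resolve("a", {"a": "<+b>", "b": "<+a>"})` raises `CircularReferenceError` as soon as the cycle is seen, which is the behavior the lowered threshold and automated blocking aim for.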

Posted Jun 05, 2024 - 11:46 PDT

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted Jun 03, 2024 - 17:32 PDT
Update
We are continuing to monitor for any further issues.
Posted Jun 03, 2024 - 17:27 PDT
Monitoring
Harness service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure performance remains stable.
Posted Jun 03, 2024 - 17:17 PDT
Identified
We have identified a potential cause of the pipeline service issues and are working hard to address it. Please continue to monitor this page for updates.
Posted Jun 03, 2024 - 16:30 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Security Testing Orchestration (STO)).