Harness Service facing issues with pipelines in Prod 2
Incident Report for Harness
Postmortem

We want to share the details of the pipeline service issue observed in our Prod 2 cluster that impacted our customers on March 15, 2023, starting at 22:55 PDT.

Impact:

During the incident, Harness pipeline failures did not stop pipeline execution immediately; the pipeline kept running until a task or pipeline timeout occurred. No customer reported being impacted by this issue.

Root Cause:

A code change in the latest version of Harness broke the clean-up task iterator that fails long-queued delegate tasks. Without this clean-up task, failure signals for long-running delegate tasks were not picked up, so when a delegate task failed, the pipeline remained in a running state until another timeout occurred.
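For illustration, this kind of clean-up mechanism can be thought of as a periodic background job that scans for delegate tasks stuck in the queue longer than a threshold and marks them as failed, so the pipeline engine receives a prompt failure signal instead of waiting for its own timeout. The sketch below is a minimal, hypothetical version of such an iterator; the class and method names (QueuedTaskCleanupIterator, DelegateTask, TaskStore, markFailed) are assumptions for illustration and do not reflect the actual Harness codebase.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal, hypothetical sketch of a clean-up iterator for long-queued delegate tasks.
// All names here are illustrative; this is not the actual Harness implementation.
public class QueuedTaskCleanupIterator {

  // How long a task may sit in the QUEUED state before it is considered stale.
  private static final Duration QUEUED_TIMEOUT = Duration.ofMinutes(10);

  private final TaskStore taskStore;
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

  public QueuedTaskCleanupIterator(TaskStore taskStore) {
    this.taskStore = taskStore;
  }

  // Run the clean-up pass periodically, e.g. once a minute.
  public void start() {
    scheduler.scheduleAtFixedRate(this::cleanUpStaleTasks, 1, 1, TimeUnit.MINUTES);
  }

  // Fail every delegate task that has been queued longer than the timeout,
  // so the pipeline gets a failure signal instead of waiting for its own timeout.
  void cleanUpStaleTasks() {
    Instant cutoff = Instant.now().minus(QUEUED_TIMEOUT);
    List<DelegateTask> stale = taskStore.findQueuedBefore(cutoff);
    for (DelegateTask task : stale) {
      taskStore.markFailed(task.id(), "Task expired in queue; no delegate picked it up in time");
    }
  }

  public void stop() {
    scheduler.shutdownNow();
  }

  // Hypothetical task record and persistence interface used by the sketch.
  record DelegateTask(String id, Instant queuedAt) {}

  interface TaskStore {
    List<DelegateTask> findQueuedBefore(Instant cutoff);

    void markFailed(String taskId, String reason);
  }
}

If a regression stops this kind of iterator from running, stale tasks accumulate in the database and pipelines only fail once their own task or pipeline timeouts elapse, which matches the behavior described above.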

Incident timeline:

All times are in PDT on March 15, 2023. 

21:30: Harness released a new version to the Prod 2 cluster.

21:49: An increased number of delegate tasks was identified via an automated alert. The Harness engineering and ops teams started investigating the spike.

22:55: We determined that this might impact customers and updated the Harness status page accordingly.

23:02: A potential root cause was identified, and work on a fix and validation started.

23:25: The root cause was confirmed. The status page was updated.

23:52: Hotfix deployment started.

23:59: The deployment was complete, and the fix was validated. The status page was updated accordingly.

Remediation:

A rollback was determined not to be the right solution, as it would not address the stale tasks already in the database. We chose to fix forward instead, and a hotfix was created and deployed.

Action items:

  • Short Term: We failed to identify this issue before release because of configuration differences between our test and production environments. Our first step is to align the two environments so that we do not repeat the same problem.
  • Long Term: We have reviewed and updated our change management process to ensure that environments do not drift out of sync.
Posted Mar 16, 2023 - 18:25 PDT

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor the service and ensure stability.
Posted Mar 16, 2023 - 00:16 PDT
Monitoring
Harness service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues.
Posted Mar 15, 2023 - 23:59 PDT
Identified
Pipelines that should fail immediately due to misconfiguration will instead time out. We have identified a potential cause of the service issues and are working hard to address them. Please continue to monitor this page for updates.
Posted Mar 15, 2023 - 23:25 PDT
Update
We are continuing to investigate this issue.
Posted Mar 15, 2023 - 23:01 PDT
Investigating
We are currently investigating this issue.
Posted Mar 15, 2023 - 22:55 PDT
This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise (CIE) - Cloud Builds, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)).