We want to share the details of the pipeline service issue observed in our Prod2 cluster that impacted our customers on March 15, 2023, starting at 22:55 PDT.
Harness pipeline failures did not stop pipeline execution immediately; affected pipelines kept running until a task or pipeline timeout occurred. No customers reported being impacted by this issue.
A code change in the latest version of Harness broke the clean-up task iterator responsible for failing long-queued delegate tasks. Without the clean-up task, failure signals from long-running delegate tasks were not picked up, so when a delegate task failed, its pipeline remained in a running state until another timeout occurred.
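To illustrate the mechanism, here is a minimal sketch of what a clean-up iterator of this kind does. All names, the task model, and the timeout threshold are hypothetical, not Harness's actual implementation: it scans queued delegate tasks and marks any that have waited past a threshold as failed, which is the failure signal the pipeline depends on to stop instead of running until its own timeout.

```python
import time
from dataclasses import dataclass

# Assumed threshold for a "long-queued" task (hypothetical value).
QUEUE_TIMEOUT_SECONDS = 10 * 60

@dataclass
class DelegateTask:
    task_id: str
    queued_at: float        # epoch seconds when the task entered the queue
    status: str = "QUEUED"  # QUEUED -> FAILED once cleaned up

def cleanup_iterator(tasks, now=None):
    """Fail delegate tasks that have been queued longer than the timeout.

    This is the behavior the broken iterator should have provided: without
    it, stale tasks stay QUEUED and pipelines keep running until another
    timeout fires.
    """
    now = time.time() if now is None else now
    failed = []
    for task in tasks:
        if task.status == "QUEUED" and now - task.queued_at > QUEUE_TIMEOUT_SECONDS:
            task.status = "FAILED"  # the failure signal the pipeline waits on
            failed.append(task.task_id)
    return failed
```

With this in place, a task queued longer than the threshold is failed on the next iteration; when the iterator itself breaks, nothing transitions stale tasks out of QUEUED, which matches the pipeline-stuck-in-running symptom described above.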
All times are in PDT on March 15, 2023.
21:30: Harness released a new version to the Prod2 cluster.
21:49: An automated alert flagged an increased number of queued delegate tasks. The Harness engineering and ops teams began investigating the spike.
22:55: We determined that the issue might impact customers and updated the Harness status page accordingly.
23:02: A potential root cause was identified, and work on a fix and its validation started.
23:25: The root cause was confirmed. The status page was updated.
23:52: Hotfix deployment started.
23:59: Deployment was complete, and the fix was validated. The status page was updated accordingly.
A rollback was determined not to be the right solution, as it would not remove the stale tasks from the database. We chose to fix forward instead: a hotfix was created and deployed.