Prod2: Multiple Pipeline executions are getting stuck

Incident Report for Harness

Postmortem

Summary

Between 02:45 A.M. and 09:44 A.M. PT on December 29th, 2025, customers using Harness Pipelines in Prod2 experienced a small number of stuck pipeline executions.

Root Cause

On November 29th, we deployed a configuration change intended to improve system performance and observability. This change made the processing of a failure-strategy-related event asynchronous. As an unintended side effect, the existing idempotency logic incorrectly classified valid failure strategy events as duplicates. As a result, these events were not processed, causing a small number of pipeline executions that triggered failure strategies to become stuck.
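
As a purely illustrative sketch (not Harness code), the Python below shows one hypothetical way an idempotency guard that is safe in a synchronous model can drop valid events once processing becomes asynchronous. The class `FailureStrategyProcessor`, its methods, and the string event keys are all assumptions made for this example; the actual mechanism in this incident may have differed.

```python
from collections import deque


class FailureStrategyProcessor:
    """Hypothetical processor with a 'seen keys' idempotency guard."""

    def __init__(self):
        self.seen_keys = set()   # idempotency keys recorded so far
        self.queue = deque()     # pending events in the async model
        self.handled = []        # events whose failure strategy actually ran

    def _is_duplicate(self, key):
        # Records the key on first sight; any later sight counts as a duplicate.
        if key in self.seen_keys:
            return True
        self.seen_keys.add(key)
        return False

    def process_sync(self, key):
        # Synchronous model: the guard runs once, so a valid event is handled.
        if self._is_duplicate(key):
            return False
        self.handled.append(key)
        return True

    def submit_async(self, key):
        # Asynchronous model: the guard runs at enqueue time...
        if self._is_duplicate(key):
            return False
        self.queue.append(key)
        return True

    def drain(self):
        # ...and (in this assumed bug) again in the worker. The key was already
        # recorded at enqueue time, so the valid event is skipped as a duplicate.
        while self.queue:
            key = self.queue.popleft()
            if self._is_duplicate(key):
                continue
            self.handled.append(key)


sync_proc = FailureStrategyProcessor()
sync_proc.process_sync("exec-1/node-a")    # handled normally

async_proc = FailureStrategyProcessor()
async_proc.submit_async("exec-2/node-b")   # accepted into the queue
async_proc.drain()                         # dropped; async_proc.handled stays empty
```

In this sketch the event is never handled in the asynchronous path, which mirrors the reported symptom: the failure strategy does not run and the execution waits indefinitely.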

Impact

Customers experienced the following during this incident:

  • Pipeline executions became stuck after triggering a failure strategy.
  • Pipeline executions that use exclusive locks (Queue and Deployment steps) and were queued behind a stuck execution remained waiting until Harness aborted the stuck execution.
  • No other areas of Harness were impacted.

Mitigation

We identified the configuration change as the root cause and reverted it. The corrected configuration was deployed to Prod2, restoring normal pipeline behavior.

Following the fix, Harness ran a cleanup job in the subsequent hours to identify and abort any remaining stuck executions.

Prevention and Improvements

To prevent similar issues in the future, we are taking the following actions:

  • Expanding our automated job to detect stuck executions
  • Rolling out the newly built enhanced monitoring for the event publishing system to catch failures earlier
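
As a minimal sketch of what stuck-execution detection could look like, the Python below flags executions that are still running but have not recorded any event within a threshold. The `Execution` fields, the `"RUNNING"` status value, and the 30-minute default are hypothetical choices for illustration, not a description of the Harness job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Execution:
    # Hypothetical record shape; field names are illustrative only.
    execution_id: str
    status: str
    last_event_at: datetime


def find_stuck(executions, now, max_idle=timedelta(minutes=30)):
    """Return IDs of running executions with no recorded event for longer than max_idle."""
    return [
        e.execution_id
        for e in executions
        if e.status == "RUNNING" and now - e.last_event_at > max_idle
    ]


now = datetime(2025, 12, 29, 17, 0, tzinfo=timezone.utc)
executions = [
    Execution("a", "RUNNING", now - timedelta(hours=2)),       # idle too long
    Execution("b", "RUNNING", now - timedelta(minutes=10)),    # recently active
    Execution("c", "SUCCEEDED", now - timedelta(hours=7)),     # already finished
]
stuck_ids = find_stuck(executions, now)  # only "a" is flagged
```

A detector like this only surfaces candidates; whether to abort them automatically or alert an operator is a separate policy decision.
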
Posted Jan 07, 2026 - 21:40 PST

Resolved

Issue has been resolved.
Posted Dec 29, 2025 - 10:52 PST

Update

We are continuing to monitor for any further issues.
Posted Dec 29, 2025 - 09:18 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 29, 2025 - 08:38 PST

Update

The issue has been resolved, and we’re currently aborting the stuck executions.
Posted Dec 29, 2025 - 07:42 PST

Update

The issue has been identified and a fix is being rolled out. New pipeline executions should run normally, and existing stuck executions are being aborted.
Posted Dec 29, 2025 - 07:34 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Dec 29, 2025 - 07:32 PST

Update

We are continuing to investigate this issue.
Posted Dec 29, 2025 - 07:23 PST

Investigating

We are currently investigating this issue.
Posted Dec 29, 2025 - 07:22 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).