Hosted CI pipelines for a few accounts in Prod1/Prod2 are queued

Incident Report for Harness

Postmortem

Summary

Between 8:30 AM and 11:30 AM PST on December 3rd, 2025, customers in the Prod-2 environment experienced issues where some CI builds remained stuck in the Queued state despite being under their concurrency limits. New builds continued running, but previously queued executions did not transition to Running until a fix was deployed.

Root Cause

A recently introduced change in the concurrency queuing logic caused stale messages to remain in the Redis queue. These stale messages could not be processed because their corresponding executions had already been aborted and their metadata removed. The dequeue poller repeatedly attempted to acquire a lock for these messages and failed, placing them back at the head of the queue and preventing newer queued executions from being processed.

Because there was no fallback logic to clean up these stale messages, older builds remained queued indefinitely.
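The blocking behavior described above can be illustrated with a minimal in-memory sketch. This is not the actual Harness implementation: the queue is a plain Python deque rather than Redis, and the names `poll_once`, `queue`, and `metadata` are hypothetical. It assumes only what the report states, namely that a failed lock acquisition puts the message back at the head of the queue.

```python
from collections import deque


def poll_once(queue, metadata):
    """One dequeue attempt (hypothetical model of the faulty behavior).

    Returns the execution id that may start, or None if nothing started.
    """
    if not queue:
        return None
    execution_id = queue.popleft()
    if execution_id not in metadata:
        # Assumption: the lock cannot be acquired because the execution was
        # aborted and its metadata removed. The message returns to the head.
        queue.appendleft(execution_id)
        return None
    return execution_id


# A stale message for an aborted execution sits ahead of two valid builds.
queue = deque(["aborted-1", "build-2", "build-3"])
metadata = {"build-2": {}, "build-3": {}}

# Every poll hits the same stale message, so nothing behind it ever starts.
started = [poll_once(queue, metadata) for _ in range(5)]
```

In this model every poll returns None and the queue contents never change, which mirrors how the valid builds behind the stale message stayed in the Queued state.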

Impact

  • New builds continued to run normally, but older queued executions did not start until the fix was applied.
  • The issue was isolated to a subset of executions for a single customer; other customers and modules did not experience service degradation.

Remediation

Immediate:

  • Increased concurrency limits for the affected customer to ensure no new builds were queued.
  • Deployed a hotfix that cleaned up stale messages and allowed queued builds to transition to Running.

Permanent:

  • Updated the locking flow to detect when execution metadata is missing and safely clean up stale messages.
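A rough sketch of what such a check might look like is shown here, using an in-memory deque in place of Redis. The function `poll_once_with_cleanup` and the `held_locks` and `dropped` parameters are hypothetical names introduced for illustration; the real locking flow is not described in this report. The sketch assumes that missing metadata marks a message as stale, while a lock held by another worker is ordinary contention that should still be retried.

```python
from collections import deque


def poll_once_with_cleanup(queue, metadata, held_locks, dropped):
    """One dequeue attempt that discards stale messages (hypothetical sketch).

    Returns the execution id that may start, or None if nothing started.
    """
    while queue:
        execution_id = queue.popleft()
        if execution_id not in metadata:
            # Stale message: the execution no longer exists, so delete the
            # message instead of putting it back at the head of the queue.
            dropped.append(execution_id)
            continue
        if execution_id in held_locks:
            # Genuine contention: keep the message and retry on a later poll.
            queue.appendleft(execution_id)
            return None
        return execution_id
    return None


queue = deque(["aborted-1", "build-2", "build-3"])
metadata = {"build-2": {}, "build-3": {}}
dropped = []

# Three polls: the stale message is dropped and both valid builds start.
started = [poll_once_with_cleanup(queue, metadata, set(), dropped) for _ in range(3)]
```

Separating the "metadata missing" case from the "lock busy" case is the key design choice in this sketch: only the first is safe to delete, since the second may still resolve on its own.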

Action Items

  1. Add fallback logic to safely delete stale queue messages when lock acquisition fails due to missing execution metadata.
  2. Add retries and improved cleanup handling when ACK operations fail.
  3. Enhance monitoring to detect queue buildup or repeated unack patterns earlier.
  4. Expand testing coverage for concurrency queue edge cases, including aborted pipelines and metadata cleanup races.
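
As one possible shape for the monitoring item above, the sketch below counts consecutive failed dequeue attempts per message and flags a message once it crosses a threshold. The class `RequeueMonitor`, its methods, and the threshold value are all hypothetical; a production system would more likely emit a metric to an existing alerting pipeline than keep counts in process memory.

```python
from collections import Counter


class RequeueMonitor:
    """Tracks repeated failed dequeue attempts per message (hypothetical)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = Counter()

    def record_failure(self, message_id):
        """Count a failed attempt; return True once the threshold is reached."""
        self.failures[message_id] += 1
        return self.failures[message_id] >= self.threshold

    def record_success(self, message_id):
        """Reset the count when the message is finally processed."""
        self.failures.pop(message_id, None)


monitor = RequeueMonitor(threshold=3)

# The same message fails four times in a row; the third failure trips the alert.
alerts = [monitor.record_failure("aborted-1") for _ in range(4)]
```
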
Posted Dec 04, 2025 - 14:08 PST

Resolved

This incident has been resolved.
Posted Dec 03, 2025 - 11:38 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 03, 2025 - 09:34 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Dec 03, 2025 - 08:48 PST

Investigating

We are currently investigating this issue.
Posted Dec 03, 2025 - 08:48 PST
This incident affected: Prod 1 (Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds).