Summary
Between 8:30 AM and 11:30 AM PST on December 3rd, 2025, customers in the Prod-2 environment saw some CI builds remain stuck in the Queued state despite being under their concurrency limits. New builds continued running, but previously queued executions did not transition to Running until a fix was deployed.
Root Cause
A recently introduced change in the concurrency queuing logic caused stale messages to remain in the Redis queue. These stale messages could not be processed because their corresponding executions had already been aborted and their metadata removed. The dequeue poller repeatedly attempted to acquire a lock for these messages and failed, placing them back at the head of the queue and preventing newer queued executions from being processed.
The missing fallback logic to clean up stale messages resulted in older builds remaining perpetually queued.
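The failure mode can be illustrated with a minimal in-memory sketch. The real system uses Redis; the queue, metadata store, and function names below are hypothetical stand-ins, but the loop shows why a single stale head-of-queue message starves everything behind it:

```python
from collections import deque

def poll_once(queue, metadata, running):
    """One dequeue-poller iteration (simplified model of the buggy flow)."""
    if not queue:
        return
    msg = queue.popleft()
    if msg not in metadata:
        # Execution was aborted and its metadata removed, so the lock
        # cannot be acquired. Buggy behavior: requeue at the head.
        queue.appendleft(msg)
        return
    running.append(msg)

queue = deque(["stale-build", "build-42", "build-43"])
metadata = {"build-42": {}, "build-43": {}}  # "stale-build" was aborted
running = []
for _ in range(10):
    poll_once(queue, metadata, running)
# "stale-build" keeps returning to the head, so nothing ever starts
```

Because the stale message is always placed back at the head rather than acknowledged or dropped, the poller makes no progress regardless of how many iterations run.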
Impact
- New builds continued to run normally, but older queued executions did not start until the fix was applied.
- The issue was isolated to a subset of executions for a single customer; other customers and modules did not experience service degradation.
Remediation
Immediate:
- Increased concurrency limits for the affected customer to ensure no new builds were queued.
- Deployed a hotfix that cleaned up stale messages and allowed queued builds to transition to Running.
Permanent:
- Updated the locking flow to detect when execution metadata is missing and safely clean up the stale message instead of requeuing it.
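Continuing the in-memory sketch from the root-cause section (same hypothetical stand-ins, not the actual implementation), the permanent fix amounts to discarding a message whose execution metadata no longer exists rather than returning it to the head of the queue:

```python
from collections import deque

def poll_once_fixed(queue, metadata, running):
    """Poller iteration with the fallback: stale messages are discarded."""
    if not queue:
        return
    msg = queue.popleft()
    if msg not in metadata:
        # Execution was already aborted and its metadata removed,
        # so the message can never be processed: drop it (ACK/cleanup).
        return
    running.append(msg)

queue = deque(["stale-build", "build-42", "build-43"])
metadata = {"build-42": {}, "build-43": {}}
running = []
for _ in range(3):
    poll_once_fixed(queue, metadata, running)
# the stale message is cleaned up and the remaining builds start
```

The key design point is that a missing-metadata lock failure is treated as a terminal condition for that message, not a transient one, so it can be removed safely without risking the loss of a live execution.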
Action Items
- Add fallback logic to safely delete stale queue messages when lock acquisition fails due to missing execution metadata.
- Add retries and improved cleanup handling when ACK operations fail.
- Enhance monitoring to detect queue buildup or repeated unack patterns earlier.
- Expand testing coverage for concurrency queue edge cases, including aborted pipelines and metadata cleanup races.
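For the retry item above, one possible shape is a bounded retry with exponential backoff around the ACK call, surfacing persistent failures for cleanup and alerting. This is a generic sketch, not the production code; `ack_fn` and the parameter names are hypothetical:

```python
import time

def ack_with_retry(ack_fn, msg_id, attempts=3, backoff_s=0.1):
    """Retry an ACK a few times with exponential backoff.

    Returns True on success; False after exhausting attempts, so the
    caller can route the message to cleanup/alerting instead of
    silently leaving it unacknowledged in the queue.
    """
    for attempt in range(attempts):
        try:
            ack_fn(msg_id)
            return True
        except Exception:
            if attempt == attempts - 1:
                return False
            time.sleep(backoff_s * (2 ** attempt))
```

Bounding the retries matters here: an unbounded retry loop on a permanently failing ACK would recreate the same head-of-queue starvation this incident exposed.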