Prod2 - Resource Constraint Issues

Incident Report for Harness

Postmortem

RCA: Prod2 - Resource Constraint Issues

Summary: 

Pipeline executions were queued for multiple customers with the message "Current execution is queued as another execution is running with a given resource key".

What was the issue?

Pipelines scheduled for execution were experiencing prolonged queuing delays. In certain cases, pipelines remained in the queued state long enough to eventually expire. This behavior impacted deployment pipelines as well as other pipelines incorporating a queue step, leading to execution delays and timeouts.

Resolution:

We found that a large number of resource restraint entries had been created during pipeline runs. This buildup caused a backlog, which slowed the processing of new pipelines. To mitigate the issue, we manually drained the queue. We also added capacity to handle similar load and reduce the risk of recurrence.

RCA

Harness pipelines use resource restraint instances to control the number of concurrent pipeline executions. During the incident, an unexpected spike in load created significantly more instances than usual. Because these instances are processed in the background at scheduled intervals, the surge outpaced processing, so pipelines stayed queued longer and executions slowed.
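
The backlog effect described above can be illustrated with a minimal sketch. It is a simplified model, not Harness's implementation: the function name `simulate_backlog`, the per-interval `batch_size` limit, and the arrival numbers are all assumptions chosen for illustration. It models a background job that drains a bounded number of instances per scheduled interval and shows how a single spike can leave a backlog that never clears while steady load equals capacity.

```python
from collections import deque

def simulate_backlog(arrivals_per_tick, batch_size):
    """Model a periodic background processor (hypothetical, for illustration).

    arrivals_per_tick: number of new instances created in each interval.
    batch_size: assumed maximum number of instances processed per interval.
    Returns (backlog size after each interval, longest wait in intervals).
    """
    pending = deque()
    backlog_history = []
    max_wait = 0
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Stamp each new instance with the interval in which it was created.
        pending.extend([tick] * arrivals)
        # The scheduled job drains at most batch_size instances per interval.
        for _ in range(min(batch_size, len(pending))):
            created = pending.popleft()
            max_wait = max(max_wait, tick - created)
        backlog_history.append(len(pending))
    return backlog_history, max_wait

# Steady load that matches capacity: nothing accumulates.
steady = simulate_backlog([5, 5, 5, 5, 5, 5], batch_size=5)

# One spike in the second interval: the backlog persists afterwards.
spike = simulate_backlog([5, 50, 5, 5, 5, 5], batch_size=5)
```

In this model the spike leaves 45 instances pending in every later interval, and the wait grows with each one. The backlog shrinks only if the pending work is drained out of band or the per-interval capacity is raised above the steady arrival rate, which mirrors the two mitigations described in the Resolution section.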

Action Items

  1. Harness is enhancing the internal management of resource locks to better support scaling and improve concurrency handling across pipelines.
  2. Monitoring will be strengthened to include alerts for delays in processing resource restraint instances, allowing quicker detection of and response to similar issues.
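
One way to express the processing-delay alert in the second action item is to track the age of the oldest unprocessed instance. The sketch below is an assumption about how such a check could look, not a description of Harness's monitoring: the function names, the input shape (a list of creation timestamps), and the five-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

def oldest_pending_age(created_times, now):
    """Return the age of the oldest unprocessed instance, or zero if none are pending."""
    if not created_times:
        return timedelta(0)
    return now - min(created_times)

def should_alert(created_times, now, threshold):
    """Fire when the oldest pending instance has waited longer than the threshold."""
    return oldest_pending_age(created_times, now) > threshold

# Hypothetical data: two pending instances, created 2 and 12 minutes ago.
now = datetime(2025, 5, 10, 17, 0, tzinfo=timezone.utc)
pending = [now - timedelta(minutes=2), now - timedelta(minutes=12)]
alert = should_alert(pending, now, threshold=timedelta(minutes=5))
```

Alerting on the oldest pending age rather than on queue length alone catches a stalled or overloaded processor even when the number of pending instances is small.
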
Posted May 16, 2025 - 10:29 PDT

Resolved

We have successfully resolved the issue.
Posted May 10, 2025 - 12:22 PDT

Monitoring

Pipelines are executing successfully, and we are continuing to monitor.
Posted May 10, 2025 - 12:13 PDT

Update

Mitigation efforts are still ongoing.
Posted May 10, 2025 - 11:47 PDT

Update

Mitigation progress is being made, though efforts are still ongoing at this time.
Posted May 10, 2025 - 11:11 PDT

Identified

A ResourceRestraintID lock is being held in a single customer's pipeline, causing other pipelines to be stuck. This issue is currently limited to a small number of customers, and we're working to mitigate it now.
Posted May 10, 2025 - 10:32 PDT

Investigating

We are currently investigating an issue with resource constraints in our Prod2 environment, which is causing stuck pipelines for some customers.
Posted May 10, 2025 - 09:44 PDT