Few customers experienced delays in warmup and cooldown operations for AutoStopping rules

Incident Report for Harness

Postmortem

Summary

During the incident, customers experienced delays in warmup and cooldown operations for AutoStopping rules

Root cause

On Sept 16 at 2:40 AM PDT, We had a surge in resource intensive jobs which resulted a drop in system throughput as it consumed capacity on the backend systems.

Immediate Resolution

To restore normal operations quickly, we scaled up the number of background workers , which successfully drained the queue and stabilised the system.

Preventive Action Items

We are implementing the following improvements to prevent recurrence:

Dedicated workers for high-priority queues
- Ensuring low-priority jobs cannot occupy the full worker capacity
Enhanced AutoScaling of workers
- Scaling will be based not only on CPU/memory but also on queue size and job latency. This will allow faster response to surges in job volume.
Server-side rate limiting
- Duplicate requests will be discarded, preventing the system from being flooded by redundant jobs.
Account-specific queues
- Isolating jobs at the account level, so that heavy activity from one account cannot degrade performance for others.

Posted Sep 18, 2025 - 13:53 PDT

Resolved

On Sept 16 at 2:40 AM PDT, Due to spike in resource intensive operations resulted in a queue buildup increasing contention on the worker queue and reducing the overall system throughput.

Posted Sep 16, 2025 - 04:30 PDT