During the incident, customers experienced delays in warmup and cooldown operations for AutoStopping rules.
On Sept 16 at 2:40 AM PDT, a surge in resource-intensive jobs consumed capacity on the backend systems, which resulted in a drop in system throughput.
To restore normal operations quickly, we scaled up the number of background workers, which drained the queue and stabilised the system.
We are implementing the following improvements to prevent recurrence:
Dedicated workers for high-priority queues
Enhanced AutoScaling of workers
Server-side rate limiting
Account-specific queues
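
As an illustration only, the account-specific queue idea can be sketched as a set of per-account queues served in round-robin order, so that a surge from one account cannot hold up jobs from the others. This is a minimal sketch under stated assumptions, not the production design: the class name AccountFairQueue, its methods, and the in-memory storage are all hypothetical.

```python
from collections import OrderedDict, deque


class AccountFairQueue:
    """Hypothetical per-account job queue with round-robin dispatch."""

    def __init__(self):
        # Maps account_id -> pending jobs; dict order is the rotation order.
        self._queues = OrderedDict()

    def enqueue(self, account_id, job):
        # A new account joins at the back of the rotation.
        self._queues.setdefault(account_id, deque()).append(job)

    def dequeue(self):
        """Return (account_id, job) for the account at the front of the
        rotation, or None when no jobs are pending."""
        if not self._queues:
            return None
        account_id, jobs = next(iter(self._queues.items()))
        job = jobs.popleft()
        if jobs:
            # Account still has work: send it to the back of the rotation.
            self._queues.move_to_end(account_id)
        else:
            # Account is drained: drop it from the rotation.
            del self._queues[account_id]
        return account_id, job
```

In this sketch, an account that submits many jobs at once only gets one job served per turn of the rotation, so a second account with a single job waits behind at most one job from the busy account rather than behind its whole backlog.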