Summary
On December 1–2, pipelines in the production environment experienced delays and intermittent stalls. A recent infrastructure update increased message-processing volume, and under sustained load the system’s underlying key-value datastore reached its memory limit. Once this occurred, essential message-processing operations began failing intermittently, which caused some pipeline steps and event-driven triggers to stop progressing.
Impact
- Some pipeline executions became stuck or significantly delayed.
- Event-driven triggers experienced intermittent failures, resulting in delayed or missed workflow initiations.
- Customers observed degraded pipeline performance during the incident window.
Root Cause
The incident was caused by memory in the caching system (the key-value datastore described above) being consumed faster than entries were cleaned up: temporary deduplication records were being created at a much higher rate than the cleanup process could remove them.
When the caching system reached its memory limit:
- Insert operations required for message deduplication began failing.
- Because their deduplication records could not be written, some messages were skipped to avoid duplicate processing, causing pipelines to stall.
This combination resulted in pipelines and triggers not progressing as expected.
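For illustration only, the following Python sketch reproduces this failure mode in miniature. It is not the production code: `BoundedStore`, `StoreFullError`, and `naive_consume` are hypothetical names, the store is an in-memory stand-in, and the assumption that a failed insert was handled the same way as a duplicate is our simplified reading of the behavior described above.

```python
class StoreFullError(Exception):
    """Raised when the hypothetical store cannot accept another entry."""


class BoundedStore:
    """In-memory stand-in for a key-value store with a hard entry limit."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}  # key -> expiry timestamp

    def insert_if_absent(self, key, ttl, now):
        """Return True if the key was newly inserted, False if already present."""
        if key in self.entries:
            return False
        if len(self.entries) >= self.max_entries:
            raise StoreFullError(key)
        self.entries[key] = now + ttl
        return True

    def cleanup(self, now, batch_size):
        """Remove up to batch_size expired entries; return how many were removed."""
        expired = [k for k, exp in self.entries.items() if exp <= now][:batch_size]
        for k in expired:
            del self.entries[k]
        return len(expired)


def naive_consume(store, message_id, now):
    """Fragile pattern (assumed): any insert failure is treated like a duplicate."""
    try:
        is_new = store.insert_if_absent(message_id, ttl=60, now=now)
    except StoreFullError:
        is_new = False  # the message is silently skipped
    return "processed" if is_new else "skipped"


# Five distinct messages arrive before any record expires; the store holds three.
store = BoundedStore(max_entries=3)
results = [naive_consume(store, f"msg-{i}", now=0) for i in range(5)]
```

In this sketch the first three messages are processed and the last two are skipped even though they are not duplicates, and `cleanup` cannot help until the records expire.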
Next Steps / Remediation
Immediate Remediation
- Increased capacity for the caching system to restore normal operations.
- Increased the cleanup rate for temporary deduplication records to prevent memory buildup.
- Added defensive handling so consumers continue running even when invalid responses are returned by the datastore.
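The defensive handling in the last item can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the deployed change: `check_dedup`, `run_consumer`, and the `store_call` callback are hypothetical, and we assume the datastore check is expected to return exactly `True` (new message) or `False` (duplicate).

```python
def check_dedup(store_call, message_id):
    """Classify a dedup check as 'new', 'duplicate', or 'retry'."""
    try:
        response = store_call(message_id)
    except Exception:
        # Broad catch for the sketch only: any datastore error means "retry later".
        return "retry"
    if response is True:
        return "new"
    if response is False:
        return "duplicate"
    return "retry"  # unexpected or invalid response from the datastore


def run_consumer(messages, store_call, handler):
    """Process messages without crashing; return the ones deferred for retry."""
    deferred = []
    for message_id in messages:
        outcome = check_dedup(store_call, message_id)
        if outcome == "new":
            handler(message_id)
        elif outcome == "retry":
            deferred.append(message_id)
        # Confirmed duplicates are dropped intentionally.
    return deferred


def flaky_store(message_id):
    """Hypothetical datastore that misbehaves for some messages."""
    if message_id == "c":
        raise ConnectionError("datastore unavailable")
    return {"a": True, "b": None, "d": False}[message_id]


handled = []
deferred = run_consumer(["a", "b", "c", "d"], flaky_store, handled.append)
```

The design point is that an invalid response or an error is kept distinct from a confirmed duplicate, so the consumer keeps running and the affected messages remain available for a later attempt instead of being dropped.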
Permanent Remediation
To prevent this from happening again, we are:
- Improving resilience of message-processing components to prevent stalls if the underlying datastore becomes slow or unstable.
- Optimizing the architecture to reduce reliance on temporary entries in the key-value store for high-volume operations.
To be proactive and react faster, we are:
- Implementing enhanced monitoring to detect stalled consumers or unusual growth patterns in the caching layer.
- Adding alerting for memory usage and processing latency in the caching subsystem.
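As a rough sketch of what a memory alert on the caching layer could evaluate, the Python function below flags both high current usage and unusually fast growth. The function name, the alert labels, the 80% threshold, and the one-hour horizon are illustrative assumptions, not the values or tooling we have adopted.

```python
def evaluate_memory(samples, limit_bytes, usage_threshold=0.8, horizon_seconds=3600):
    """Return alert labels for a series of memory samples.

    samples: non-empty list of (timestamp_seconds, used_bytes), oldest first.
    """
    alerts = []
    first_time, first_used = samples[0]
    latest_time, latest_used = samples[-1]

    # Static check: current usage as a fraction of the memory limit.
    if latest_used / limit_bytes >= usage_threshold:
        alerts.append("memory_usage_high")

    # Growth check: linear projection of when the limit would be reached.
    elapsed = latest_time - first_time
    if elapsed > 0:
        growth_per_second = (latest_used - first_used) / elapsed
        if growth_per_second > 0:
            seconds_to_limit = (limit_bytes - latest_used) / growth_per_second
            if seconds_to_limit <= horizon_seconds:
                alerts.append("memory_exhaustion_projected")
    return alerts


# Usage is only 40% of the limit, but at this growth rate the limit is 20 minutes away.
growing = evaluate_memory([(0, 100), (600, 400)], limit_bytes=1000)
```

A growth-based check of this kind is meant to fire while usage is still well below the limit, which is the window in which cleanup rates or capacity can still be adjusted.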