Webhooks are degraded in prod1

Incident Report for Harness

Postmortem

Summary

On December 1–2, pipelines in the production environment experienced delays and intermittent stalls. A recent infrastructure update increased message-processing volume, and under sustained load the system’s underlying key-value datastore reached its memory limit. Once this occurred, essential message-processing operations began failing intermittently, which caused some pipeline steps and event-driven triggers to stop progressing.

Impact

Some pipeline executions became stuck or significantly delayed.
Event-driven triggers experienced intermittent failures, resulting in delayed or missed workflow initiations.
Customers observed degraded pipeline performance during the incident window.

Root Cause

The incident was caused by memory in the caching system (the key-value datastore described in the Summary) being consumed faster than entries were cleaned up: temporary deduplication records were being created at a much higher rate than the cleanup process could remove them.

When the caching system reached its memory limit:

Insert operations required for message deduplication began failing.
When a deduplication record could not be written, some messages were skipped to avoid duplicate processing, causing pipelines to stall.

This combination resulted in pipelines and triggers not progressing as expected.
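
The failure mode above can be illustrated with a minimal in-memory sketch. It is not the production implementation: `BoundedTTLStore`, `StoreFullError`, `process`, and every parameter are hypothetical names chosen for illustration, and the real datastore and its limits are not modeled. The sketch assumes a store with a fixed entry limit, deduplication records that expire after a time-to-live, and a cleanup pass that removes only a limited batch of expired records per run.

```python
import time


class StoreFullError(Exception):
    """Raised when the hypothetical store has no room for a new entry."""


class BoundedTTLStore:
    """Toy key-value store with an entry limit and per-key expiry (illustrative only)."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expiry = {}  # key -> expiry timestamp

    def cleanup(self, batch_size):
        """Remove up to batch_size expired entries and return how many were removed."""
        now = self.clock()
        expired = [k for k, exp in self._expiry.items() if exp <= now][:batch_size]
        for key in expired:
            del self._expiry[key]
        return len(expired)

    def insert_if_absent(self, key):
        """Return True if key was newly recorded, False if a live record already exists."""
        now = self.clock()
        exp = self._expiry.get(key)
        if exp is not None and exp > now:
            return False  # live deduplication record: this message is a duplicate
        if exp is None and len(self._expiry) >= self.max_entries:
            raise StoreFullError(key)  # memory limit reached, insert fails
        self._expiry[key] = now + self.ttl_seconds
        return True


def process(store, message_id, handler):
    """Handle a message once; skip it when the deduplication insert fails."""
    try:
        first_time = store.insert_if_absent(message_id)
    except StoreFullError:
        # Assumed behavior matching the incident description: without a
        # deduplication record the message is skipped rather than risk
        # processing it twice, so downstream work never starts.
        return "skipped"
    if not first_time:
        return "duplicate"
    handler(message_id)
    return "processed"
```

In this sketch, if new message IDs arrive faster than `cleanup` removes expired records, the store fills up and every subsequent new message is reported as "skipped" until capacity is raised or cleanup catches up.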

Next Steps / Remediation

Immediate Remediation

Increased capacity for the caching system to restore normal operations.
Increased the cleanup rate for temporary deduplication records to prevent memory buildup.
Added defensive handling so consumers continue running even when invalid responses are returned by the datastore.
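
As an illustration of the defensive handling described above, the following sketch shows one way a consumer loop could keep running when the datastore returns an invalid response. It is a simplified example under stated assumptions, not the code that was deployed: `consume`, `claim`, `handle`, and `max_attempts` are hypothetical names, and a valid datastore reply is assumed to be a boolean (True for a first-time message, False for a duplicate).

```python
def consume(messages, claim, handle, max_attempts=3):
    """Process messages without letting a bad datastore reply stop the loop.

    claim(msg) is a hypothetical deduplication call expected to return
    True (first time seen) or False (duplicate). Anything else, including
    a raised exception, is treated as an invalid response.
    Returns the messages that could not be claimed, for later retry.
    """
    deferred = []
    for msg in messages:
        for _attempt in range(max_attempts):
            try:
                reply = claim(msg)
            except Exception:
                # Broad catch is deliberate in this sketch: any datastore
                # error is treated the same as an invalid reply.
                reply = None
            if isinstance(reply, bool):
                if reply:
                    handle(msg)
                break  # valid reply received, stop retrying this message
        else:
            # Every attempt returned an invalid reply: set the message
            # aside instead of dropping it or crashing the consumer.
            deferred.append(msg)
    return deferred
```

The design choice shown here is to defer messages that cannot be claimed so they can be retried later, rather than skipping them permanently or letting the consumer exit.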

Permanent Remediation

To prevent this from happening again, we are:
- Improving the resilience of message-processing components to prevent stalls if the underlying datastore becomes slow or unstable.
- Optimizing the architecture to reduce reliance on temporary entries in the key-value store for high-volume operations.
To be proactive and react faster, we are:
- Implementing enhanced monitoring to detect stalled consumers or unusual growth patterns in the caching layer.
- Adding alerting for memory usage and processing latency in the caching subsystem.
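
A memory alert of the kind listed above could, for example, combine a usage threshold with a simple growth projection. The sketch below is only an illustration: `check_memory`, the 80% threshold, and the one-hour horizon are assumptions made for the example and do not describe the actual alert rules or tooling.

```python
def check_memory(samples, limit_bytes, usage_threshold=0.8, horizon_seconds=3600):
    """Return a list of alert names for a series of memory samples.

    samples is a list of (timestamp_seconds, used_bytes) pairs, oldest first.
    "high_usage" fires when the latest sample is at or above the threshold.
    "fast_growth" fires when a straight-line projection of the growth between
    the first and last sample reaches the limit within horizon_seconds.
    """
    t0, u0 = samples[0]
    t1, u1 = samples[-1]
    alerts = []
    if u1 / limit_bytes >= usage_threshold:
        alerts.append("high_usage")
    # Average growth rate in bytes per second over the sampled window.
    rate = (u1 - u0) / (t1 - t0) if t1 > t0 else 0.0
    if rate > 0 and (limit_bytes - u1) / rate <= horizon_seconds:
        alerts.append("fast_growth")
    return alerts
```

Projecting growth as well as checking absolute usage is intended to give warning while memory is still well below its limit, which is the situation where creation outpaces cleanup.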

Posted Dec 09, 2025 - 22:44 PST

Resolved

This incident has been resolved.

Posted Dec 03, 2025 - 12:14 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Dec 02, 2025 - 20:12 PST

Investigating

We are currently investigating this issue.

Posted Dec 02, 2025 - 20:01 PST

This incident affected: Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA)).