Pipelines custom webhook executions are delayed

Incident Report for Harness

Postmortem

Summary

Pipeline executions started by custom webhook triggers were delayed after a surge of incoming trigger events created a processing backlog. Only executions started via custom webhook triggers were affected. Executions started via the API and UI were not impacted.

What was the issue?

Harness received a surge of custom webhook events for trigger processing. These triggers were executing Git-backed pipelines that were taking longer than usual to resolve, which created back pressure on trigger processing and delayed pipeline executions. This happened because only a limited number of resources are available for processing custom webhook triggers.
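
The back pressure described above can be illustrated with a rough capacity model. This is a sketch under assumed numbers, not a description of the actual Harness trigger processor: the function names, thread counts, arrival rate, and resolution times are all hypothetical. It shows that when each trigger takes longer to resolve, a fixed pool of threads finishes fewer triggers per second, and any arrival rate above that capacity accumulates as backlog.

```python
def capacity_per_second(threads: int, resolve_seconds: float) -> float:
    """Triggers a fixed pool can finish per second if each one holds a thread for resolve_seconds."""
    return threads / resolve_seconds


def backlog_after(arrivals_per_second: float, threads: int,
                  resolve_seconds: float, duration_seconds: float) -> float:
    """Queued triggers after duration_seconds, starting from an empty queue."""
    surplus = arrivals_per_second - capacity_per_second(threads, resolve_seconds)
    return max(0.0, surplus * duration_seconds)


def drain_seconds(backlog: float, arrivals_per_second: float,
                  threads: int, resolve_seconds: float) -> float:
    """Time to clear an existing backlog; infinite if capacity does not exceed arrivals."""
    spare = capacity_per_second(threads, resolve_seconds) - arrivals_per_second
    return backlog / spare if spare > 0 else float("inf")


# All numbers below are made up for illustration only.
normal = backlog_after(50, threads=10, resolve_seconds=0.1, duration_seconds=60)
surge = backlog_after(50, threads=10, resolve_seconds=0.5, duration_seconds=60)
recovery = drain_seconds(surge, 50, threads=40, resolve_seconds=0.5)
print(normal, surge, recovery)
```

With these assumed inputs, slower pipeline resolution alone turns an empty queue into a backlog of 1800 triggers within a minute, and the backlog only drains once capacity is raised above the arrival rate.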

Resolution

We increased the resources on our systems to handle the surge, which brought the system back to normal.

Timeline

Time (UTC)          Event
Dec 20th 05:15pm    Identified that the system was experiencing delays in processing triggers.
Dec 20th 05:50pm    Identified the issue causing the delays.
Dec 20th 06:10pm    Increased the available resources for processing triggers.
Dec 20th 07:25pm    Incident was identified as resolved.

RCA

The resources allocated at the time were unable to keep up with the large number of custom webhooks, which delayed their processing and, in turn, delayed pipeline executions. As a result, we had to allocate additional resources.

Action Items

  1. We have increased the number of threads that are assigned to process the custom webhooks.
  2. We will enhance the business logic to decouple pipeline resolution from the custom webhook trigger processing flow.
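
As a minimal sketch of what decoupling pipeline resolution from webhook intake could look like, the example below accepts a trigger on a fast path and hands the slow resolution step to a separate pool of worker threads. It is an illustration under assumptions, not the Harness implementation: `TriggerProcessor`, `resolve_pipeline`, the thread count, and the simulated delay are all hypothetical names and values.

```python
import queue
import threading
import time


def resolve_pipeline(trigger_id: str) -> str:
    """Hypothetical stand-in for the slow step, e.g. fetching a Git-backed pipeline definition."""
    time.sleep(0.01)  # simulated resolution latency
    return f"resolved:{trigger_id}"


class TriggerProcessor:
    """Accepts webhook triggers quickly and resolves pipelines on separate threads."""

    def __init__(self, resolver_threads: int = 4) -> None:
        self._pending = queue.Queue()
        self._lock = threading.Lock()
        self.results = []
        for _ in range(resolver_threads):
            threading.Thread(target=self._resolve_loop, daemon=True).start()

    def accept(self, trigger_id: str) -> None:
        """Fast path: enqueue the trigger and return without resolving the pipeline."""
        self._pending.put(trigger_id)

    def _resolve_loop(self) -> None:
        while True:
            trigger_id = self._pending.get()
            try:
                resolved = resolve_pipeline(trigger_id)
                with self._lock:
                    self.results.append(resolved)
            finally:
                # Mark the item done even if resolution raised, so join() cannot hang.
                self._pending.task_done()

    def wait_until_idle(self) -> None:
        """Block until every accepted trigger has been processed."""
        self._pending.join()


processor = TriggerProcessor(resolver_threads=4)
for i in range(20):
    processor.accept(f"trigger-{i}")  # returns immediately, even while resolution is slow
processor.wait_until_idle()
print(len(processor.results))
```

In this arrangement, slow resolution lengthens the internal queue but does not block the code path that receives webhooks, and the number of resolver threads can be tuned independently of intake.
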
Posted Jan 02, 2025 - 17:52 PST

Resolved

This incident has been resolved. We regret the inconvenience and will be providing an RCA for review.
Posted Dec 20, 2024 - 11:48 PST

Monitoring

We have mitigated the issue and are continuing to monitor the iterator queue. The iterator queue and the webhook queue will gradually clear.
Posted Dec 20, 2024 - 11:45 PST

Update

We are continuing to make progress and have partially mitigated the issue.
Posted Dec 20, 2024 - 11:39 PST

Update

We are continuing to work on a fix for this issue. Thank you for your patience!
Posted Dec 20, 2024 - 10:59 PST

Identified

We have identified a potential cause of the service issues and are working hard to address it. Please continue to monitor this page for updates.
Posted Dec 20, 2024 - 10:28 PST

Update

We are actively investigating the issue. Thank you for your patience!
Posted Dec 20, 2024 - 09:59 PST

Investigating

We are currently investigating the issue.
Posted Dec 20, 2024 - 09:15 PST
This incident affected: Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).