Pipeline Steps Timing out for a subset of customers in Prod2
Incident Report for Harness
Postmortem

Summary: 

Pipeline executions on Prod2 were failing with a timeout error. This affected ~3% of pipeline executions.

What was the issue?

Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up each task for execution. If no delegate acquires a task within the stipulated time, the task is rebroadcast. During this incident, the rebroadcast functionality was affected, so tasks that needed a rebroadcast were never picked up and the corresponding pipeline executions timed out.
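The broadcast and rebroadcast behavior described above can be sketched roughly as follows. This is a minimal illustration only, not the actual Harness implementation: the `Task` fields, the `rebroadcast_due` function, and the timeout value are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """Hypothetical view of a task waiting to be acquired by a delegate."""
    task_id: str
    broadcast_at: float                 # time of the most recent broadcast
    acquired_by: Optional[str] = None   # delegate id once a delegate picks it up
    broadcast_count: int = 1


def rebroadcast_due(tasks: List[Task], now: float, acquire_timeout_s: float) -> List[str]:
    """Rebroadcast every unacquired task whose acquire window has elapsed.

    Returns the ids of the tasks that were rebroadcast in this pass.
    """
    rebroadcast_ids = []
    for task in tasks:
        unacquired = task.acquired_by is None
        window_elapsed = now - task.broadcast_at >= acquire_timeout_s
        if unacquired and window_elapsed:
            task.broadcast_at = now      # start a new acquire window
            task.broadcast_count += 1
            rebroadcast_ids.append(task.task_id)
    return rebroadcast_ids
```

If a pass like this stops running, unacquired tasks simply wait until the pipeline step hits its own timeout, which matches the symptom seen during the incident.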

Resolution: 

We rolled back the service to resolve the issue.

RCA

An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. The rebroadcast threads went into an error state because of these deserialization errors, which caused pipelines that required task rebroadcasts to fail. The system recovered once the service was rolled back.
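To illustrate this failure mode, the sketch below contrasts a pass that aborts on the first undeserializable task with one that isolates the error per task. It assumes JSON payloads purely for the example; the real serialization format is not stated in this report, and `decode_task`, `fragile_pass`, and `resilient_pass` are hypothetical names.

```python
import json
import logging
from typing import List, Tuple

logger = logging.getLogger("rebroadcast")


def decode_task(raw: bytes) -> dict:
    """Decode one task payload; raises ValueError if it cannot be read."""
    payload = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if "task_type" not in payload:
        raise ValueError("payload is missing 'task_type'")
    return payload


def fragile_pass(raw_tasks: List[bytes]) -> List[str]:
    """One bad payload raises and aborts the pass for every remaining task."""
    return [decode_task(raw)["task_id"] for raw in raw_tasks]


def resilient_pass(raw_tasks: List[bytes]) -> Tuple[List[str], int]:
    """Skip and count bad payloads so the remaining tasks are still processed."""
    decoded_ids, failures = [], 0
    for raw in raw_tasks:
        try:
            decoded_ids.append(decode_task(raw)["task_id"])
        except (ValueError, KeyError) as exc:
            failures += 1
            logger.warning("skipping undecodable task: %s", exc)
    return decoded_ids, failures
```

With per-task isolation, only the incompatible task types are affected, and the failure count gives monitoring something concrete to alert on.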

Action Items

  1. Added a critical alert for rebroadcast events.
  2. Made the rebroadcast logic resilient to task deserialization errors.
  3. Added a unit test to catch incompatible contract changes for task data.
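
A contract test along the lines of the third action item might look like the sketch below. It is only an assumption about the general approach: payloads recorded from the previously released contract are replayed against the current task data class, and any payload that no longer loads is reported. The class names, fields, and `GOLDEN_V1_PAYLOADS` are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class TaskDataV2:
    """Hypothetical current contract; the new field is optional via a default."""
    task_id: str
    task_type: str
    timeout_s: int = 600


@dataclass
class BreakingTaskData:
    """Hypothetical incompatible change: a new required field with no default."""
    task_id: str
    task_type: str
    region: str


def from_dict(cls, payload: dict):
    """Build an instance from a payload, ignoring keys the class does not know."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in payload.items() if k in known})


# Payloads as the previously released version would have written them.
GOLDEN_V1_PAYLOADS = [
    {"task_id": "t1", "task_type": "GIT_CLONE"},
    {"task_id": "t2", "task_type": "SHELL", "legacy_flag": True},
]


def incompatibilities(cls, samples: List[dict]) -> List[str]:
    """Return one error message per old payload the class can no longer load."""
    errors = []
    for sample in samples:
        try:
            from_dict(cls, sample)
        except TypeError as exc:  # raised when a required field is missing
            errors.append(f"{sample['task_id']}: {exc}")
    return errors
```

Running such a check in CI for each task data class would flag a change like `BreakingTaskData` before it reaches production.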
Posted Oct 30, 2024 - 21:30 PDT

Resolved
The incident has been resolved. We will share an RCA, including monitoring improvements and other follow-up steps.
Posted Oct 14, 2024 - 10:25 PDT
Monitoring
The issue has been fixed and we are monitoring the system.
Posted Oct 14, 2024 - 10:08 PDT
Identified
The issue has been identified and we are still working on a fix.
Posted Oct 14, 2024 - 09:01 PDT
Investigating
We are currently investigating an issue where the clone codebase step is failing for a subset of customers in Prod2.
Posted Oct 14, 2024 - 08:08 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).