Pipeline executions on Prod2 were failing with timeout errors. Approximately 3% of pipeline executions were affected.
Tasks are execution units that run on a delegate as part of a pipeline execution. As a pipeline runs, its tasks are broadcast to delegates, and one eligible delegate picks up each task for execution. If no delegate acquires a task within the stipulated time, the task is rebroadcast. During this incident, the rebroadcast functionality was affected, causing pipeline executions to time out.
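The broadcast and rebroadcast behavior described above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the `Task` fields, the `acquire` and `rebroadcast_stale` helpers, and the 30-second `ACQUIRE_TIMEOUT_S` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical acquire window; the real "stipulated time" is not stated here.
ACQUIRE_TIMEOUT_S = 30.0


@dataclass
class Task:
    task_id: str
    broadcast_at: float                # time of the most recent broadcast, in seconds
    acquired_by: Optional[str] = None  # delegate id once a delegate picks the task up
    broadcast_count: int = 1


def acquire(task: Task, delegate_id: str) -> bool:
    """The first eligible delegate to call this wins the task."""
    if task.acquired_by is None:
        task.acquired_by = delegate_id
        return True
    return False


def rebroadcast_stale(tasks: List[Task], now: float) -> List[str]:
    """Rebroadcast every unacquired task whose acquire window has elapsed."""
    rebroadcast = []
    for task in tasks:
        if task.acquired_by is None and now - task.broadcast_at >= ACQUIRE_TIMEOUT_S:
            task.broadcast_at = now
            task.broadcast_count += 1
            rebroadcast.append(task.task_id)
    return rebroadcast
```

In this sketch, a task that no delegate acquires depends entirely on `rebroadcast_stale` running again, so a stalled rebroadcast pass leaves such tasks waiting until the pipeline times out.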
We rolled back the service to resolve the issue.
An incompatible change was rolled out in one of our microservices, causing deserialization failures for a subset of task types. These deserialization errors put the rebroadcast threads into an error state, so pipelines that required task rebroadcasts failed. The system recovered once the service was rolled back.
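To illustrate this failure mode, the sketch below shows how a single undecodable task can abort an entire rebroadcast pass, and how handling the error per task would let the remaining tasks proceed. It is a simplified assumption-based example, not the service's code or the remediation that was applied (the remediation was the rollback): the JSON payload format, `KNOWN_TASK_TYPES`, and all function names are hypothetical.

```python
import json
from typing import List, Tuple

# Hypothetical set of task types the running service version can decode.
KNOWN_TASK_TYPES = {"SHELL", "HTTP"}


def decode_task(raw: str) -> dict:
    """Deserialize a task payload, rejecting task types this version does not know."""
    task = json.loads(raw)
    if task["type"] not in KNOWN_TASK_TYPES:
        raise ValueError(f"cannot deserialize task type {task['type']!r}")
    return task


def rebroadcast_all(raw_tasks: List[str]) -> List[str]:
    """Unguarded pass: the first undecodable task aborts the whole pass."""
    return [decode_task(raw)["id"] for raw in raw_tasks]


def rebroadcast_isolated(raw_tasks: List[str]) -> Tuple[List[str], List[str]]:
    """Guarded pass: a decoding failure is recorded and the loop continues."""
    sent, failed = [], []
    for raw in raw_tasks:
        try:
            sent.append(decode_task(raw)["id"])
        except (ValueError, KeyError) as exc:
            failed.append(str(exc))
    return sent, failed
```

With one unknown task type in the batch, `rebroadcast_all` raises before reaching the later tasks, whereas `rebroadcast_isolated` still returns the ids of the tasks that decoded and reports the failure separately.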
Action Item