Pipeline Execution Slowdowns — A Subset of Customers May Notice Delayed Starts

Incident Report for Harness

Postmortem

Summary

Between November 26th and 28th, 2025, customers using Harness Pipelines in Prod2 and Prod3 experienced degraded performance, including longer queue times, delays between pipeline stages, and some execution failures.

Affected Time Periods:

  • Prod2: November 26th (10:59 AM - 1:36 PM PST) and November 28th (2:30 AM - 11:40 AM PST)
  • Prod3: November 28th (2:30 AM - 3:14 AM PST)

Root Cause

During pipeline execution, Harness publishes execution data to our dashboards service for visualization. On November 26th, we deployed an update that introduced an incompatible change to the Avro serialization format used for this data. Instead of surfacing an error, the affected publish calls failed silently, each one holding its worker thread until the operation timed out.
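
For illustration, the sketch below shows the kind of change that breaks Avro backward compatibility and how it can be caught programmatically with Apache Avro's SchemaCompatibility utility. The schema and field names (PipelineExecutionEvent, pipelineId, status) are hypothetical placeholders, not our actual event format:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaCompatibility;
    import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

    public class AvroCompatibilityCheck {
        // Hypothetical reader schema: what the dashboards consumer already expects.
        private static final String READER_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"PipelineExecutionEvent\",\"fields\":["
          + "{\"name\":\"pipelineId\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\"}]}";

        // Hypothetical writer schema after a deployment: "status" was renamed,
        // and the reader has no default for the field it can no longer find.
        private static final String WRITER_SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"PipelineExecutionEvent\",\"fields\":["
          + "{\"name\":\"pipelineId\",\"type\":\"string\"},"
          + "{\"name\":\"executionStatus\",\"type\":\"string\"}]}";

        public static void main(String[] args) {
            Schema reader = new Schema.Parser().parse(READER_SCHEMA_JSON);
            Schema writer = new Schema.Parser().parse(WRITER_SCHEMA_JSON);

            SchemaCompatibility.SchemaPairCompatibility result =
                SchemaCompatibility.checkReaderWriterCompatibility(reader, writer);

            // INCOMPATIBLE means existing consumers cannot decode the new events;
            // a pre-deployment gate would fail the build here instead of shipping.
            if (result.getType() == SchemaCompatibilityType.INCOMPATIBLE) {
                throw new IllegalStateException(
                    "Avro change is not backward compatible: " + result.getDescription());
            }
        }
    }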

As these timed-out publish operations accumulated across hundreds of executing pipelines, they consumed the available worker threads in the pipeline execution service. Once the thread pool was exhausted, new pipeline requests had to wait for threads to become available, resulting in the performance degradation observed across the platform.
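
The sketch below is a simplified model of that failure mode, not our actual executor configuration; the pool size of 4 and the 30-second timeout are assumed values chosen only to make the starvation visible:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class ThreadPoolExhaustionDemo {
        public static void main(String[] args) throws InterruptedException {
            // Stand-in for the shared worker pool of the pipeline execution service.
            ExecutorService workers = Executors.newFixedThreadPool(4);

            // Each task represents a pipeline step that also publishes dashboard
            // data. If the publish hangs until a long timeout instead of failing
            // fast, the worker thread is held for the full timeout.
            for (int i = 0; i < 100; i++) {
                final int pipeline = i;
                workers.submit(() -> {
                    try {
                        TimeUnit.SECONDS.sleep(30); // simulated publish blocked until timeout
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println("pipeline " + pipeline + " finally progressed");
                });
            }

            // With 4 workers and 100 blocked tasks, later submissions sit in the
            // queue for minutes: the delayed starts observed during the incident.
            workers.shutdown();
            workers.awaitTermination(15, TimeUnit.MINUTES);
        }
    }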

Customer Impact

Customers experienced the following during this incident:

  • Longer wait times for pipelines to start
  • Noticeable delays between stages and steps during execution
  • In some cases, pipelines failed or were aborted mid-execution
  • A small number of pipelines became stuck in a running state without releasing resources

Resolution

We identified and fixed the incompatible serialization change, then deployed the corrected version to Prod2 and Prod3. This resolved the issue, and pipeline executions returned to normal performance levels.

Prevention and Improvements

To prevent similar issues in the future, we are taking the following actions:

  • Expanding our automated testing to detect backward compatibility issues before deployment
  • Adding enhanced monitoring for the event publishing system to catch failures earlier
  • Implementing better backpressure handling to prevent thread pool exhaustion (a generic pattern is sketched below)
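
As a rough sketch of the backpressure item above, one common pattern is to run best-effort publishing on its own bounded pool and shed load when the queue fills, so a slow or failing downstream can never tie up pipeline execution threads. The class name and pool sizes below are hypothetical and illustrate the pattern only, not the exact change we are implementing:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedDashboardPublisher {
        // Dedicated, bounded pool for dashboard publishing, kept separate from the
        // pipeline execution workers so a slow downstream cannot exhaust them.
        private final ThreadPoolExecutor publisher = new ThreadPoolExecutor(
            2, 8,                                           // hypothetical core/max sizes
            60, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(1_000),                // bounded queue caps in-flight work
            new ThreadPoolExecutor.DiscardOldestPolicy());  // shed load rather than block callers

        public void publishAsync(Runnable publishTask) {
            // Never blocks the calling pipeline thread; when the queue is full the
            // oldest pending dashboard update is dropped (dashboards are best-effort).
            publisher.execute(publishTask);
        }
    }
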
Posted Dec 04, 2025 - 11:27 PST

Resolved

This incident has been resolved.
Posted Nov 26, 2025 - 13:31 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 26, 2025 - 12:44 PST

Update

We are continuing to work on a fix for this issue.
Posted Nov 26, 2025 - 11:32 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 26, 2025 - 11:18 PST

Investigating

We are currently investigating this issue.
Posted Nov 26, 2025 - 11:12 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM)).