Hosted CI Builds Failing Intermittently

Incident Report for Harness

Postmortem

Summary

Between May 27 and June 1, 2026, some Harness CI customers experienced pipeline execution failures with the error:

failed to call LE.RetryStartStep: context deadline exceeded

Impact

Affected customers saw intermittent CI pipeline failures during step execution.

Existing pipeline definitions, customer data, source code, and artifacts were not impacted.

Root Cause

The root cause was a deadlock in the Light Engine logging path. When the log service returned an error, the Light Engine log writer attempted to reacquire a mutex it already held. This caused the Light Engine process to freeze, which led to step execution timeouts and pipeline failures.  
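
The pattern described above can be illustrated with a minimal Python sketch. This is not Light Engine code: the `LogWriter` class, its `write` and `_handle_error` methods, and the `failing_send` function are hypothetical names chosen for illustration. The sketch uses a non-reentrant `threading.Lock` and a non-blocking acquire on the error path, so it reports the problem instead of hanging. A blocking acquire at that point would never return.

```python
import threading


class LogWriter:
    """Hypothetical log writer that mimics the faulty locking pattern."""

    def __init__(self, send):
        self._lock = threading.Lock()  # non-reentrant lock
        self._send = send              # callable that uploads one log line
        self.would_deadlock = False

    def write(self, line):
        with self._lock:               # lock acquired here
            try:
                self._send(line)
            except OSError:
                self._handle_error()   # called while the lock is still held

    def _handle_error(self):
        # The faulty version would call self._lock.acquire() here and block
        # forever. A non-blocking probe is used so the sketch terminates.
        if self._lock.acquire(blocking=False):
            self._lock.release()
        else:
            self.would_deadlock = True


def failing_send(line):
    # Stand-in for a log service call that returns an error.
    raise OSError("log service unavailable")


writer = LogWriter(failing_send)
writer.write("step started")
print(writer.would_deadlock)  # True: the error path tried to retake a held lock
```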

Mitigation and Resolution

Harness Engineering took multiple mitigation steps during the incident, including:

  • Rolling back affected runner versions where needed
  • Increasing relevant timeout configurations
  • Reducing log-service load and latency
  • Temporarily disabling the affected livelog streaming path
  • Migrating selected workloads across regions and infrastructure providers
  • Pinning a fixed Light Engine version through runner configuration

The final fix addressed the mutex deadlock in the Light Engine log writer and prevented the same lock from being reacquired while already held.  
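
One common way to prevent this kind of reacquisition is to take the lock only in the public method and have private helpers assume it is already held. The Python sketch below follows that convention. It is an assumption about the general shape of such a fix, not the actual Light Engine change, and every name in it (`SafeLogWriter`, `_record_failure_locked`, `failing_send`) is hypothetical.

```python
import threading


class SafeLogWriter:
    """Hypothetical writer whose error path never retakes the held lock."""

    def __init__(self, send):
        self._lock = threading.Lock()
        self._send = send
        self.dropped = []              # lines that could not be uploaded

    def write(self, line):
        with self._lock:               # the only place the lock is acquired
            try:
                self._send(line)
            except OSError:
                self._record_failure_locked(line)

    def _record_failure_locked(self, line):
        # Naming convention: "_locked" helpers require the caller to hold
        # self._lock and must not acquire it themselves.
        self.dropped.append(line)


def failing_send(line):
    # Stand-in for a log service call that returns an error.
    raise OSError("log service unavailable")


writer = SafeLogWriter(failing_send)
writer.write("step started")
writer.write("step finished")          # succeeds only if the lock was released
print(writer.dropped)
```

A reentrant lock such as `threading.RLock` would also avoid the hang, though it can hide nested acquisition that was never intended.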

Prevention and Follow-Up Actions

Harness is taking the following actions to reduce recurrence risk:

  • Improve deadlock detection in critical concurrent code paths
  • Strengthen error handling for log-service interactions
  • Add better monitoring for Light Engine process health
  • Improve safeguards around logging-path failures
  • Continue reviewing runner rollout and validation processes
Posted Jun 04, 2026 - 08:32 PDT

Resolved

This incident has been resolved.
Posted May 27, 2026 - 18:48 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 27, 2026 - 16:53 PDT

Update

Harness is currently implementing a failover to mitigate the intermittent issue for some customers.
Posted May 27, 2026 - 10:54 PDT

Update

Harness has implemented a change and is seeing fewer failures, but is continuing to work on completing mitigation. Customers can expect executions to succeed more frequently.

Prod1/Prod2 are back to normal.
Posted May 27, 2026 - 09:32 PDT

Update

Harness is continuing to investigate and implement changes to fully restore functionality. At this time, some CI builds are still failing intermittently.
Posted May 27, 2026 - 08:20 PDT

Update

We are continuing to work on a fix for this issue.
Posted May 27, 2026 - 07:06 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted May 27, 2026 - 06:43 PDT

Investigating

We are currently investigating this issue.
Posted May 27, 2026 - 05:35 PDT
This incident affected: Prod 3 (Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds), Prod 1 (Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds), and Prod 2 (Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds).