Prod3 experiences slowness in pipelines

Incident Report for Harness

Postmortem

Summary

A rollout involving OpenTelemetry instrumentation changes introduced a memory leak in the OTEL eBPF collector running in production clusters. Under sustained production traffic, the leak caused increasing JVM heap utilization, elevated garbage collection pressure, and eventual out-of-memory (OOM) conditions across several core platform services.

Impact

  • Elevated latency and intermittent instability in Prod1, Prod2, and Prod3
  • Some customers experienced slow pipeline execution and degraded responsiveness

No customer data loss occurred.

Root Cause

The root cause was an upstream defect in the OpenTelemetry eBPF instrumentation library that introduced a memory leak under production-scale workloads. The leak continuously increased telemetry-related memory consumption, leading to sustained JVM garbage collection pressure and eventual heap exhaustion.
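
The pattern described above, memory use that keeps climbing until the heap is exhausted, can be detected from periodic heap samples. The following is a minimal sketch, not the tooling used during this incident: it fits a least-squares line to hypothetical (timestamp, heap bytes used) samples and estimates how long remains before a given heap limit is reached. The function names, the sample format, and the limit passed in are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (seconds since start, heap bytes used after a GC cycle)
Sample = Tuple[float, float]


def heap_growth_slope(samples: Sequence[Sample]) -> float:
    """Return the least-squares slope of heap usage in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    return cov / var_t


def seconds_until_exhaustion(
    samples: Sequence[Sample], heap_limit_bytes: float
) -> Optional[float]:
    """Estimate seconds until the heap limit is hit, or None if usage is not growing."""
    slope = heap_growth_slope(samples)
    if slope <= 0:
        return None
    _, latest_used = samples[-1]
    return max(0.0, (heap_limit_bytes - latest_used) / slope)
```

Sampling usage after garbage collection, rather than at arbitrary moments, is a deliberate choice in this sketch: a healthy service shows a roughly flat post-collection floor, while a leak shows a floor that rises steadily.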

Mitigation and Recovery

Immediate Actions

  • Scaled up the impacted clusters to stabilize them
  • Disabled OTEL instrumentation components and restarted affected services

Next Steps

To prevent such issues from happening again, we are:

  • Enhancing our load testing process to test under higher workloads, so that such issues are identified before going to production.
  • Adding more granular instrumentation to catch such issues sooner.
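
One form that finer-grained instrumentation could take is an alert on garbage collection overhead, meaning the share of wall-clock time a service spends in GC. The sketch below is illustrative only and does not describe the alerting actually deployed. It assumes each service reports a cumulative GC time counter in milliseconds (the JVM exposes one through `GarbageCollectorMXBean.getCollectionTime()`); the 10% threshold, the three-interval rule, and all function names are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical reading format: (wall-clock time in ms, cumulative GC time in ms)
Reading = Tuple[float, float]


def gc_overhead(prev: Reading, curr: Reading) -> float:
    """Return the fraction of the interval between two readings spent in GC."""
    elapsed_ms = curr[0] - prev[0]
    if elapsed_ms <= 0:
        raise ValueError("readings must be in increasing time order")
    gc_ms = curr[1] - prev[1]
    if gc_ms < 0:
        # The counter went backwards, e.g. after a process restart:
        # count only the GC time accumulated since the reset.
        gc_ms = curr[1]
    return min(1.0, gc_ms / elapsed_ms)


def sustained_gc_pressure(
    readings: Sequence[Reading],
    threshold: float = 0.10,
    consecutive: int = 3,
) -> bool:
    """Return True if GC overhead exceeds the threshold for N intervals in a row."""
    streak = 0
    for prev, curr in zip(readings, readings[1:]):
        if gc_overhead(prev, curr) > threshold:
            streak += 1
            if streak >= consecutive:
                return True
        else:
            streak = 0
    return False
```

Requiring several consecutive breaches keeps a single long collection from firing the alert, while a leak that steadily increases GC pressure would still trip it within a few intervals.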
Posted May 11, 2026 - 14:09 PDT

Resolved

This incident has been resolved.
Posted May 01, 2026 - 13:10 PDT

Update

The issue is largely mitigated and most pipelines are running normally. We are monitoring all parameters to make sure there are no remaining issues before closing the incident.
Posted May 01, 2026 - 12:58 PDT

Update

We are continuing to monitor for any further issues.
Posted May 01, 2026 - 10:15 PDT

Update

We are continuing to monitor for any further issues.
Posted May 01, 2026 - 10:14 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 01, 2026 - 09:56 PDT

Investigating

We are currently investigating this issue.
Posted May 01, 2026 - 09:28 PDT
This incident affected: Prod 3 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, FME).