Customers may experience intermittent “stuck” pipelines in Prod 2

Incident Report for Harness

Postmortem

Summary

On November 11th at 8:08 PM IST, Harness internal alerts were triggered for pipeline service pod restarts, indicating an uptime degradation. Investigation showed that three pipeline service pods began experiencing Java Heap OOM (Out of Memory) issues around 7:56 PM IST. These OOM events caused the affected pods to restart and led to delays in pipeline operations. As a result, some customers in the Prod-2 environment may have observed executions entering a stuck state or taking longer than expected.

Impact

  • Three pipeline service pods restarted due to Java Heap OOM, which led to intermittent disruptions in pipeline-related operations.
  • Based on log analysis, the following user-facing symptoms may have occurred intermittently:

    • Execution List / Execution View pages failing to load.
    • Pipeline List page failing to load.
    • Some executions entering a stuck state or failing during the period when the OOM-affected pods were serving requests.
  • The issue was isolated to the three impacted pods; all other healthy pods continued to serve traffic normally, limiting the overall customer impact.

Root Cause

The root cause was a Java Heap Out of Memory condition in three pipeline service pods. Increased memory usage led these pods to exceed their allocated heap limits, resulting in OOMKills. When these pods restarted, in-flight requests were interrupted, and new requests routed to the impacted pods experienced failures or delays until the pods recovered. The unhealthy pods temporarily impacted execution and listing endpoints until the system rerouted traffic to healthy replicas.
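
For readers less familiar with this failure mode, the snippet below is a deliberately trivial illustration of how a Java process exhausts its heap and hits java.lang.OutOfMemoryError: Java heap space. The class name, allocation size, and suggested -Xmx value are illustrative assumptions and are unrelated to the pipeline service's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reproduces the general failure class seen in this incident
// (java.lang.OutOfMemoryError: Java heap space), not the pipeline service's workload.
public class HeapExhaustionDemo {
    public static void main(String[] args) {
        List<byte[]> retained = new ArrayList<>();
        while (true) {
            // Each iteration retains another 10 MB; heap usage grows until the JVM
            // throws OutOfMemoryError (run with a small heap, e.g. -Xmx64m, to see it quickly).
            retained.add(new byte[10 * 1024 * 1024]);
        }
    }
}
```

When heap growth of this kind also pushes a container past its memory limit, Kubernetes restarts the pod, which matches the pod restarts observed here.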

Action Items

  1. As an immediate fix, increase or right-size the Java heap allocation for the pipeline service pods to prevent future OOM conditions.
  2. Add enhanced memory utilization monitoring and alerts to detect early signs of heap pressure (a minimal monitoring sketch follows this list).
  3. To proactively prevent similar issues from recurring:

    1. Perform heap profiling on pipeline service workloads to identify memory hotspots or leaks.
    2. Implement circuit-breaking or automated eviction of unhealthy pods to prevent traffic from being routed to pods entering OOM states.
    3. Improve autoscaling thresholds to ensure new pods are provisioned before resource saturation occurs.
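
As referenced in action item 2, the following is a minimal sketch of in-process heap-pressure detection using the standard Java JMX memory APIs. The 85% threshold, the HeapPressureMonitor class name, and the console logging are illustrative assumptions; a production setup would typically export a metric or page on-call rather than print to stdout.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.NotificationEmitter;

public class HeapPressureMonitor {
    public static void main(String[] args) throws InterruptedException {
        // Arm a usage threshold on every heap pool that supports one
        // (in practice this is the old/tenured generation).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {
                    // 85% is an assumed threshold, not a Harness-specific value.
                    pool.setUsageThreshold((long) (max * 0.85));
                }
            }
        }

        // The MemoryMXBean emits a JMX notification when an armed threshold is crossed.
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        ((NotificationEmitter) memoryBean).addNotificationListener(
                (notification, handback) ->
                        // A real service would export a metric or page on-call here.
                        System.out.println("Heap pressure warning: " + notification.getType()),
                null, null);

        Thread.sleep(Long.MAX_VALUE); // keep the demo process alive
    }
}
```

Arming a threshold this way surfaces heap pressure before a pod actually exhausts its heap and is restarted, which is the early-warning behavior the action item describes.
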
Posted Nov 30, 2025 - 22:25 PST

Resolved

This incident has been resolved.
Posted Nov 11, 2025 - 13:00 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 11, 2025 - 12:05 PST

Update

Rerunning pipelines may be successful. Harness is currently investigating.
Posted Nov 11, 2025 - 10:29 PST

Investigating

We are currently investigating an issue that started at 6:41 AM PT.
Posted Nov 11, 2025 - 10:19 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).