On November 11th at 8:08 PM IST, Harness internal alerts were triggered for pipeline service pod restarts, indicating an uptime degradation. Investigation showed that three pipeline service pods began experiencing Java Heap OOM (Out of Memory) errors at around 7:56 PM IST. These OOM events caused the affected pods to restart and led to delays in pipeline operations. As a result, some customers in the Prod-2 environment may have observed executions entering a stuck state or taking longer than expected.
Based on log analysis, the following user-facing symptoms may have occurred intermittently:
- Pipeline executions appearing stuck or taking longer than expected
- Failures or delays on pipeline execution and listing requests routed to the impacted pods
The issue was isolated to the three impacted pods; all other healthy pods continued to serve traffic normally, limiting the overall customer impact.
The root cause was a Java Heap Out of Memory condition in three pipeline service pods. Increased memory usage led these pods to exceed their allocated heap limits, resulting in OOMKills. When these pods restarted, in-flight requests were interrupted, and new requests routed to the impacted pods experienced failures or delays until the pods recovered. The unhealthy pods temporarily impacted execution and listing endpoints until the system rerouted traffic to healthy replicas.
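For context, this class of failure can often be caught before a pod is killed by watching post-GC heap usage through the standard java.lang.management API. The sketch below is a minimal illustration of that pattern, not the pipeline service's actual code; the class name, the 80% threshold, and the stderr logging are assumptions chosen for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import javax.management.NotificationEmitter;

/**
 * Minimal sketch: raise an early warning when post-GC heap usage stays high,
 * before the JVM hits an OutOfMemoryError or the container is OOMKilled.
 * The 80% threshold is an illustrative assumption.
 */
public class HeapPressureWarning {

    public static void install() {
        // Arm a collection-usage threshold on every heap memory pool that supports it.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && pool.isCollectionUsageThresholdSupported()) {
                long max = pool.getUsage().getMax();
                if (max > 0) {
                    // Trigger when usage measured right after GC exceeds 80% of the pool's max.
                    pool.setCollectionUsageThreshold((long) (max * 0.8));
                }
            }
        }

        // The platform MemoryMXBean emits notifications when an armed threshold is crossed.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_COLLECTION_THRESHOLD_EXCEEDED
                    .equals(notification.getType())) {
                // A real service would emit a metric or page on-call here instead of logging.
                System.err.println("Heap pressure warning: post-GC usage above threshold");
            }
        }, null, null);
    }

    public static void main(String[] args) {
        install();
    }
}
```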
To be proactive and prevent similar issues from recurring,