Prod-2 pipeline execution list page failing intermittently

Incident Report for Harness

Postmortem

Incident Summary

Between 1:04 PM and 2:00 PM IST on October 4th, customers in the Prod-2 environment may have experienced intermittent issues accessing the Execution List page.

Root Cause

One of our service pods experienced abnormally high memory usage and stopped processing new requests reliably. Requests routed to this pod timed out, while requests routed to healthy pods continued to succeed, which is why the failures were intermittent rather than total.
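
To illustrate why a single unhealthy pod produces intermittent rather than total failures, here is a minimal sketch. It assumes simple round-robin routing and a hypothetical pool of four pods; the pod names, the pool size, and the routing policy are illustrative assumptions, not details of the Prod-2 setup.

```python
from itertools import cycle


def simulate_round_robin(pods, unhealthy, num_requests):
    """Distribute requests round-robin and record whether each one succeeds.

    A request succeeds only if it lands on a pod that is not in `unhealthy`.
    """
    results = []
    rotation = cycle(pods)
    for _ in range(num_requests):
        pod = next(rotation)
        results.append((pod, pod not in unhealthy))
    return results


# Hypothetical pool: four pods, one of which is unresponsive.
pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
results = simulate_round_robin(pods, unhealthy={"pod-c"}, num_requests=12)

failures = sum(1 for _, ok in results if not ok)
print(f"{failures} of {len(results)} requests timed out")
```

Under these assumptions, roughly one request in four fails, so a user reloading the page would see it succeed on most attempts and time out on others.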

Impact

  • The Execution List page may have failed to load intermittently.
  • The issue was limited to a single faulty pod; other pods continued to serve requests normally.
  • No data loss or long-term impact to pipeline executions was observed.

Remediation

The affected pod was immediately removed from the cluster, allowing traffic to be routed only to healthy pods. This restored service stability, and normal operations resumed.

Preventive Action

To prevent similar occurrences in the future, we are implementing the following:

  1. Pod recycling strategy:
     • We will proactively monitor pod health and recycle pods as necessary to maintain performance and reliability.
  2. Optimization of pods and services:
     • Review and optimize memory utilization patterns within the services.
     • Tune memory allocation for service pods to ensure better stability under high load.
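
As a rough illustration of what a memory-based recycling check could look like, the sketch below selects pods whose memory usage is at or above a threshold and caps how many are recycled at once so that capacity is not removed all at the same time. The `PodStatus` type, the 90% threshold, and the 25% cap are hypothetical values chosen for the example; they do not describe the actual policy being implemented.

```python
from dataclasses import dataclass


@dataclass
class PodStatus:
    name: str
    memory_used_mib: float
    memory_limit_mib: float  # assumed to be greater than zero

    @property
    def memory_ratio(self) -> float:
        return self.memory_used_mib / self.memory_limit_mib


def pods_to_recycle(pods, threshold=0.9, max_fraction=0.25):
    """Return names of pods at or above the memory threshold, worst first.

    At most `max_fraction` of the pool (but at least one pod) is returned,
    so a recycling pass never drains most of the pool in one go.
    """
    over_threshold = sorted(
        (p for p in pods if p.memory_ratio >= threshold),
        key=lambda p: p.memory_ratio,
        reverse=True,
    )
    cap = max(1, int(len(pods) * max_fraction))
    return [p.name for p in over_threshold[:cap]]


# Hypothetical snapshot of a four-pod pool.
snapshot = [
    PodStatus("pod-a", 1945.6, 2048.0),  # ~95% of limit
    PodStatus("pod-b", 1024.0, 2048.0),  # 50%
    PodStatus("pod-c", 1884.2, 2048.0),  # ~92% of limit
    PodStatus("pod-d", 614.4, 2048.0),   # 30%
]
selected = pods_to_recycle(snapshot)
print(selected)
```

With this snapshot, two pods exceed the threshold, but the cap allows only one per pass, so the pod with the highest memory ratio is picked first and the other would be handled on a later pass.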
Posted Oct 08, 2025 - 10:20 PDT

Resolved

This incident has been resolved. Thank you for your patience.
Posted Oct 04, 2025 - 01:42 PDT

Investigating

We have received reports of intermittent failures on the pipeline Execution List page. Pipeline execution and other functionality are working as expected. Please monitor this page for further updates on this issue.
Posted Oct 04, 2025 - 01:30 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).