Between 1:04 PM and 2:00 PM IST on October 4th, customers in the Prod-2 environment may have experienced intermittent issues accessing the Execution List page.
One of our service pods experienced abnormally high memory usage, which caused it to stop processing new requests reliably. Requests routed to this pod timed out, while requests routed to healthy pods continued to succeed, which is why the failures were intermittent.
The affected pod was immediately removed from the cluster, allowing traffic to be routed only to healthy pods. This restored service stability, and normal operations resumed.
To prevent similar occurrences in the future, we are implementing the following:
* Proactively monitor pod health and **recycle pods as necessary to maintain performance and reliability.**
* Review and optimize memory utilization patterns within the services.
* Tune memory allocation for service pods to ensure better stability under high load.
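
As a rough illustration of the pod-recycling check described above, the sketch below flags pods whose memory usage is close to their limit. It is not our production tooling: the `PodMemory` type, the `pods_to_recycle` function, the pod names, and the 90% threshold are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PodMemory:
    """Hypothetical snapshot of one pod's memory usage (illustrative only)."""
    name: str
    used_bytes: int
    limit_bytes: int


def pods_to_recycle(pods: Iterable[PodMemory], threshold: float = 0.9) -> List[str]:
    """Return the names of pods using at least `threshold` of their memory limit.

    The 0.9 default is an assumed value, not a tuned one. Pods without a
    positive limit are skipped because a usage ratio cannot be computed.
    """
    return [
        pod.name
        for pod in pods
        if pod.limit_bytes > 0 and pod.used_bytes / pod.limit_bytes >= threshold
    ]


if __name__ == "__main__":
    # Made-up sample data: one pod near its limit, one healthy pod.
    sample = [
        PodMemory("svc-a", used_bytes=950, limit_bytes=1000),
        PodMemory("svc-b", used_bytes=400, limit_bytes=1000),
    ]
    print(pods_to_recycle(sample))
```

In practice a check like this would run on real metrics at a regular interval, and a flagged pod would be drained before being restarted so that in-flight requests are not dropped.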