On October 16th, 2024, our Prod1 environment experienced a significant increase in service response times and multiple 5xx errors. This led to degraded performance and outages for several services, including the NG-Manager pods, which became unhealthy and restarted multiple times.
The issue was caused by an overload on one of the backend service databases, which occurred when a large number of background tasks were re-assigned at once. This surge in tasks was triggered by delegate disconnections, which were in turn caused by a spike in CPU usage on the Ingress pod.
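To illustrate why re-assigning many tasks at once can overload a database, the following minimal sketch compares the peak number of writes in any one interval for an all-at-once re-assignment against a batched one. It is a simplified model, not the affected system: the function name, the task count, the batch size, and the assumption that each re-assignment costs exactly one database write are all hypothetical.

```python
from collections import Counter


def peak_writes_per_tick(num_tasks: int, batch_size: int) -> int:
    """Return the largest number of re-assignment writes issued in one tick.

    Hypothetical model: each task re-assignment is one database write, and
    at most `batch_size` tasks are re-assigned per tick.
    """
    if num_tasks < 0 or batch_size <= 0:
        raise ValueError("num_tasks must be >= 0 and batch_size must be > 0")

    writes_per_tick = Counter()
    for task_index in range(num_tasks):
        # Tasks are handed out in order, batch_size per tick.
        tick = task_index // batch_size
        writes_per_tick[tick] += 1

    # No tasks means no writes in any tick.
    return max(writes_per_tick.values(), default=0)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the incident.
    burst = peak_writes_per_tick(num_tasks=10_000, batch_size=10_000)
    paced = peak_writes_per_tick(num_tasks=10_000, batch_size=500)
    print(f"peak writes per tick, all at once: {burst}")
    print(f"peak writes per tick, batched:     {paced}")
```

In this model the total number of writes is the same in both cases; only the peak load seen by the database in a single interval changes.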
The overload on the database led to:
The following steps were taken to mitigate the issue:
These actions led to system recovery, and the NG-Manager pods returned to a healthy state.
To prevent similar issues in the future, we are implementing the following changes: