Some customers on Prod1 may be experiencing degraded performance

Incident Report for Harness

Postmortem

Summary

On October 16th, 2024, our Prod1 environment experienced a significant increase in service response times and an elevated rate of 5xx errors. This led to degraded performance and outages for several services; in particular, the NG-Manager pods went into an unhealthy state and restarted multiple times.

What caused the issue

The issue was caused by an overload on one of our backend service databases, due to a large number of background tasks being re-assigned at once (a sketch of this pattern follows the list below). The surge in tasks was triggered by delegate disconnections, which were in turn caused by a spike in CPU usage on the Ingress pod.

The overload on the database led to:

  • Increased memory usage
  • Slow database queries
  • Service pods restarting due to unhealthy states
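A minimal sketch of the failure mode described above, assuming a hypothetical TaskService-style layout (the interfaces, class, and method names here are illustrative only, not the actual Harness implementation): when a delegate is marked disconnected, every task it owned is reset and re-queued in one pass, so a brief burst of disconnections translates directly into a burst of database writes.

    import java.util.List;

    // Illustrative sketch only; names are hypothetical, not Harness APIs.
    class DisconnectDrivenTaskReset {

        interface DelegateRegistry {
            List<String> findDisconnectedDelegates();
        }

        interface TaskStore {
            List<String> findTasksAssignedTo(String delegateId);
            void resetAndRequeue(String taskId); // one database write per task
        }

        private final DelegateRegistry delegates;
        private final TaskStore tasks;

        DisconnectDrivenTaskReset(DelegateRegistry delegates, TaskStore tasks) {
            this.delegates = delegates;
            this.tasks = tasks;
        }

        // Problematic pattern: a short spike of disconnections (e.g. from
        // Ingress CPU pressure) re-queues every owned task at once,
        // flooding the backing database with writes.
        void run() {
            for (String delegateId : delegates.findDisconnectedDelegates()) {
                for (String taskId : tasks.findTasksAssignedTo(delegateId)) {
                    tasks.resetAndRequeue(taskId);
                }
            }
        }
    }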

Resolution

The following steps were taken to mitigate the issue:

  1. Increased the size of the MongoDB instance.
  2. Stopped ~1200 background tasks that were running, which helped reduce the load on the database.

These actions led to system recovery, and the NG-Manager pods returned to a healthy state.

Follow-up Actions

To prevent similar issues in the future, we are implementing the following changes:

  • Improved Background Task Handling: Modify task reset jobs to depend on the task heartbeat rather than on delegate disconnection status (see the heartbeat sketch after this list).
  • MongoDB Autoscaling: Enable autoscaling for MongoDB to handle CPU and memory spikes.
  • Rate-limiting of Instance Sync Requests: Implement throttling to ensure the database is not overwhelmed during peak activity (see the throttling sketch after this list).
  • Enhanced Monitoring and Alerts: Add alerts for MongoDB resource usage and instance sync updates to catch potential issues earlier.
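
A minimal sketch of the heartbeat-based reset described in the first follow-up item, under the assumption that each task records a last-heartbeat timestamp; the TaskStore interface and the timeout value are illustrative assumptions, not the actual Harness implementation. Only tasks whose own heartbeat has gone stale are reset, so a transient delegate disconnection no longer triggers a bulk re-assignment.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Illustrative sketch only; interfaces and names are hypothetical.
    class HeartbeatDrivenTaskReset {

        interface TaskStore {
            // Tasks whose last recorded heartbeat is older than the cutoff.
            List<String> findTasksWithHeartbeatOlderThan(Instant cutoff);
            void resetAndRequeue(String taskId);
        }

        private final TaskStore tasks;
        private final Duration heartbeatTimeout;

        HeartbeatDrivenTaskReset(TaskStore tasks, Duration heartbeatTimeout) {
            this.tasks = tasks;
            this.heartbeatTimeout = heartbeatTimeout;
        }

        // Reset only tasks that have themselves stopped heartbeating,
        // independent of the delegate's connection status, so a brief
        // disconnection does not re-queue healthy tasks.
        void run() {
            Instant cutoff = Instant.now().minus(heartbeatTimeout);
            for (String taskId : tasks.findTasksWithHeartbeatOlderThan(cutoff)) {
                tasks.resetAndRequeue(taskId);
            }
        }
    }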
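
A minimal sketch of the throttling described in the rate-limiting item, using a simple token-bucket limiter; the class name and the chosen rate are illustrative assumptions, not the actual implementation. Sync requests that exceed the budget are deferred rather than written to the database immediately.

    import java.util.concurrent.TimeUnit;

    // Illustrative token-bucket sketch; names and limits are assumptions.
    class InstanceSyncThrottle {

        private final long capacity;
        private final double refillPerNano;
        private double tokens;
        private long lastRefillNanos;

        InstanceSyncThrottle(long requestsPerSecond) {
            this.capacity = requestsPerSecond;
            this.refillPerNano = requestsPerSecond / (double) TimeUnit.SECONDS.toNanos(1);
            this.tokens = requestsPerSecond;
            this.lastRefillNanos = System.nanoTime();
        }

        // Returns true if the sync request may proceed now; callers should
        // defer the update (e.g. re-queue it with a delay) when this
        // returns false.
        synchronized boolean tryAcquire() {
            long now = System.nanoTime();
            tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
            lastRefillNanos = now;
            if (tokens >= 1.0) {
                tokens -= 1.0;
                return true;
            }
            return false;
        }
    }

A caller handling instance sync updates would check tryAcquire() before issuing the database write and re-queue the update otherwise, keeping peak-time write load on the database bounded.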
Posted Jan 27, 2025 - 13:13 PST

Resolved

This incident has been resolved.

We will provide an RCA after findings are complete.
Posted Oct 16, 2024 - 11:03 PDT

Monitoring

The issue has been mitigated.

We are still monitoring the system to ensure healthy operation of the cluster.
Posted Oct 16, 2024 - 10:49 PDT

Identified

We have identified the service that is causing the degradation and have scaled up the database resources for that service.

We are still working to mitigate the issue.
Posted Oct 16, 2024 - 10:37 PDT

Investigating

We have identified an internal issue that is impacting performance for Prod1 customers. We are actively investigating it.
Posted Oct 16, 2024 - 10:09 PDT
This incident affected: Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Mac Cloud Builds).