Harness performance degraded in Prod3.

Incident Report for Harness

Postmortem

Summary

The Prod3 environment experienced slowness and degraded performance across the board, along with intermittent MongoDB errors while accessing the application.

What was the issue?

On Tuesday, February 4th at 9:30 AM UTC, we observed a sudden spike in MongoDB utilization, reaching 90% CPU usage, which degraded cluster performance. Under this load, DB connections became blocked, causing multiple queries to starve for connections and impacting user experience across the platform.
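
For context, this kind of connection starvation is visible in MongoDB's serverStatus output. Below is a minimal sketch using pymongo; the connection string is a placeholder and the 10% warning threshold is an illustrative assumption, not a value from this incident.

    # Sketch: observing connection saturation via serverStatus (pymongo).
    # The URI and the 10% threshold are placeholders, not incident values.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
    conns = client.admin.command("serverStatus")["connections"]

    current = conns["current"]      # connections currently open
    available = conns["available"]  # connections the server can still accept
    print(f"open: {current}, available: {available}")

    # When "available" trends toward zero while "current" climbs, incoming
    # queries queue for connections -- the starvation seen in this incident.
    if available < 0.10 * (current + available):
        print("WARNING: connection capacity nearing exhaustion")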

Resolution

  • The team investigated the issue and identified MongoDB degradation as the root cause.
  • Increased the resources for the MongoDB cluster, which allowed it to process the load and returned the system to regular operation.

Timeline

Time (UTC)        Event
Feb 4, 9:30 AM    CPU usage on MongoDB rose sharply and persisted
Feb 4, 10:30 AM   Mongo connections spiked; application retries increased
Feb 4, 10:30 AM   Cache dirty fill ratio on Mongo reached and persisted at 20%; writes to MongoDB started to fail
Feb 4, 11:40 AM   Upgraded MongoDB cluster tier to M40 to handle the load
Feb 4, 12:00 PM   System stabilized

RCA

We observed that MongoDB CPU utilization spiked and that memory utilization exceeded 90%, leading to an increase in system error rates. During this period, the cache dirty fill ratio rose past 20% and remained elevated. At that point, MongoDB application threads were pulled into cache eviction instead of executing normal database operations such as CRUD actions, replication, and other core functions.

This shift in thread activity caused operations to stall, leading to excessive memory consumption across the nodes. As system memory utilization increased, overall database performance degraded, further compounding the issue.
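
The dirty fill ratio above comes from the WiredTiger section of serverStatus. The sketch below shows how it can be computed; the field names are the standard WiredTiger cache counters, the connection string is a placeholder, and the 20% threshold mirrors the value observed here, which is also WiredTiger's default point for recruiting application threads into eviction.

    # Sketch: computing the WiredTiger cache dirty fill ratio (pymongo).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

    dirty_bytes = cache["tracked dirty bytes in the cache"]
    max_bytes = cache["maximum bytes configured"]
    dirty_ratio = dirty_bytes / max_bytes
    print(f"cache dirty fill ratio: {dirty_ratio:.1%}")

    # Past roughly 20% dirty, application threads are pulled into eviction
    # instead of serving operations -- the stall mode described in this RCA.
    if dirty_ratio > 0.20:
        print("WARNING: application threads likely performing eviction")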

To mitigate the impact, we performed a cluster tier upscale on 02/04/25 at 03:40 AM PST, which successfully alleviated the memory pressure on the affected nodes. Following the upgrade, we observed that system performance returned to acceptable levels.

Action Items

Since the incident, the team has been actively working on several action items to prevent similar occurrences, including:

  • Query Optimization: Identifying and optimizing slow-running queries to reduce load on the database (a sketch follows this list).
  • Scaling Strategy: Evaluating a proactive cluster tier auto scaling approach to handle traffic spikes efficiently.
  • Monitoring & Alerts: Enhancing monitoring to detect query bottlenecks earlier and prevent performance degradation.
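
As one illustration of the query-optimization item, MongoDB's built-in database profiler can surface slow operations. The sketch below is a hypothetical example rather than our production tooling: the database name ("appdb") and the 100 ms slow-operation threshold are assumptions.

    # Sketch: surfacing slow queries with the database profiler (pymongo).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    db = client["appdb"]                               # hypothetical database

    # Profiling level 1 records only operations slower than `slowms`.
    db.command("profile", 1, slowms=100)

    # Captured operations land in system.profile; sort by duration (millis)
    # to find the heaviest queries to optimize or index.
    for op in db["system.profile"].find().sort("millis", -1).limit(5):
        print(op.get("millis"), op.get("op"), op.get("ns"))
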
Posted Feb 10, 2025 - 15:58 PST

Resolved

This incident is now resolved.
Please monitor this page for the postmortem report.
Posted Feb 04, 2025 - 05:07 PST

Monitoring

The issue has been mitigated and we are currently monitoring the system.
Posted Feb 04, 2025 - 04:00 PST

Investigating

We are actively investigating the service degradation issue in the Prod3 environment.
Posted Feb 04, 2025 - 03:23 PST
This incident affected: Prod 3 (Continuous Delivery - Next Generation (CDNG)).