The Prod3 environment experienced slowness and degraded performance across the board, along with intermittent MongoDB errors when accessing the application.
On Tuesday, February 4th at 9:30 AM UTC, MongoDB CPU utilization spiked suddenly to 90%, degrading cluster performance. Under this load, database connections became blocked, leaving multiple queries starved for connections and impacting user experience across the platform.
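When a pool is exhausted, a common client-side mitigation is bounded retries with exponential backoff and jitter, so transient starvation does not turn into a retry storm against an already overloaded cluster. A minimal sketch (the function and error type below are illustrative stand-ins, not our actual driver code):

```python
import random
import time


class ConnectionPoolTimeout(Exception):
    """Stand-in for a driver error raised when no pooled connection is free."""


def run_with_backoff(query, max_attempts=4, base_delay=0.1):
    """Retry a query with exponential backoff plus jitter.

    Bounding attempts keeps a momentary pool-exhaustion blip from
    becoming an unbounded retry storm against a struggling cluster.
    """
    for attempt in range(max_attempts):
        try:
            return query()
        except ConnectionPoolTimeout:
            if attempt == max_attempts - 1:
                raise
            # Backoff doubles each attempt, with up to 100 ms of jitter
            # so synchronized clients do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Real drivers expose equivalent knobs (pool size, wait-queue timeouts, retryable reads/writes); the point is that retries must be bounded and spread out.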
Timeline of Events
| Time (UTC) | Event |
|---|---|
| 4th Feb 9:30 AM | CPU usage on MongoDB rose sharply and remained elevated |
| 4th Feb 10:30 AM | Mongo connections spiked; application retries increased |
| 4th Feb 10:30 AM | Cache dirty fill ratio on Mongo reached and persisted at 20%; writes to MongoDB began to fail |
| 4th Feb 11:40 AM | Upgraded MongoDB cluster tier to M40 to handle the load |
| 4th Feb 12:00 PM | System stabilized |
MongoDB CPU utilization spiked, and memory utilization exceeded 90%, driving up system error rates. During this period the cache dirty fill ratio rose past 20% and remained elevated. At that point, Mongo application threads were pulled into cache eviction instead of executing usual database operations such as CRUD actions, replication, and other core functions.
This shift in thread activity stalled operations and drove excessive memory consumption across the nodes. As system memory utilization climbed, overall database performance degraded further, compounding the issue.
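The dirty fill ratio above is derived from WiredTiger cache counters reported by `db.serverStatus().wiredTiger.cache`; by default, WiredTiger begins recruiting application threads into eviction once dirty content crosses roughly 20% of the configured cache. A small sketch of the computation (the sample byte counts are illustrative, not values from this incident):

```python
def dirty_fill_ratio(cache_stats):
    """Compute the WiredTiger cache dirty fill ratio from serverStatus counters.

    Once this ratio crosses the dirty-eviction trigger (20% by default),
    application threads are diverted into eviction work instead of
    serving normal CRUD operations.
    """
    dirty = cache_stats["tracked dirty bytes in the cache"]
    limit = cache_stats["maximum bytes configured"]
    return dirty / limit


# Illustrative numbers: ~820 MB dirty in a 4 GB cache, just over the 20% trigger.
sample = {
    "tracked dirty bytes in the cache": 820 * 1024 * 1024,
    "maximum bytes configured": 4 * 1024 * 1024 * 1024,
}
ratio = dirty_fill_ratio(sample)
```

Tracking this ratio over time is what lets you distinguish a brief flush from the sustained pressure seen here.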
To mitigate the impact, we performed a cluster tier upscale on 02/04/25 at 03:40 AM PST (11:40 AM UTC), which alleviated the memory pressure on the affected nodes. Following the upgrade, system performance returned to acceptable levels.
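One natural follow-up suggested by the timeline is alerting when the dirty fill ratio stays above the eviction trigger for a sustained window rather than on a single sample. A hypothetical sketch of such a detector (the threshold and window size are assumptions for illustration, not our actual alert configuration):

```python
class SustainedThresholdAlert:
    """Fire only when a metric breaches a threshold for N consecutive samples.

    Requiring consecutive breaches filters out momentary spikes, so a
    page maps to sustained pressure like the one seen in this incident.
    """

    def __init__(self, threshold=0.20, consecutive=5):
        self.threshold = threshold
        self.consecutive = consecutive
        self.streak = 0  # count of consecutive samples above threshold

    def observe(self, value):
        """Record one sample; return True if the alert should fire."""
        self.streak = self.streak + 1 if value > self.threshold else 0
        return self.streak >= self.consecutive
```

Feeding this detector the dirty fill ratio (or CPU utilization) at each scrape interval would have flagged the 9:30–10:30 AM buildup well before writes started failing.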
Since the incident, the team has been actively working on several action items to prevent similar occurrences, including: