On April 6th, during our scheduled production deployment, multiple customers could not log in because the services failed to start due to issues with the index build.
Our monitoring system alerted us to the issue. Upon investigation, we identified an unexpected heavy load on the database resulting in service failures . In response, we initiated a system rollback which resolved the issue.
As part of our planned deployment in the production environment (prod1), indexes are created during service startup. However, the combination of high I/O activity on a specific collection and concurrent index creation led to resource contention in MongoDB due to locking which remained longer than usual.
As a result, few critical services failed to start up causing the login issue. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause of the issue from their side.