PROD2: Login is failing

Incident Report for Harness

Postmortem

Summary 

On April 8th, in preparation for our scheduled deployment, we started an index build. This caused the database to become unresponsive, resulting in login failures for a few customers.

Resolution

Our monitoring systems alerted us to the issue. In response, we initiated an index rollback to restore database responsiveness and mitigate customer impact.

RCA

To support upcoming changes in the new deployment, we followed best practices and suggestions from MongoDB and began index creation ahead of time.

However, high I/O activity on the target collection caused both index and data storage to consume significantly more space than anticipated. The increased storage and index size led to poor database performance. This was a result of how our managed MongoDB service provider handles storage management internally.

As a result, the database became unresponsive, leading to login failures. We are currently awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause on their side.
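
For illustration only, the kind of storage growth described above could be watched for with a simple guard that compares collection statistics taken before and during an index build. This is a minimal sketch, not the tooling used in this incident. It assumes the statistics come from MongoDB's `collStats` command, which reports `storageSize` and `totalIndexSize` in bytes. The function names and the `max_growth` threshold of 1.5 are hypothetical.

```python
def total_footprint(stats):
    """Sum data and index storage (bytes) from a collStats-style dict."""
    return stats["storageSize"] + stats["totalIndexSize"]


def growth_ratio(before, after):
    """Ratio of the current storage footprint to the pre-build footprint."""
    baseline = total_footprint(before)
    if baseline <= 0:
        raise ValueError("baseline footprint must be positive")
    return total_footprint(after) / baseline


def should_abort_build(before, after, max_growth=1.5):
    """Return True when growth exceeds the (hypothetical) allowed ratio."""
    return growth_ratio(before, after) > max_growth


if __name__ == "__main__":
    # Made-up numbers, not measurements from this incident.
    before = {"storageSize": 100, "totalIndexSize": 20}
    during = {"storageSize": 150, "totalIndexSize": 60}
    print(should_abort_build(before, during))
```

In practice the two dicts would be sampled from the live collection at intervals, and a result of `True` would trigger an alert or a rollback of the build.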

Action Items

  • As a short-term measure, we have disabled index building on the affected database collection.
  • We are actively working with MongoDB support to investigate and identify the root cause of the issue.
Posted Apr 18, 2025 - 18:25 PDT

Resolved

The incident has been resolved.
A detailed Root Cause Analysis (RCA) will be shared.
Posted Apr 08, 2025 - 03:26 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 08, 2025 - 02:42 PDT

Investigating

We are currently investigating this issue.
Posted Apr 08, 2025 - 02:37 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 08, 2025 - 02:32 PDT

Investigating

We are currently investigating this issue.
Posted Apr 08, 2025 - 02:24 PDT
This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository).