PROD1: Unable to login

Incident Report for Harness

Postmortem

Summary 

On April 6th, during our scheduled production deployment, multiple customers could not log in because several critical services failed to start due to problems with index builds at startup.

Resolution

Our monitoring system alerted us to the issue. Upon investigation, we identified an unexpectedly heavy load on the database, which was causing service failures. In response, we initiated a system rollback, which resolved the issue.

RCA

As part of our planned deployment to the production environment (prod1), indexes are created during service startup. However, the combination of high I/O activity on a specific collection and concurrent index creation caused resource contention in MongoDB, with locks held longer than usual.

As a result, a few critical services failed to start, causing the login issue. We are awaiting a root cause analysis (RCA) from our managed MongoDB service provider to understand the underlying cause on their side.

Action Items

  • Index creation during service startup has been disabled as part of the deployment process.
  • We are actively working with MongoDB support to investigate and identify the root cause of the issue.
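
For illustration only, the first action item can be sketched as a startup path that checks for required indexes without building them, with the build moved to a separate maintenance step. This is a minimal sketch under stated assumptions, not the actual Harness implementation: the function names and the shape of the index specs are hypothetical, and the collection argument is assumed to expose PyMongo-style `index_information()` and `create_index()` methods.

```python
from typing import Any, List, Sequence, Tuple

# Hypothetical spec shape: (index name, [(field, direction), ...])
IndexSpec = Tuple[str, Sequence[Tuple[str, int]]]


def missing_indexes(collection: Any, required: Sequence[IndexSpec]) -> List[IndexSpec]:
    """Return the required indexes that do not yet exist on the collection."""
    # index_information() returns a dict keyed by index name (PyMongo-style).
    existing = collection.index_information()
    return [spec for spec in required if spec[0] not in existing]


def check_indexes_at_startup(collection: Any, required: Sequence[IndexSpec]) -> List[str]:
    """Startup path: report missing index names, never build them."""
    return [name for name, _ in missing_indexes(collection, required)]


def build_indexes_out_of_band(collection: Any, required: Sequence[IndexSpec]) -> List[str]:
    """Maintenance path: build missing indexes one at a time, outside startup."""
    built = []
    for name, keys in missing_indexes(collection, required):
        collection.create_index(list(keys), name=name)
        built.append(name)
    return built
```

In this arrangement a service can log or alert on the names returned by `check_indexes_at_startup` and continue starting, while `build_indexes_out_of_band` is run deliberately, for example during a low-traffic window.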
Posted Apr 18, 2025 - 18:20 PDT

Resolved

This incident has been resolved.
We will publish an RCA.
Posted Apr 06, 2025 - 23:56 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 06, 2025 - 23:08 PDT

Investigating

The Harness service is currently unavailable. We are working to identify the root cause and restore service as soon as possible.
Posted Apr 06, 2025 - 22:46 PDT
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository).