Availability for a subset of customers on Prod1/Prod2

Incident Report for Harness

Postmortem

Summary

On Dec 02, 2025, from 9:12:30 AM PST to 9:26 AM PST, the Harness Gateway service experienced intermittent restarts, which temporarily disrupted request processing for some customers.

Customer Impact: The Harness UI was either unavailable or slow. No running pipelines were affected.

Root Cause

During this period, traffic volume doubled compared to normal levels. This unexpected surge in traffic overwhelmed the database because a key index was missing, which in turn caused the Gateway service to become temporarily unstable.
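
As a rough illustration of why a missing index matters under load, the sketch below uses Python's built-in sqlite3 module to compare the query plan for the same lookup before and after an index is added. The table, column, and index names are hypothetical and are not taken from the Harness schema; the incident database is also not necessarily SQLite. Without the index the engine scans every row for each request, so doubling the request rate doubles an already expensive workload.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable detail of each EXPLAIN QUERY PLAN row."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The fourth column of each row holds the plan description.
    return " | ".join(row[3] for row in rows)


# Hypothetical table standing in for whatever the Gateway queries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions (id INTEGER PRIMARY KEY, account_id TEXT, token TEXT)"
)
conn.executemany(
    "INSERT INTO sessions (account_id, token) VALUES (?, ?)",
    [(f"acct-{i % 100}", f"tok-{i}") for i in range(10_000)],
)

lookup = "SELECT token FROM sessions WHERE account_id = ?"

# Before the index exists, the planner has to scan the whole table.
plan_without_index = query_plan(conn, lookup, ("acct-7",))

# After adding the (hypothetical) index, the planner can search it directly.
conn.execute("CREATE INDEX idx_sessions_account ON sessions (account_id)")
plan_with_index = query_plan(conn, lookup, ("acct-7",))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact plan wording varies by SQLite version, but the first plan reports a full scan and the second reports a search that uses the index.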

Mitigation Actions

  • Immediate Response

    • We rate-limited the traffic surge by implementing filters based on traffic patterns and fingerprints.
  • Resolution

    • The missing database index has been added, and the Gateway service is now operating normally and handling increased traffic without issues.
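
The fingerprint-based rate limiting described under Immediate Response can be sketched roughly as a token bucket kept per request fingerprint. This is a minimal illustration under stated assumptions, not the filter Harness deployed: the class name, the `fingerprint` helper, and the choice of client IP plus user agent as the fingerprint are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


def fingerprint(client_ip: str, user_agent: str) -> str:
    """Hypothetical fingerprint: combine request attributes into one key."""
    return f"{client_ip}|{user_agent}"


class FingerprintRateLimiter:
    """Token-bucket limiter that tracks one bucket per fingerprint."""

    def __init__(
        self,
        rate_per_sec: float,
        burst: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = float(burst)     # maximum tokens a bucket can hold
        self.clock = clock            # injectable so tests can control time
        # fingerprint -> (tokens remaining, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, fp: str) -> bool:
        """Return True if a request with this fingerprint may proceed."""
        now = self.clock()
        tokens, last = self._buckets.get(fp, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[fp] = (tokens, now)
        return allowed
```

A gateway would call `allow(fingerprint(ip, user_agent))` for each incoming request and reject the request when it returns False. Keying the buckets by fingerprint throttles the clients driving the surge while leaving other traffic unaffected.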

Next Steps and Improvements

  • To prevent such issues and reduce their impact, additional detection measures to block traffic surges are being implemented.
  • We are proactively optimizing the system and adding further testing and monitoring to ensure that our services remain performant and reliable during unexpected traffic spikes.
Posted Dec 03, 2025 - 10:51 PST

Resolved

This incident has been resolved. We will share the RCA as soon as the analysis is completed.
Posted Dec 02, 2025 - 10:37 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Dec 02, 2025 - 09:40 PST

Update

We are continuing to work on a fix for this issue.
Posted Dec 02, 2025 - 09:35 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Dec 02, 2025 - 09:34 PST

Investigating

We are currently investigating this issue.
Posted Dec 02, 2025 - 09:24 PST
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform) and Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform).