Feature Flag service down on Prod1, 2 and 3.

Incident Report for Harness

Postmortem

Summary

Between 5:42 PM and 7:00 PM UTC on Thursday, Aug 7th, 2025, customers using Feature Flags in the Prod1, Prod2, and Prod3 clusters encountered “Failed to fetch (5xx)” errors when attempting to read or evaluate feature flags from the backend database. All other Harness modules (including CI, CD, STO, SSCA, Code, AR, and Pipeline) remained fully functional.

Root Cause

As part of enhancing our Disaster Recovery (DR) capabilities to improve the overall availability of the Harness platform, we introduced a new architecture that relies on backup-and-restore mechanisms during disaster scenarios. However, this change inadvertently brought down the replica database(s) that were serving read traffic for the Feature Flag service. Under normal circumstances, no service in the live (primary) region should connect to resources in the DR region; the Feature Flag service's dependency was an isolated, one-off case.

Impact

During the incident, some customers were unable to retrieve and evaluate feature flag definitions when connecting to Harness services. In those cases, flag evaluations returned the default value configured in the customer’s application.
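
The fallback behavior described above can be illustrated with a minimal sketch. This is not the Harness SDK: the names `FlagFetchError`, `evaluate_flag`, and the fetcher functions are hypothetical and exist only to show how an application might fall back to its configured default when the backend returns a 5xx error.

```python
from typing import Callable


class FlagFetchError(Exception):
    """Hypothetical error raised when the flag backend returns a 5xx."""


def evaluate_flag(fetch: Callable[[str], bool], flag_key: str, default: bool) -> bool:
    """Return the fetched flag value, or the caller's default if the fetch fails."""
    try:
        return fetch(flag_key)
    except FlagFetchError:
        # Backend unavailable: serve the default configured in the application.
        return default


def failing_fetch(flag_key: str) -> bool:
    """Simulates the backend during the incident (assumption, for illustration)."""
    raise FlagFetchError("Failed to fetch (5xx)")


def healthy_fetch(flag_key: str) -> bool:
    """Simulates a healthy backend where the flag is enabled."""
    return True
```

With `failing_fetch`, a call such as `evaluate_flag(failing_fetch, "new_checkout", False)` returns the default `False`; with `healthy_fetch`, it returns the stored value.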

Timeline

Remediation

  • Immediate: We rolled back to the previous configuration.

Action Items

  1. Rollout and Testing

    1. Roll out changes gradually, one cluster at a time, with a buffer between clusters.
    2. Update and run sanity tests for the Feature Flag module after infrastructure changes.
    3. Update the testing strategy to include a synthetic test that replicates the production workflow for the FF service.
  2. Evaluate and enhance Feature Flag architecture

    1. Audit all services of the Feature Flag product and optimize them to reduce cross-region traffic, so that DR replicas are used strictly for failover and not for online traffic.
  3. Improve Change Management Process

    1. Add monitoring and further enhance the rollback process so that mitigation is faster.
    2. Improve internal processes by adding further review of changes and broader communication to announce changes and support stakeholders.
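
The rollout-related action items above (one cluster at a time, a buffer between clusters, a sanity check after each change, and a fast rollback) could be sketched roughly as follows. This is an illustrative outline only, not Harness tooling: `staged_rollout` and its callback parameters are hypothetical names.

```python
import time
from typing import Callable, List, Sequence


def staged_rollout(
    clusters: Sequence[str],
    apply_change: Callable[[str], None],
    sanity_check: Callable[[str], bool],
    rollback: Callable[[str], None],
    buffer_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Apply a change one cluster at a time and return the clusters that passed.

    If the sanity check fails on a cluster, that cluster is rolled back and
    the rollout stops, so later clusters are never touched.
    """
    succeeded: List[str] = []
    for index, cluster in enumerate(clusters):
        apply_change(cluster)
        if not sanity_check(cluster):
            rollback(cluster)
            break
        succeeded.append(cluster)
        # Wait before moving on, except after the last cluster.
        if index < len(clusters) - 1:
            sleep(buffer_seconds)
    return succeeded
```

The `sleep` parameter is injectable so that the buffer can be skipped in tests; in a real rollout the sanity check would exercise the Feature Flag read path, in the spirit of the synthetic test mentioned above.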
Posted Aug 08, 2025 - 23:05 PDT

Resolved

All systems healthy
Posted Aug 07, 2025 - 12:07 PDT

Update

We are continuing to monitor for any further issues.
Posted Aug 07, 2025 - 12:06 PDT

Monitoring

Prod 1 and Prod 3 are operational; Prod 2 is recovering and we are monitoring.
Posted Aug 07, 2025 - 11:34 PDT

Update

Prod 1, Prod 2 and Prod 3 are recovering
Posted Aug 07, 2025 - 11:16 PDT

Identified

The issue has been identified and services are coming back up.
Posted Aug 07, 2025 - 11:13 PDT

Update

We have identified the issue and are working to remediate it as soon as possible.
Posted Aug 07, 2025 - 11:08 PDT

Update

Teams are involved and we are continuing to investigate.
Posted Aug 07, 2025 - 11:06 PDT

Investigating

We are currently investigating this issue.
Posted Aug 07, 2025 - 10:47 PDT
This incident affected: Prod 3 (Feature Flags (FF)), Prod 1 (Feature Flags (FF)), and Prod 2 (Feature Flags (FF)).