At 17:30 UTC, the FF streaming infrastructure experienced a 6x spike in simultaneous connection attempts across customers. This overwhelmed the feature flag service, which is responsible for maintaining long-lived streaming connections, and triggered repeated pod crashes due to memory exhaustion.
During the incident:
Streaming was unavailable.
SDK polling remained unaffected, allowing flag updates to still propagate.
Root Cause
The issue resulted from a convergence of two known risks:
Traffic Surge Loop: The volume of incoming /stream connection requests exceeded service capacity. Once pods restarted, SDKs immediately attempted to reconnect, triggering another wave of load and creating a crash loop.
SDK Behavior (Node): A bug in Node SDK versions prior to 1.8.6 causes them to retry reconnection at an exponentially increasing rate when the stream endpoint is unreachable. This behavior amplified overall traffic, placing additional stress on the system.
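The reconnect storm described above is typically prevented with capped exponential backoff plus jitter, so that clients spread out their retries instead of reconnecting in synchronized waves. A minimal sketch of that pattern (hypothetical, not the actual SDK code; function name and parameters are illustrative):

```javascript
// Hypothetical sketch: capped exponential backoff with full jitter,
// the standard remedy for reconnect storms. Not the actual SDK code.
function reconnectDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  // Uncapped delay grows 1s, 2s, 4s, ... with each failed attempt,
  // but is clamped at capMs so retries never become arbitrarily slow.
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, exp) so clients that failed
  // at the same instant do not all reconnect at the same instant.
  return Math.random() * exp;
}
```

The buggy behavior was effectively the opposite: retry frequency increased under failure, amplifying load exactly when the service had the least capacity.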
Mitigation
We immediately implemented rate limiting on the affected service to reduce saturation and break the crash-retry loop.
Once the service stabilized, streaming was restored within 30 minutes of mitigation.
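Rate limiting of the kind used here is commonly implemented as a token bucket: each incoming /stream connection consumes a token, tokens refill at a steady rate, and requests arriving when the bucket is empty are rejected so clients back off. A minimal sketch under those assumptions (class and parameter names are hypothetical, not the service's actual implementation):

```javascript
// Hypothetical sketch of a token-bucket limiter for shedding excess
// /stream connection attempts; not the service's actual code.
class TokenBucket {
  constructor(ratePerSec, burst) {
    this.ratePerSec = ratePerSec; // steady-state refill rate
    this.capacity = burst;        // maximum burst size
    this.tokens = burst;          // start full
    this.last = Date.now();       // timestamp of last refill
  }

  // Returns true if the request may proceed, false if it should be shed.
  allow(now = Date.now()) {
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // bucket empty: reject so the client backs off
  }
}
```

Shedding the excess connections at the edge breaks the crash-retry loop: pods stay within memory limits, and rejected SDKs retry later rather than piling onto a restarting service.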
Next Steps & Preventative Measures
Rate Limiting Policy Enforcement: Permanent stream-level rate limiting is being added to protect the service from similar high-volume events in the future.
Customer SDK Outreach: We are resuming targeted outreach to customers using legacy Node SDK versions with known reconnection issues to ensure timely upgrades.
Posted Aug 01, 2025 - 13:25 PDT
Resolved
This incident has been resolved. An RCA will follow.
Posted Jul 07, 2025 - 11:12 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 07, 2025 - 10:46 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 07, 2025 - 10:37 PDT
This incident affected: Prod 2 (Feature Flags (FF)).