At 17:30 UTC, the FF streaming infrastructure experienced a 6x spike in simultaneous connection attempts across customers. This overwhelmed the feature flag service, which is responsible for maintaining long-lived streaming connections, and triggered repeated pod crashes due to memory exhaustion.
During the incident:
Streaming was unavailable.
SDK polling remained unaffected, allowing flag updates to still propagate.
Root Cause
The issue resulted from a convergence of two known risks:
Traffic Surge Loop: The volume of incoming /stream connection requests exceeded service capacity. Once pods restarted, SDKs immediately attempted to reconnect, triggering another wave of load and creating a crash loop.
SDK Behavior (Node): A bug in Node SDK versions prior to 1.8.6 causes them to retry reconnection at an exponentially increasing rate when the stream endpoint is unreachable. This behavior amplified overall traffic, placing additional stress on the system.
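The reconnect storm described above is typically prevented with capped exponential backoff plus jitter, so that clients spread out their retries instead of reconnecting in synchronized waves. A minimal sketch of that pattern (hypothetical, not the actual SDK code; function name and parameters are illustrative):

```javascript
// Hypothetical sketch: capped exponential backoff with full jitter,
// the standard remedy for reconnect storms. Not the actual SDK code.
function reconnectDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  // Uncapped delay grows 1s, 2s, 4s, ... with each failed attempt,
  // but is clamped at capMs so retries never become arbitrarily slow.
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick uniformly in [0, exp) so clients that failed
  // at the same instant do not all reconnect at the same instant.
  return Math.random() * exp;
}
```

The buggy behavior was effectively the opposite: retry frequency increased under failure, amplifying load exactly when the service had the least capacity.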
Mitigation
We immediately implemented rate limiting on the affected service to reduce saturation and break the crash-retry loop.
Once the service stabilized, streaming was restored within 30 minutes of mitigation.
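Rate limiting of the kind used here is commonly implemented as a token bucket: each incoming /stream connection consumes a token, tokens refill at a steady rate, and requests arriving when the bucket is empty are rejected so clients back off. A minimal sketch under those assumptions (class and parameter names are hypothetical, not the service's actual implementation):

```javascript
// Hypothetical sketch of a token-bucket limiter for shedding excess
// /stream connection attempts; not the service's actual code.
class TokenBucket {
  constructor(ratePerSec, burst) {
    this.ratePerSec = ratePerSec; // steady-state refill rate
    this.capacity = burst;        // maximum burst size
    this.tokens = burst;          // start full
    this.last = Date.now();       // timestamp of last refill
  }

  // Returns true if the request may proceed, false if it should be shed.
  allow(now = Date.now()) {
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // bucket empty: reject so the client backs off
  }
}
```

Shedding the excess connections at the edge breaks the crash-retry loop: pods stay within memory limits, and rejected SDKs retry later rather than piling onto a restarting service.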
Next Steps & Preventative Measures
Rate Limiting Policy Enforcement: Permanent stream-level rate limiting is being added to protect the service from similar high-volume events in the future.
Customer SDK Outreach: We are resuming targeted outreach to customers using legacy Node SDK versions with known reconnection issues to ensure timely upgrades.
Posted Aug 01, 2025 - 13:25 PDT
Resolved
This incident has been resolved. An RCA will follow.
Posted Jul 07, 2025 - 11:12 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 07, 2025 - 10:46 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 07, 2025 - 10:37 PDT
This incident affected: Prod 2 (Feature Flags (FF)).