Prod2: Feature flag functionality is currently degraded

Incident Report for Harness

Postmortem

Incident Summary:

SDK streaming for Feature Flags returned 502 errors from approximately 18:50 UTC until the pods stabilized at 19:40 UTC.

This was caused by an issue with the autoscaling rules on the streaming service. The service did not scale up as load increased, which caused the pods to run out of memory and restart.

Once identified by the team, the cloud engineer on call was able to manually scale up the service, and the service was restored.

Timeline
Time (UTC)  Event
18:40  Alert fires due to pod restarts. On-call engineer is paged and begins investigation.
18:50  Service enters a degraded state, returning an increased number of errors for stream requests.
18:57  Issue identified as a large and growing volume of /stream requests causing streaming service pods to run out of memory and restart.
19:15  Service is manually scaled up.
19:35  Rate limiting rule applied and operational.
19:40  Pods are stable; request rates return to normal.
19:44  Incident resolved.

Root Cause Analysis:

The incident originated from a broken autoscaling rule on Harness' Prod 2 environment, causing the Feature Flag streaming service to go into a degraded state.
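
For readers unfamiliar with how an autoscaling rule turns load into a replica count, the following is a minimal sketch. It assumes a Kubernetes Horizontal Pod Autoscaler style calculation; this report does not describe the actual rule, and the function name, metric values, and replica bounds are all hypothetical.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Return the replica count an HPA-style rule would request.

    Simplified: ignores the tolerance band and stabilization windows
    that a real autoscaler applies.
    """
    # Scale the replica count by how far the observed metric is from target.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (hypothetical values in the usage below).
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 pods at 110% memory utilization against an 80% target.
    print(desired_replicas(4, 110, 80, min_replicas=2, max_replicas=10))
```

If a rule like this is misconfigured (for example, it watches the wrong metric or has too low a maximum), the replica count stays flat while per-pod load keeps rising, which matches the behavior described in this report.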

The pods began to run out of memory and CPU, causing existing streams to be disconnected. The disconnected clients reconnected, which increased traffic and led to a cascading failure of the available streaming pods.
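
The cascade described above can be illustrated with a deliberately simple model. This is a sketch under stated assumptions, not a reconstruction of the incident: streams are spread evenly across healthy pods, an overloaded pod restarts, and its streams immediately reconnect to the remaining pods. All numbers are hypothetical.

```python
def simulate_cascade(total_streams, pods, capacity_per_pod):
    """Return the number of healthy pods after each step of the cascade.

    Toy model: while the even share of streams per healthy pod exceeds
    the per-pod capacity, one pod restarts and its streams move to the rest.
    """
    healthy = pods
    steps = [healthy]
    while healthy > 0 and total_streams / healthy > capacity_per_pod:
        healthy -= 1  # one overloaded pod runs out of memory and restarts
        steps.append(healthy)
    return steps


if __name__ == "__main__":
    # Slightly over capacity with 4 pods: every pod eventually fails.
    print(simulate_cascade(4400, pods=4, capacity_per_pod=1000))
    # One extra pod keeps the per-pod share under capacity: no failures.
    print(simulate_cascade(4400, pods=5, capacity_per_pod=1000))
```

In this model a small shortfall in capacity is enough to take down every pod, while one additional pod prevents the cascade entirely, which is why a missed scale-up can have an outsized effect.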

Immediate Resolution:

To address the incident promptly, the team increased the resource capacity of the affected service and applied rate limiting to the streaming service, allowing SDKs to reconnect gradually until the service was able to resume normal operations.
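
As an illustration of how rate limiting lets clients reconnect gradually, here is a minimal token-bucket sketch. The report does not state which rate limiting mechanism or limits were used, so the class name, rate, and burst size below are assumptions chosen for the example.

```python
import time


class TokenBucket:
    """Admit at most `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, now=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self._now = now              # injectable clock, useful for testing
        self.tokens = float(burst)   # start with a full bucket
        self.last = now()

    def allow(self):
        """Return True if a new stream connection may be accepted."""
        t = self._now()
        # Refill based on elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would reject the request, e.g. with HTTP 429


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2, burst=3)
    # A burst of 5 connection attempts: only the first 3 are admitted.
    print([bucket.allow() for _ in range(5)])
```

Rejected clients retry later, so reconnects are spread over time instead of arriving at the recovering pods all at once.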

Preventive Measures:

To prevent similar incidents in the future, we are addressing the needed improvements. In the meantime, resources have been increased in the affected cluster.

Action Items:
The faulty autoscaling rule has been identified and is being resolved.

Posted Aug 01, 2025 - 10:31 PDT

Resolved

This incident has been resolved.
Posted Jul 29, 2025 - 13:09 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 29, 2025 - 12:48 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 29, 2025 - 12:47 PDT

Investigating

We are currently investigating the issue.
Posted Jul 29, 2025 - 12:33 PDT
This incident affected: Prod 2 (Feature Flags (FF)).