Incident Summary:
Due to an increase in traffic, a period of high latency was experienced on the Feature Flag metrics service.
The service was unable to scale up automatically and quickly enough to handle the additional load, which caused it to become slow and return errors.
Once the issue was identified by the team, the cloud engineer manually scaled up the service and normal operation was restored.
Timeline:
| Time (UTC) | Event |
|---|---|
| 18:11 | Large number of requests seen coming through the network |
| 18:14 | Service enters a degraded state, returning increased errors and latency |
| 18:14 | On-call engineer is alerted and begins investigation |
| 18:24 | Service is manually scaled up to handle the load |
| 18:24 | Development team begin RCA |
| 18:41 | All requests return to normal operational behaviour |
| 18:41 | Incident resolved |
Root Cause Analysis:
The incident originated from an increased rate of requests in the Prod 1 environment, which put the Feature Flag metrics service into a degraded state.
While the service has auto-scaling capabilities in place, the suddenness and size of the traffic increase meant that automated scaling could not keep pace, and manual intervention was required.
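The gap between how quickly traffic rose and how quickly the autoscaler could add capacity is the core of the root cause. The sketch below uses hypothetical numbers (request rates, per-replica throughput, and scaling step are all assumptions, not figures from this incident) to illustrate how step-wise reactive scaling can leave a shortfall for several evaluation intervals after a sudden spike.

```python
# A minimal back-of-envelope sketch (all numbers are hypothetical, not taken
# from the incident data) of why reactive auto-scaling can lag a sudden spike:
# capacity only grows by a fixed step each evaluation interval, so traffic above
# current capacity goes unserved (errors/latency) until enough replicas exist.

BASELINE_RPS = 200        # steady-state request rate (assumed)
SPIKE_RPS = 1200          # request rate during the spike (assumed)
RPS_PER_REPLICA = 100     # throughput one replica can sustain (assumed)
SCALE_STEP = 2            # replicas the autoscaler adds per evaluation (assumed)

replicas = BASELINE_RPS // RPS_PER_REPLICA   # 2 replicas at steady state

# Each loop iteration represents one autoscaler evaluation interval.
for interval in range(8):
    capacity = replicas * RPS_PER_REPLICA
    unserved = max(0, SPIKE_RPS - capacity)
    print(f"interval {interval}: {replicas} replicas, "
          f"capacity {capacity} rps, unserved {unserved} rps")
    if unserved > 0:
        replicas += SCALE_STEP   # scaling reacts only after the interval elapses
```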
Immediate Resolution:
To address the incident promptly, the team increased the resource capacity of the affected service until it was able to resume normal operations.
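For context on what a manual scale-up of this kind can look like, the following is a minimal sketch assuming the service runs as a Kubernetes Deployment managed with the official Python client; the platform is not named in this report, and the deployment name, namespace, and replica count below are hypothetical.

```python
# Hypothetical manual scale-up of a Kubernetes Deployment (platform and names
# are assumptions, not taken from the report).
from kubernetes import client, config

config.load_kube_config()            # authenticate using the local kubeconfig
apps = client.AppsV1Api()

# Patch the Deployment's replica count directly to absorb the extra load.
apps.patch_namespaced_deployment_scale(
    name="feature-flag-metrics",     # hypothetical Deployment name
    namespace="prod-1",              # hypothetical namespace
    body={"spec": {"replicas": 10}}, # hypothetical target replica count
)
```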
Preventive Measures:
To prevent similar incidents while the team are working on longer-term improvements, resources in the affected cluster have been adjusted to better handle sudden traffic spikes.
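As an illustration of this interim mitigation, the sketch below shows how an autoscaler floor could be raised so the service enters a spike with more headroom, again assuming a Kubernetes HorizontalPodAutoscaler (not confirmed by the report); the names and bounds are hypothetical.

```python
# Hypothetical adjustment of autoscaler bounds as an interim mitigation
# (platform, names, and numbers are assumptions, not taken from the report).
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

# Raise the minimum replica count so baseline capacity sits closer to spike load.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="feature-flag-metrics",                            # hypothetical HPA name
    namespace="prod-1",                                      # hypothetical namespace
    body={"spec": {"minReplicas": 6, "maxReplicas": 20}},    # hypothetical bounds
)
```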
Action Items:
We have identified a number of bottlenecks that contributed to the incident, and the development team are actively working on improvements to address them.