Incident Summary:
Due to an increase in traffic, a period of high latency was experienced on the Feature Flag metrics service.
The service was unable to scale up automatically and quickly enough to handle the additional load, which caused it to become slow and return errors.
Once the issue was identified by the team, the cloud engineer manually scaled up the service and normal operation was restored.
Timeline:
| Time (UTC) | Event |
|---|---|
| 18:11 | Large number of requests seen coming through the network |
| 18:14 | Service enters a degraded state, returning increased errors and latency |
| 18:14 | On-call engineer is alerted and begins investigation |
| 18:24 | Service is manually scaled up to handle the load |
| 18:24 | Development team begin RCA |
| 18:41 | All requests return to normal operational behaviour |
| 18:41 | Incident resolved |
Root Cause Analysis:
The incident originated from an increased rate of requests in the Prod 1 environment, which put the Feature Flag metrics service into a degraded state.
While the service has auto-scaling capabilities in place, the suddenness and size of the traffic increase meant that automated scaling could not keep pace, and manual intervention was required.
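The gap between how quickly traffic rose and how quickly the autoscaler could add capacity is the core of the root cause. The sketch below uses hypothetical numbers (request rates, per-replica throughput, and scaling step are all assumptions, not figures from this incident) to illustrate how step-wise reactive scaling can leave a shortfall for several evaluation intervals after a sudden spike.

```python
# A minimal back-of-envelope sketch (all numbers are hypothetical, not taken
# from the incident data) of why reactive auto-scaling can lag a sudden spike:
# capacity only grows by a fixed step each evaluation interval, so traffic above
# current capacity goes unserved (errors/latency) until enough replicas exist.

BASELINE_RPS = 200        # steady-state request rate (assumed)
SPIKE_RPS = 1200          # request rate during the spike (assumed)
RPS_PER_REPLICA = 100     # throughput one replica can sustain (assumed)
SCALE_STEP = 2            # replicas the autoscaler adds per evaluation (assumed)

replicas = BASELINE_RPS // RPS_PER_REPLICA   # 2 replicas at steady state

# Each loop iteration represents one autoscaler evaluation interval.
for interval in range(8):
    capacity = replicas * RPS_PER_REPLICA
    unserved = max(0, SPIKE_RPS - capacity)
    print(f"interval {interval}: {replicas} replicas, "
          f"capacity {capacity} rps, unserved {unserved} rps")
    if unserved > 0:
        replicas += SCALE_STEP   # scaling reacts only after the interval elapses
```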
Immediate Resolution:
To address the incident promptly, the team increased the resource capacity of the affected service until it was able to resume normal operations.
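For context on what a manual scale-up of this kind can look like, the following is a minimal sketch assuming the service runs as a Kubernetes Deployment managed with the official Python client; the platform is not named in this report, and the deployment name, namespace, and replica count below are hypothetical.

```python
# Hypothetical manual scale-up of a Kubernetes Deployment (platform and names
# are assumptions, not taken from the report).
from kubernetes import client, config

config.load_kube_config()            # authenticate using the local kubeconfig
apps = client.AppsV1Api()

# Patch the Deployment's replica count directly to absorb the extra load.
apps.patch_namespaced_deployment_scale(
    name="feature-flag-metrics",     # hypothetical Deployment name
    namespace="prod-1",              # hypothetical namespace
    body={"spec": {"replicas": 10}}, # hypothetical target replica count
)
```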
Preventive Measures:
To prevent similar incidents while the team are working on longer-term improvements, resources in the affected cluster have been adjusted to better handle sudden traffic spikes.
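As an illustration of this interim mitigation, the sketch below shows how an autoscaler floor could be raised so the service enters a spike with more headroom, again assuming a Kubernetes HorizontalPodAutoscaler (not confirmed by the report); the names and bounds are hypothetical.

```python
# Hypothetical adjustment of autoscaler bounds as an interim mitigation
# (platform, names, and numbers are assumptions, not taken from the report).
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

# Raise the minimum replica count so baseline capacity sits closer to spike load.
autoscaling.patch_namespaced_horizontal_pod_autoscaler(
    name="feature-flag-metrics",                            # hypothetical HPA name
    namespace="prod-1",                                      # hypothetical namespace
    body={"spec": {"minReplicas": 6, "maxReplicas": 20}},    # hypothetical bounds
)
```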
Action Items:
We have identified a number of bottlenecks that contributed to the incident, and the development team are actively working on improvements to address them.