On Thursday, for a short period between 14:32 UTC and 14:46 UTC, customers started receiving 502 error codes in response to some SDK authentication and evaluation requests. Users were also unable to toggle flags via the admin user interface during this time.
Rolled back the recent CDN rollout for all customers.
Scaled FF services back up to normal levels.
All times are in UTC
| Time | Event | Notes |
|---|---|---|
| Sep 21, 2023, 14:32:23 | First synthetic check failed | |
| Sep 21, 2023, 14:43:06 | First alert came through the ff-alerts channel, reporting a failed uptime check | |
| | ff-admin-server Pods started to restart in Prod 2 | |
| | ff-client-server Pods in a not-ready state | |
| | SRE jumped on to start debugging the issue | |
| | Quickly identified that Redis CPU had been at 100% for a period | |
| | Logs from the restarting Pods showed error messages indicating that ff-admin-server was unable to connect to the Redis instance | |
| | Scaled the ff-admin-server and ff-client-server deployments down to 2 Pods each to reduce load on Redis | |
| | Once Redis CPU dropped from 100%, Pods were scaled back up to previous levels | |
| Sep 21, 2023, 14:46:23 | Incident raised as an FH | |
| Sep 21, 2023, 14:46:23 | Last failed health check reported | |
| Sep 21, 2023, 14:46:23 | Manual testing of the Prod2 demo app worked as expected | |
| Sep 21, 2023, 14:46:23 | Decision made to revert the CDN rollout for all customers | The CDN solution was using an expensive Redis query that caused the Redis CPU spike |
This RCA follows the same pattern as the RCA for the same issue that occurred on Friday, September 15th. A permanent fix for the problem is expected to be deployed to production on September 22nd; it resolves the Redis KEYS command issue described below.
The RCA below has been split into two parts: the admin service failures, and the SDK authentication and evaluation failures.
Admin service failures

Why 1: The admin service's liveness probe started failing. As a result, Kubernetes started restarting the service.
Why 2: The probe uses the admin service's healthcheck endpoint, which checks that all downstream services are available and responding. Requests to the Redis service had started to time out.
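For illustration, here is a minimal sketch of a healthcheck endpoint of this shape, assuming a Go service using the go-redis client. The handler name, route, and timeout value are hypothetical and not taken from the actual ff-admin-server code:

```go
package main

import (
	"context"
	"net/http"
	"time"

	"github.com/redis/go-redis/v9"
)

// healthHandler reports healthy only if the downstream Redis instance
// answers a PING within the deadline. A liveness probe pointed at this
// endpoint fails (and Kubernetes restarts the Pod) whenever Redis calls
// start timing out.
func healthHandler(rdb *redis.Client) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical 2-second budget for the downstream check.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		if err := rdb.Ping(ctx).Err(); err != nil {
			http.Error(w, "redis unavailable: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	http.HandleFunc("/healthz", healthHandler(rdb))
	http.ListenAndServe(":8080", nil)
}
```

Because the healthcheck aggregates downstream checks, a sustained Redis timeout surfaces as a probe failure, which is why the Pods went into restarts rather than the problem staying confined to slow requests.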
Why 3: The Redis service was reporting 100% CPU usage. While it was still processing commands, latency increased, which resulted in many calls timing out.
Why 4: We recently upgraded from Redis v5 to Redis v6. Redis v6 is multi-threaded and can perform operations concurrently.
Why 5: One of the Redis calls used by the Feature Flags CDN solution is KEYS, which returns keys matching a pattern. Although the command takes a pattern, Redis has to scan the entire keyspace to evaluate it, so execution time grows linearly with the total number of keys in the cache. During the peak part of the day, as the number of keys in the cache grew, the command eventually took longer than the configured timeout to return.
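To make the cost concrete, the sketch below contrasts a blocking KEYS call with an incremental SCAN over the same pattern, again assuming Go and go-redis v9. Replacing KEYS with SCAN is one common mitigation for this failure mode; whether it matches the permanent fix mentioned above is an assumption, and the key pattern is illustrative:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// KEYS walks the whole keyspace in a single blocking command, so its
	// runtime is O(total keys) regardless of how few keys actually match.
	keys, err := rdb.Keys(ctx, "etag:*").Result()
	if err != nil {
		fmt.Println("KEYS failed:", err)
	}
	fmt.Println("KEYS matched", len(keys))

	// SCAN returns the same matches in small batches via a cursor, letting
	// Redis serve other commands between iterations instead of blocking.
	var cursor uint64
	var matched int
	for {
		batch, next, err := rdb.Scan(ctx, cursor, "etag:*", 1000).Result()
		if err != nil {
			fmt.Println("SCAN failed:", err)
			break
		}
		matched += len(batch)
		cursor = next
		if cursor == 0 { // cursor 0 means the iteration is complete
			break
		}
	}
	fmt.Println("SCAN matched", matched)
}
```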
SDK authentication and evaluation failures

Why 1: The logs indicated that requests were starting to time out. As a result, a 502 was returned.
Why 2: During the authentication and evaluation flows, Feature Flags uses Redis to cache data, which allows us to scale the service. Some of these Redis queries started to time out.
Why 3: The Redis service's CPU had spiked to 100%. Normally, if we cannot find an item in the cache we fall through to the underlying database, but in this case the requests timed out rather than missing, so we never hit the underlying database to load the data (a sketch of this cache-read path is shown below).
Why 4: (see why 4 from the admin service failures)
Why 5: (see why 5 from the admin service failures)
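The distinction between a cache miss and a cache timeout is the crux of why 3 above. Below is a minimal cache-aside read sketch in Go that mirrors the behaviour described: only a genuine miss (redis.Nil) falls through to the database, while a timeout propagates as an error and eventually surfaces as a 502. The function and type names are hypothetical, not the actual Feature Flags code:

```go
package cache

import (
	"context"
	"errors"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// Store is a hypothetical stand-in for the underlying database.
type Store interface {
	LoadFlag(ctx context.Context, key string) (string, error)
}

// GetFlag reads a flag from the Redis cache, falling through to the
// database only on a genuine cache miss. A Redis timeout is returned as
// an error, so the caller never reaches the database, which is why the
// evaluation requests surfaced as 502s instead of slower DB-backed reads.
func GetFlag(ctx context.Context, rdb *redis.Client, db Store, key string) (string, error) {
	val, err := rdb.Get(ctx, key).Result()
	switch {
	case err == nil:
		return val, nil // cache hit
	case errors.Is(err, redis.Nil):
		return db.LoadFlag(ctx, key) // cache miss: fall through to the DB
	default:
		// Timeouts and other Redis errors land here and bubble up.
		return "", fmt.Errorf("cache read failed: %w", err)
	}
}
```

One possible hardening, an assumption rather than the deployed fix, would be to treat transient Redis errors as a miss and fall through to the database, at the cost of pushing extra load onto it during a Redis incident.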
Two use cases were load tested prior to release:
1. The CDN flow. The part of the solution that uses the expensive Redis call runs when requests served out of the CDN must verify that the requested ETag is still up to date. This flow typically operates over a small set of keys, so the load tests used only a few thousand keys.
2. The evaluation flow. Evaluation calls are normally tested against a cache containing a few hundred thousand keys, but this flow does not use the expensive Redis call.
These flows were tested in isolation in our testing environment, but the issue only manifests when the two scenarios are combined: the expensive KEYS call from the CDN flow running against the large keyspace produced by the evaluation flow.
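A combined load test could have surfaced the issue by seeding an evaluation-sized keyspace before exercising the CDN pattern query. The sketch below, with hypothetical key names and counts, seeds a large keyspace and times the pattern-matching call so the linear growth becomes visible:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Seed a keyspace on the scale of the evaluation flow (hypothetical
	// count and key shape), not just the few thousand keys the CDN flow
	// was originally load tested with.
	pipe := rdb.Pipeline()
	for i := 0; i < 300_000; i++ {
		pipe.Set(ctx, fmt.Sprintf("eval:flag:%d", i), "on", 0)
	}
	if _, err := pipe.Exec(ctx); err != nil {
		panic(err)
	}

	// Time the CDN-style pattern query against the full keyspace. The
	// duration grows with the total number of keys, not the match count.
	start := time.Now()
	matches, err := rdb.Keys(ctx, "etag:*").Result()
	if err != nil {
		panic(err)
	}
	fmt.Printf("KEYS over ~300k keys matched %d in %s\n", len(matches), time.Since(start))
}
```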