Feature Flag services are not responding

Incident Report for Harness

Postmortem

Summary

On Thursday, between 14:32 UTC and 14:46 UTC, customers started receiving 502 error codes in response to some SDK authentication and evaluation requests. Users were also unable to toggle flags via the admin user interface during this time.

 

Total Downtime

  • Downtime taken: 14 min
  • Resolution time*: 14 min
  • *Resolution time = time from when the incident was reported to when service was restored, either through a rollback or a hotfix (HF)

Immediate Resolution

  • Scaled down FF services to reduce the number of clients connected to the Redis instance, until CPU utilisation dropped below 100% (a sketch of this scaling step follows this list)
  • Rolled back the recent CDN rollout for all customers
    • Caching is now handled at the server level
  • Scaled FF services back up to normal levels
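
For illustration only, here is a minimal sketch of the scale-down/scale-up step using the official Kubernetes Python client. The deployment names match the timeline below; the namespace, replica counts, and helper function are assumptions, not the actual runbook.

```python
# Hypothetical sketch: scale the FF deployments down to shed load on Redis,
# then back up once CPU utilisation recovers. Namespace/replica values are illustrative.
from kubernetes import client, config


def scale_deployment(name: str, namespace: str, replicas: int) -> None:
    """Patch the replica count of a deployment."""
    apps = client.AppsV1Api()
    apps.patch_namespaced_deployment_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )


if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    # Step 1: reduce the number of clients connected to Redis.
    for deployment in ("ff-admin-server", "ff-client-server"):
        scale_deployment(deployment, namespace="prod-2", replicas=2)
    # Step 2 (once Redis CPU drops below 100%): restore the previous replica counts,
    # e.g. scale_deployment("ff-client-server", namespace="prod-2", replicas=6)
```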

Timeline

All times are in UTC

Time | Event | Notes
Sep 21, 2023, 2:32:23 PM | First synthetic check failed |
Sep 21, 2023, 2:43:06 PM | First alert came through the ff-alerts channel, reporting a failed Uptime check |
 | ff-admin-server pods started to restart in Prod 2 |
 | ff-client-server pods were in a not-ready state |
 | SRE jumped on to start debugging the issue |
 | Quickly identified that Redis CPU had been at 100% for a period |
 | Logs from the restarting pods contained error messages indicating that ff-admin-server was unable to connect to the Redis instance |
 | Scaled the ff-admin-server and ff-client-server deployments down to 2 pods to reduce load on Redis |
 | Once the Redis CPU dropped from 100%, pods were brought back up to previous levels |
Sep 21, 2023, 2:46:23 PM | Incident raised as a FH |
Sep 21, 2023, 2:46:23 PM | Last failed health check reported |
Sep 21, 2023, 2:46:23 PM | Manual testing on the Prod 2 demo app worked as expected |
Sep 21, 2023, 2:46:23 PM | Decision to revert the CDN rollout for all customers | The CDN was using an expensive query that was causing the Redis CPU to spike

RCA

This RCA follows the same pattern as the one for the same issue that occurred on Friday, September 15th. A permanent fix is expected to be deployed to production on September 22nd; it resolves the Redis ‘keys’ command issue described below.

The RCA below is split into two parts:

  1. Why the Feature Flag admin service restarted, preventing users from toggling flags.
  2. Why SDK requests failed with 502.

Admin Server Restarts

Why 1 - why did the FF admin service start to restart

The admin service's liveness probe started failing. As a result, Kubernetes started to restart the service.

Why 2 - why did the liveness probe start to fail

The probe uses the admin service's healthcheck endpoint, which checks that all downstream services are available and responding. Requests to the Redis service had started to time out.
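
To make the failure mode concrete, here is a minimal sketch of this pattern, assuming a Flask-style healthcheck and the redis-py client; the endpoint, host name, and timeout are illustrative, not the actual Harness implementation.

```python
# Hypothetical sketch: the liveness probe calls /health, and /health pings Redis.
# When Redis is saturated, the ping exceeds the client timeout, /health returns 503,
# the liveness probe fails, and Kubernetes restarts an otherwise healthy pod.
import redis
from flask import Flask, jsonify

app = Flask(__name__)
cache = redis.Redis(host="ff-redis", port=6379, socket_timeout=0.5)  # timeout is illustrative


@app.route("/health")
def health():
    try:
        cache.ping()  # downstream dependency check
    except redis.exceptions.TimeoutError:
        return jsonify({"redis": "unreachable"}), 503  # probe failure -> pod restart
    return jsonify({"redis": "ok"}), 200


if __name__ == "__main__":
    app.run(port=8080)
```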

Why 3 - why did the Redis health check start to fail

The Redis service was reporting 100% CPU usage. While it was still processing commands, latency increased, which resulted in many calls timing out.

Why 4 - why did the Redis CPU spike to 100%

We recently upgraded from Redis v5 to Redis v6. Redis v6 is multi-threaded and can perform operations concurrently.

Why 5 - why did running more commands concurrently increase CPU usage to 100%

One of the Redis calls used by the Feature Flags CDN solution is ‘keys’. This returns keys based on a pattern match. Although the command takes a pattern, its execution time increases linearly with the total number of keys in the cache, not just with the number of matches. During the peak part of the day, as the number of keys in the cache grew, the command eventually took longer than the configured timeout to return.
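
A minimal sketch of the expensive call, using redis-py with an assumed key pattern and timeout (the real key layout is not shown here):

```python
# Hypothetical sketch: KEYS walks the entire keyspace, so its cost is O(N) in the total
# number of keys, not the number of matches. With a client-side socket timeout, a large
# production keyspace turns the call into a TimeoutError instead of a result.
import redis

cache = redis.Redis(host="ff-redis", port=6379, socket_timeout=0.5)

try:
    # Fast with the few thousand keys used in load tests, but slower than the timeout
    # once the production cache holds a few hundred thousand keys.
    matching = cache.keys("ff:etag:*")
except redis.exceptions.TimeoutError:
    matching = None  # the CDN validation request fails instead of being served
```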

 

Evaluation Service Failures

Why 1 - why did the evaluation service start returning 502 errors

The logs indicated that requests were starting to time out. As a result, a 502 was returned.

Why 2 - why did requests time out

During the authentication and evaluation flows, Feature Flags uses Redis to cache data, which allows us to scale the service. Some of the Redis queries started to time out.

Why 3 - why did Redis calls time out

The Redis service CPU had spiked to 100%. Normally, if we cannot find an item in the cache, we fall through to the underlying database; in this case, however, the Redis request timed out rather than returning a cache miss, so we never fell back to the underlying database to load the data.
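
A minimal sketch of that cache-aside path, with illustrative key names and a placeholder database helper (not the actual implementation): only a cache miss falls through to the database, while a Redis timeout raises and surfaces as a 502.

```python
# Hypothetical sketch: the fallback only triggers on a miss (None); a TimeoutError from
# a saturated Redis propagates up the call stack and the request is answered with a 502.
# Catching redis.exceptions.TimeoutError here and falling back to the database is the
# architectural change proposed in the follow-up actions below.
import redis

cache = redis.Redis(host="ff-redis", port=6379, socket_timeout=0.5)


def fetch_from_database(identifier: str) -> bytes:
    """Placeholder for the real database read."""
    return b"{}"


def load_flag_config(identifier: str) -> bytes:
    cached = cache.get(f"ff:config:{identifier}")  # raises TimeoutError when Redis is saturated
    if cached is None:
        # Cache miss: fall through to the database and repopulate the cache.
        value = fetch_from_database(identifier)
        cache.set(f"ff:config:{identifier}", value, ex=300)
        return value
    return cached
```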

Why 4 - why did the Redis CPU spike to 100%

(see Why 4 under Admin Server Restarts)

Why 5 - why did running more commands concurrently increase CPU usage to 100%

(see Why 5 under Admin Server Restarts)

 

Why was this not identified during testing

Two use cases are load tested prior to release:

  • Serving requests from CDN

The part of the solution that uses the expensive Redis call runs on requests served out of the CDN, to guarantee that the requested ETag is still up to date. This typically operates over a small set of keys, so load testing used only a few thousand keys.

 

  • Performing live evaluations

When testing evaluation calls, the cache normally holds a few hundred thousand keys, but this flow does not use the expensive Redis call.

These flows are tested in isolation in our testing environment, but the issue only manifests when combining the two scenarios.

Follow up actions

  • Update the admin server Kubernetes probes. A downstream service failure should not trigger restarts.
  • Replace the Redis ‘keys’ command with a call that completes in O(1) time. This ensures consistent response times regardless of how large the cache grows (see the sketch after this list).
  • Review the architecture. Today, if the Redis call times out, the overall request fails. We should still be able to fall back to the underlying database; although this will increase response times, it will allow the solution to continue to serve traffic even while some services are suffering a complete outage.
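
As an illustration of the second follow-up action, here is a sketch of what an O(1) lookup could look like using redis-py; the key layout and function names are assumptions, not the shipped fix.

```python
# Hypothetical sketch: instead of pattern-matching the whole keyspace with 'keys', store
# the current ETag for each environment under a deterministic key and read it back with
# GET, which is O(1) regardless of how many keys the cache holds.
import redis

cache = redis.Redis(host="ff-redis", port=6379, socket_timeout=0.5)


def record_etag(environment: str, etag: str) -> None:
    """Write path: update the per-environment ETag whenever flag state changes."""
    cache.set(f"ff:etag:{environment}", etag)


def etag_is_current(environment: str, requested_etag: str) -> bool:
    """Read path used by CDN validation: a single O(1) lookup instead of a keyspace scan."""
    current = cache.get(f"ff:etag:{environment}")
    return current is not None and current.decode() == requested_etag
```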
Posted Sep 21, 2023 - 14:26 PDT

Resolved

We noticed spikes in Redis that caused the Feature Flag service to become inoperative. We have already resolved the problem and are monitoring our services!

Get Ship Done!
Posted Sep 21, 2023 - 07:59 PDT

Investigating

We noticed a Redis CPU spike on FF that caused FF services to stop responding, and we are already working to identify the issue.
Posted Sep 21, 2023 - 07:53 PDT
This incident affected: Prod 2 (Feature Flags (FF)).