What was the issue?
Pipeline executions were hanging or failing because a Redis instance in the primary region (us-west1) became unresponsive. Impact was limited to pipeline executions in NextGen.
Timeline
| Time | Event |
| --- | --- |
| 28 May - 8:56 PM PDT | We received an alert for high memory utilization on our Redis instance. The team identified a particularly large cache key and attempted to delete it manually, which caused the Redis instance to become unstable (a non-blocking alternative is sketched after the timeline). Pipeline executions began failing or hanging. We are still working with our Redis service provider to understand why the deletion made the instance unresponsive; similar deletions had been performed previously without issue, consistent with vendor advice. |
| 28 May - 9:15 PM PDT | We engaged the Redis support team and executed a mitigation plan. |
| 28 May - 9:40 PM PDT | We failed our application over to the secondary Redis instance, and the system began to recover. Unfortunately, the secondary region also entered a bad state as load migrated to it. |
| 28 May - 10:02 PM PDT | The primary Redis instance became healthy again. |
| 28 May - 10:25 PM PDT | Application traffic was migrated back to the primary Redis region, restoring functionality. |
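While the root cause is still under investigation with the vendor, one documented risk when deleting large keys is that `DEL` reclaims memory synchronously and can stall Redis's single-threaded event loop for the duration, whereas `UNLINK` (available since Redis 4.0) removes the key from the keyspace immediately and frees its memory in a background thread. A minimal sketch of the non-blocking approach using the redis-py client; the host and key names are hypothetical, and whether a blocking delete was the actual trigger here is exactly what we are confirming with the vendor:

```python
import redis

# Hypothetical connection details, for illustration only.
r = redis.Redis(host="redis-primary.us-west1.internal", port=6379)

LARGE_KEY = "pipeline:execution:cache"  # hypothetical key name

# Check how much memory the key holds before touching it.
size_bytes = r.memory_usage(LARGE_KEY)
print(f"{LARGE_KEY} occupies {size_bytes} bytes")

# DEL reclaims a key's memory synchronously, so deleting a very large
# key can block the event loop. UNLINK unlinks the key from the
# keyspace immediately and reclaims its memory in a background
# thread, avoiding the stall.
r.unlink(LARGE_KEY)
```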
RCA & Action Items
Pipeline executions began failing because the Redis instance in the primary region became unresponsive following the manual deletion of a large key. We are working with the vendor to determine why this action caused instability, and we will share more details when they become available.
As an immediate action item, we are implementing an upper bound on cache key size, which will prevent large cache keys from driving memory utilization to unsafe levels. In the longer term, we are revisiting our architecture to eliminate large keys in Redis altogether.
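A minimal sketch of how such a cap could be enforced at write time, assuming a redis-py client and JSON-serialized cache entries; the limit, names, and fallback behavior are illustrative, not the final implementation:

```python
import json

import redis

# Hypothetical connection details, for illustration only.
r = redis.Redis(host="redis-primary.us-west1.internal", port=6379)

# Hypothetical upper bound; the limit we ultimately adopt may differ.
MAX_VALUE_BYTES = 1 * 1024 * 1024  # 1 MiB

def cache_set(key: str, value, ttl_seconds: int = 3600) -> bool:
    """Write to the cache only if the serialized value is under the cap.

    Returns False (and skips the write) for oversized values, so a
    single large entry can no longer drive memory utilization toward
    the alert threshold.
    """
    payload = json.dumps(value).encode("utf-8")
    if len(payload) > MAX_VALUE_BYTES:
        # Callers fall back to recomputing or fetching from the source
        # of truth instead of caching an oversized entry.
        return False
    r.set(key, payload, ex=ttl_seconds)
    return True
```

Rejecting oversized writes outright, rather than truncating them, keeps the cache a strict subset of the source of truth and makes the failure mode a cache miss rather than corrupted data.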