Pipeline executions are failing in Prod-1/2
Incident Report for Harness
Postmortem

What was the issue?

Pipeline executions were stuck and failing due to a Redis instance in the primary region (us-west1) becoming unresponsive. Impact was limited to pipeline executions in NextGen.

Timeline

28 May - 8:56 PM PDT: We received an alert for high memory utilization on our Redis instance. The team identified a particularly large cache key and attempted to delete it manually. This caused the Redis instance to become unstable, and pipeline executions began failing or hanging. We are still working with our Redis service provider to understand why the key deletion made the instance unresponsive; such deletions have been performed previously without issue, consistent with vendor advice.
28 May - 9:15 PM PDT: We engaged the Redis support team and executed a mitigation plan.
28 May - 9:40 PM PDT: We failed the application over to the secondary Redis instance and the system started to recover. Unfortunately, the secondary region also went into a bad state as load migrated to it.
28 May - 10:02 PM PDT: The primary Redis instance became healthy.
28 May - 10:25 PM PDT: Application traffic was migrated back to the primary Redis region, restoring functionality.

RCA & Action Items:

Pipeline executions started failing because the Redis instance in the primary region became unresponsive after the manual deletion of a large cache key. We are working with the vendor to determine why this action caused instability and will share more details as they become available.

As an immediate action item, we are implementing an upper bound on cache key size, which will prevent large cache keys from driving memory usage to unsafe levels; a sketch of this guard follows below. In the longer term, we are revisiting our architecture to eliminate large keys in Redis altogether.
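For illustration, a minimal sketch of such a size guard, assuming a Jedis-based cache client; the class name BoundedCacheWriter, its put method, and the 1 MiB threshold are hypothetical and do not describe Harness internals:

```java
import java.nio.charset.StandardCharsets;

import redis.clients.jedis.Jedis;

// Illustrative guard: reject cache writes whose serialized payload exceeds an
// upper bound, so a single oversized entry cannot push the Redis instance into
// high memory usage.
public class BoundedCacheWriter {

    // Hypothetical limit; the real threshold would be tuned to the instance size.
    private static final int MAX_VALUE_BYTES = 1024 * 1024; // 1 MiB

    private final Jedis jedis;

    public BoundedCacheWriter(Jedis jedis) {
        this.jedis = jedis;
    }

    public void put(String key, String value) {
        int size = value.getBytes(StandardCharsets.UTF_8).length;
        if (size > MAX_VALUE_BYTES) {
            // Fail fast (or log and skip caching) instead of storing an oversized entry.
            throw new IllegalArgumentException(
                "Cache entry for '" + key + "' is " + size
                    + " bytes, which exceeds the limit of " + MAX_VALUE_BYTES);
        }
        jedis.set(key, value);
    }
}
```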

Posted May 29, 2024 - 11:21 PDT

Resolved
Harness services are now stable, and our internal sanity check has passed. We will publish more details as soon as our vendor partner, Redis, shares the RCA with us. We apologize for the disruption to service.
Posted May 29, 2024 - 00:21 PDT
Monitoring
Service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure performance remains normal. Thank you for your patience!
Posted May 28, 2024 - 22:37 PDT
Update
We have identified the issue to be with the Redis cache. We are working with the vendor on a fix, and the team is treating this with the utmost urgency.
Posted May 28, 2024 - 21:15 PDT
Identified
Pipeline executions are failing in Prod-1/2 due to a dependency failure.
Posted May 28, 2024 - 21:06 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Security Testing Orchestration (STO)) and Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Security Testing Orchestration (STO)).