Customers unable to access Harness on Prod4 Cluster

Incident Report for Harness

Postmortem

Summary:

Customers experienced login failures with 5xx errors on the Prod4 cluster.

What was the issue?

The Harness platform uses a managed memStore internally. The memStore experienced a “Host error”, which triggered a master switchover within seconds. Backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services; Go services reconnected properly.

Timeline:

Time	Event
21 August 4:06:41 PM UTC	Primary memStore went down
21 August 4:06:41 PM UTC	Harness services experienced RedisResponseTimeoutException
21 August 4:07:00 PM UTC	Secondary memStore promoted to Primary
21 August 4:14:30 PM UTC	Harness services restored connectivity to new Primary
21 August 4:14:53 PM UTC	New memStore instance added and promoted as Secondary

Resolution:

After about 8 minutes, services reconnected to the new primary memStore on their own and the platform recovered.

RCA

Java services use the Redisson library to connect to memStore. The established connection pool does not detect that the endpoint has gone away, so these connections are only dropped once they eventually time out. This does not happen during a graceful failover; it occurs only after a catastrophic failure.

Action Item

Detect this kind of catastrophic failure and have services reconnect more quickly.
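
One way to picture this action item is a pool that probes each connection against a short deadline and replaces any connection that does not answer. The sketch below is illustrative only and uses plain JDK classes, not Redisson. The names Conn, ProbedPool, FailoverDemo and the 200 ms deadline are assumptions made for the example, and the report does not say how Harness implemented the fix. Redisson has its own settings in this area, such as pingConnectionInterval, and the report does not say whether they were changed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical connection abstraction for this sketch; not a Redisson type. */
interface Conn {
    /** Sends a cheap probe such as PING; may block if the peer vanished. */
    void ping() throws Exception;
    void close();
}

/** Pool that bounds each probe with a deadline and replaces connections that miss it. */
final class ProbedPool {
    private final List<Conn> conns = new ArrayList<>();
    private final Supplier<Conn> factory;
    private final long probeTimeoutMs;
    private final ExecutorService probes = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "conn-probe");
        t.setDaemon(true); // probe threads must not keep the JVM alive
        return t;
    });

    ProbedPool(Supplier<Conn> factory, int size, long probeTimeoutMs) {
        this.factory = factory;
        this.probeTimeoutMs = probeTimeoutMs;
        for (int i = 0; i < size; i++) conns.add(factory.get());
    }

    /** Probes every connection; returns how many were stale and got replaced. */
    synchronized int probeAndReplace() {
        int replaced = 0;
        for (int i = 0; i < conns.size(); i++) {
            Conn c = conns.get(i);
            if (!isAlive(c)) {
                c.close();
                conns.set(i, factory.get()); // reconnect via the current endpoint
                replaced++;
            }
        }
        return replaced;
    }

    private boolean isAlive(Conn c) {
        Future<?> probe = probes.submit(() -> { c.ping(); return null; });
        try {
            probe.get(probeTimeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            probe.cancel(true); // no reply within the deadline, or the probe failed
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            probe.cancel(true);
            return false;
        }
    }
}

final class FailoverDemo {
    /** Fake connection bound to one host; ping hangs once that host is gone. */
    static Conn connTo(AtomicBoolean hostUp) {
        return new Conn() {
            public void ping() throws Exception {
                if (!hostUp.get()) Thread.sleep(60_000); // no reply and no reset
            }
            public void close() { }
        };
    }

    /** Returns {replaced on first sweep, replaced on second sweep}. */
    static int[] simulateHostFailure() {
        AtomicBoolean oldPrimaryUp = new AtomicBoolean(true);
        AtomicBoolean newPrimaryUp = new AtomicBoolean(true);
        AtomicReference<AtomicBoolean> endpoint = new AtomicReference<>(oldPrimaryUp);
        ProbedPool pool = new ProbedPool(() -> connTo(endpoint.get()), 3, 200);

        oldPrimaryUp.set(false);    // host error: old primary disappears abruptly
        endpoint.set(newPrimaryUp); // endpoint now resolves to the promoted secondary

        return new int[] { pool.probeAndReplace(), pool.probeAndReplace() };
    }

    public static void main(String[] args) {
        int[] r = simulateHostFailure();
        System.out.println("replaced on first sweep: " + r[0] + ", on second sweep: " + r[1]);
    }
}
```

In this simulation the first sweep replaces all three connections that were bound to the failed host, and the second sweep replaces none. With a deadline-bound probe, the detection delay depends on the probe interval and deadline instead of on a long response timeout.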

Posted Sep 04, 2024 - 10:33 PDT

Resolved

We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.

Posted Aug 21, 2024 - 09:14 PDT

Investigating

We are currently investigating this issue.

Posted Aug 21, 2024 - 09:06 PDT

This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Mac Cloud Builds).