Customers unable to access Harness on Prod4 Cluster
Incident Report for Harness
Postmortem

Summary:

Customers experienced login failures with 5xx errors on the Prod4 cluster.

What was the issue?

The Harness platform internally uses a managed memStore, which experienced a “Host error” that triggered a master switchover within seconds. Backend microservices that connect to memStore were unable to reconnect quickly. The problem affected Java-based services; Go services reconnected properly.

Timeline:

Time                        Event
21 August 4:06:41 PM UTC    Primary memStore went down
21 August 4:06:41 PM UTC    Harness services began experiencing RedisResponseTimeoutException
21 August 4:07:00 PM UTC    Secondary memStore promoted to primary
21 August 4:14:30 PM UTC    Harness services restored connectivity to the new primary
21 August 4:14:53 PM UTC    New memStore instance added as secondary

Resolution:

After about 8 minutes, services reconnected to the new primary memStore on their own and the platform recovered.
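
The durations implied by the timeline can be worked out directly. The short Java sketch below uses only the timestamps listed in the timeline (the class and method names are illustrative, not part of any Harness code) and separates the time memStore took to fail over from the extra time the services needed to reconnect.

```java
import java.time.Duration;
import java.time.LocalDateTime;

// Illustrative helper: all timestamps are copied from the timeline (UTC, 21 August 2024).
final class OutageWindow {
    static final LocalDateTime PRIMARY_DOWN = LocalDateTime.of(2024, 8, 21, 16, 6, 41);
    static final LocalDateTime SECONDARY_PROMOTED = LocalDateTime.of(2024, 8, 21, 16, 7, 0);
    static final LocalDateTime SERVICES_RECONNECTED = LocalDateTime.of(2024, 8, 21, 16, 14, 30);

    // Time memStore itself needed to promote the secondary.
    static Duration failoverTime() {
        return Duration.between(PRIMARY_DOWN, SECONDARY_PROMOTED);
    }

    // Time from the primary going down until services were connected again.
    static Duration clientOutage() {
        return Duration.between(PRIMARY_DOWN, SERVICES_RECONNECTED);
    }

    // Time services stayed disconnected after a healthy primary already existed.
    static Duration reconnectLag() {
        return Duration.between(SECONDARY_PROMOTED, SERVICES_RECONNECTED);
    }

    public static void main(String[] args) {
        System.out.println("failover:      " + failoverTime().getSeconds() + " s");  // 19 s
        System.out.println("client outage: " + clientOutage().getSeconds() + " s");  // 469 s
        System.out.println("reconnect lag: " + reconnectLag().getSeconds() + " s");  // 450 s
    }
}
```

The failover itself took 19 seconds; the remaining 7 minutes 30 seconds of the outage was spent waiting for the services to reconnect.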

RCA

Java services use the Redisson library to connect to memStore. The established connection pool does not detect that the endpoint has gone away, so those connections linger until they eventually time out. This does not happen during a graceful failover; it occurs only after a catastrophic failure.
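
One common way to notice a vanished endpoint sooner is an application-level heartbeat: ping each pooled connection on a fixed interval and treat it as dead after a few consecutive misses. The sketch below is a hypothetical illustration of that idea, not Redisson's mechanism and not the fix Harness applied; `HeartbeatMonitor`, its parameters, and the 5 s / 1 s / 3-miss values are all assumptions chosen for the example.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Redisson code: counts consecutive failed pings on one
// pooled connection and reports when the connection should be treated as dead.
final class HeartbeatMonitor {
    private final int maxMissedPings;
    private int missed = 0;

    HeartbeatMonitor(int maxMissedPings) {
        if (maxMissedPings < 1) {
            throw new IllegalArgumentException("maxMissedPings must be >= 1");
        }
        this.maxMissedPings = maxMissedPings;
    }

    // Runs one probe. The ping is expected to return false (or throw) when no
    // reply arrives within its own deadline. Returns true once the connection
    // has missed maxMissedPings probes in a row.
    boolean probe(BooleanSupplier ping) {
        boolean ok;
        try {
            ok = ping.getAsBoolean();
        } catch (RuntimeException e) {
            ok = false;
        }
        missed = ok ? 0 : missed + 1;
        return missed >= maxMissedPings;
    }

    // Upper bound on detection time, assuming probes run one after another and
    // each cycle costs at most one interval plus one ping timeout.
    static Duration worstCaseDetection(Duration interval, Duration pingTimeout, int maxMissedPings) {
        return interval.plus(pingTimeout).multipliedBy(maxMissedPings);
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(3);
        BooleanSupplier deadEndpoint = () -> false; // simulates a host that never answers
        int probes = 0;
        boolean dead = false;
        while (!dead) {
            dead = monitor.probe(deadEndpoint);
            probes++;
        }
        System.out.println("declared dead after " + probes + " probes");
        Duration bound = worstCaseDetection(Duration.ofSeconds(5), Duration.ofSeconds(1), 3);
        System.out.println("worst-case detection: " + bound.getSeconds() + " s");
    }
}
```

With these assumed values the dead endpoint would be flagged within 18 seconds, compared with the roughly 8 minutes it took for the stale connections to time out during the incident.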

Action Item

  • Detect this kind of catastrophic failure and have services reconnect more quickly
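
A quicker reconnect could take the following shape: once a failure is detected, drop every pooled connection at once and reopen them against the current primary, rather than waiting for each connection to time out individually. This is a minimal sketch under that assumption; `ReconnectingPool` and `onEndpointDead` are hypothetical names, and the report does not describe how the action item was actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: a pool that replaces all of its connections in one step
// when the endpoint is reported dead.
final class ReconnectingPool<C> {
    private final Supplier<C> connector; // opens a fresh connection to the current primary
    private final int size;
    private final List<C> connections = new ArrayList<>();

    ReconnectingPool(Supplier<C> connector, int size) {
        this.connector = connector;
        this.size = size;
        refill();
    }

    // Opens connections until the pool is back at its configured size.
    private void refill() {
        while (connections.size() < size) {
            connections.add(connector.get());
        }
    }

    // Called by a health check when the endpoint is considered dead.
    // A real pool would also close the stale connections before dropping them.
    void onEndpointDead() {
        connections.clear();
        refill();
    }

    // Copy of the current connections, for inspection.
    List<C> snapshot() {
        return new ArrayList<>(connections);
    }
}
```

Invalidating the whole pool on the first detected failure is a deliberate choice here: after a host-level failure, every connection to the old primary is stale, so probing them one by one only prolongs the outage.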
Posted Sep 04, 2024 - 10:33 PDT

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted Aug 21, 2024 - 09:14 PDT
Investigating
We are currently investigating this issue.
Posted Aug 21, 2024 - 09:06 PDT
This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Mac Cloud Builds).