Customers experienced login failures with 5xx errors on the Prod4 cluster.
The Harness platform internally uses a managed memStore, which experienced a “Host error”. This triggered a master switchover within seconds, but the backend microservices that connect to memStore were not able to reconnect quickly. The issue affected Java-based services; Go services reconnected properly.
Time | Event |
---|---|
21 August 4:06:41 PM UTC | Primary memStore went down |
21 August 4:06:41 PM UTC | Harness services experience RedisResponseTimeoutException |
21 August 4:07:00 PM UTC | Secondary memStore promoted to Primary |
21 August 4:14:30 PM UTC | Harness services restore connectivity to new Primary |
21 August 4:14:53 PM UTC | New instance of memStore added and promoted as Secondary |
After about 8 minutes, the services reconnected to the new primary memStore on their own and the system recovered.
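The intervals implied by the timeline can be checked with a few lines of Java. This is only a sketch: the class and constant names are hypothetical, and the timestamps are copied from the table above as UTC times of day on 21 August.

```java
import java.time.Duration;
import java.time.LocalTime;

// Hypothetical helper; timestamps are taken from the timeline table (UTC).
final class IncidentTimeline {
    static final LocalTime PRIMARY_DOWN = LocalTime.of(16, 6, 41);
    static final LocalTime SECONDARY_PROMOTED = LocalTime.of(16, 7, 0);
    static final LocalTime SERVICES_RECONNECTED = LocalTime.of(16, 14, 30);

    // Time memStore needed to promote the secondary.
    static Duration failoverTime() {
        return Duration.between(PRIMARY_DOWN, SECONDARY_PROMOTED);
    }

    // Time from the primary going down until services were connected again.
    static Duration serviceImpact() {
        return Duration.between(PRIMARY_DOWN, SERVICES_RECONNECTED);
    }

    // Time services stayed disconnected after a healthy primary already existed.
    static Duration reconnectLag() {
        return Duration.between(SECONDARY_PROMOTED, SERVICES_RECONNECTED);
    }
}
```

Under these timestamps the failover took 19 seconds, while the services were impacted for 7 minutes 49 seconds, of which 7 minutes 30 seconds passed after the new primary was already available.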
Java services use the Redisson library to connect to memStore. The established connection pool does not detect that the endpoint has gone away, and these connections eventually time out. This issue does not occur during a graceful failover; we encounter it only in the case of a catastrophic failure.
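One general way to shorten this kind of recovery is to probe pooled connections periodically and evict those that do not answer, instead of waiting for them to time out. The sketch below illustrates that idea in plain Java. It is not Redisson's implementation and not the fix applied here; `PingingPool` and its `ping` predicate are hypothetical names. Redisson does have a `pingConnectionInterval` setting in this area, but the report does not say whether or how it was configured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical pool that drops connections failing a liveness probe.
final class PingingPool<C> {
    private final List<C> connections = new ArrayList<>();
    // Assumed probe: returns true if the endpoint behind the connection answered in time.
    private final Predicate<C> ping;

    PingingPool(List<C> initial, Predicate<C> ping) {
        this.connections.addAll(initial);
        this.ping = ping;
    }

    // Probe every pooled connection and remove the ones that did not answer.
    // Returns the number of evicted connections.
    int evictDead() {
        int before = connections.size();
        connections.removeIf(c -> !ping.test(c));
        return before - connections.size();
    }

    int size() {
        return connections.size();
    }
}
```

In a real service, `evictDead()` would run on a schedule (for example every few seconds), with evicted connections re-established against the current primary endpoint. The probe interval then bounds how long a dead endpoint can go unnoticed.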
Action Item