Users experienced login failures on the Prod4 cluster because backend connection limits were exceeded. A surge of WebSocket connections following a customer account migration tripped the circuit breaker limit on Harness Global Gateway for the Prod4 cluster.

Time (UTC) | Event |
---|---|
March 2nd 2:14 PM | We received an alert for Prod4 login failures |
March 2nd 3:00 PM | Scaled Global Gateway pods from 2 to 4, which restored functionality |
March 2nd 3:06 PM | Increased RPS for Prod4 ILB |
March 2nd 3:10 PM | Confirmed that login is restored |

Following a customer account migration, WebSocket connections from delegate agents increased sharply and exceeded the connection limits set for backend hosts. Once the backend reached its maximum capacity, no new connections could be established. In addition, one of the backend pods restarted unexpectedly, leaving a single pod to handle all incoming traffic. Together, these conditions tripped the circuit breaker and caused the login failures.
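
The interaction between per-host connection limits and pod count can be illustrated with a small model. This is a minimal sketch, not the Global Gateway implementation: the `BackendHost` class, the `simulate` function, the limit of 1000 connections per pod, and the demand of 2500 connections are all hypothetical values chosen for illustration. It only shows why losing a pod sharply increases rejected connections and why adding pods relieves the pressure.

```python
from dataclasses import dataclass


@dataclass
class BackendHost:
    """One backend pod with a hard cap on concurrent connections (hypothetical)."""
    max_connections: int
    active: int = 0

    def try_connect(self) -> bool:
        # Reject the connection once the per-host limit is reached,
        # which is the condition that trips the circuit breaker in this model.
        if self.active >= self.max_connections:
            return False
        self.active += 1
        return True


def simulate(pod_count: int, limit_per_pod: int, demand: int) -> tuple[int, int]:
    """Offer `demand` connections round-robin; return (accepted, rejected)."""
    hosts = [BackendHost(limit_per_pod) for _ in range(pod_count)]
    accepted = rejected = 0
    for i in range(demand):
        # Start at the round-robin position and fall through to other hosts.
        for offset in range(pod_count):
            if hosts[(i + offset) % pod_count].try_connect():
                accepted += 1
                break
        else:
            # Every host is at its limit: the connection is refused.
            rejected += 1
    return accepted, rejected


if __name__ == "__main__":
    # Illustrative numbers only; the real limits and traffic are not in this report.
    for pods in (2, 1, 4):
        ok, refused = simulate(pods, limit_per_pod=1000, demand=2500)
        print(f"{pods} pod(s): accepted={ok} rejected={refused}")
```

Under these assumed numbers, two pods refuse 500 of the 2500 connections, a single pod refuses 1500, and four pods accept all of them, which mirrors the sequence described above: limits exceeded, a pod restart making things worse, and scaling from 2 to 4 pods restoring service.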