Seeing intermittent login issues on our Prod4 environment

Incident Report for Harness

Postmortem

Summary:

Users experienced login failures on the Prod4 cluster because backend connection limits were exceeded. A surge of WebSocket connections following a customer account migration tripped the circuit breaker limit on the Harness Global Gateway for the Prod4 cluster.

Timeline:

Time (UTC)           Event
March 2nd 2:14 PM    Received an alert for Prod4 login failures.
March 2nd 3:00 PM    Scaled Global Gateway pods from 2 to 4; functionality was restored.
March 2nd 3:06 PM    Increased RPS for the Prod4 ILB.
March 2nd 3:10 PM    Confirmed that login was restored.

Resolution:

  • Increased backend capacity by scaling the Global Gateway service to distribute the load more effectively.
  • Set up alerts to monitor system stability and confirm cluster connectivity.

RCA:

Following a customer account migration, there was a significant increase in WebSocket connections from delegate agents, exceeding the connection limits set for backend hosts. The backend system reached its maximum capacity, preventing new connections from being established. Additionally, one of the backend pods restarted unexpectedly, leaving only a single pod to handle all incoming traffic. This led to the circuit breaker being activated, causing login failures.
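The failure mode described above can be illustrated with a small simulation. This is a minimal sketch and not the Global Gateway's actual implementation: the `Backend` and `Gateway` classes, the limit of 100 connections per pod, and the connection counts are all hypothetical values chosen for illustration. It shows how long-lived WebSocket connections can use up a shared per-backend limit so that a later login request is refused, and how losing one of two pods halves the available headroom.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical backend pod with a cap on concurrent connections."""
    name: str
    max_connections: int
    active: int = 0
    healthy: bool = True

    def has_capacity(self) -> bool:
        return self.healthy and self.active < self.max_connections


class Gateway:
    """Hypothetical gateway that refuses traffic once every backend is full."""

    def __init__(self, backends):
        self.backends = list(backends)

    def open_connection(self, kind: str) -> str:
        # Only healthy backends below their connection limit are eligible.
        candidates = [b for b in self.backends if b.has_capacity()]
        if not candidates:
            # Circuit breaker behaviour: reject instead of queueing.
            raise ConnectionRefusedError(f"circuit open: no capacity for {kind}")
        target = min(candidates, key=lambda b: b.active)
        target.active += 1
        return target.name


def simulate(websocket_count: int, pods_healthy: int) -> bool:
    """Return True if a login succeeds after the WebSocket surge."""
    backends = [Backend(f"pod-{i}", max_connections=100) for i in range(2)]
    for b in backends[pods_healthy:]:
        b.healthy = False  # models the unexpected pod restart
    gateway = Gateway(backends)

    # WebSocket connections are long-lived, so they are never released here.
    for _ in range(websocket_count):
        try:
            gateway.open_connection("websocket")
        except ConnectionRefusedError:
            break

    try:
        gateway.open_connection("login")
        return True
    except ConnectionRefusedError:
        return False


if __name__ == "__main__":
    print("2 pods, 150 WebSockets, login ok:", simulate(150, 2))
    print("1 pod,  150 WebSockets, login ok:", simulate(150, 1))
```

With both hypothetical pods healthy, 150 WebSocket connections fit within the combined limit and the login succeeds. With one pod down, the same surge fills the remaining pod and the login is refused.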

Action Items:

  • Implement a dedicated traffic splitting configuration to handle WebSocket connections separately from other API requests to prevent similar incidents in the future.
  • Improve monitoring and alerting to detect when connection limits are approaching critical thresholds.
  • Conduct scalability testing to ensure the system can handle large numbers of WebSocket connections without reaching critical limits.
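
The first two action items above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the planned Harness configuration: the `Pool` class, the pool sizes, and the 80% warning ratio are hypothetical. WebSocket upgrade requests are routed to their own connection pool, so a surge there cannot consume the capacity used by login and other API requests, and each pool reports when utilization approaches its limit.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical connection pool with its own limit and alert threshold."""
    name: str
    limit: int
    active: int = 0
    warn_ratio: float = 0.8  # assumed threshold: alert at 80% utilization

    def utilization(self) -> float:
        return self.active / self.limit

    def should_alert(self) -> bool:
        # Fires before the hard limit is reached, not after.
        return self.utilization() >= self.warn_ratio

    def try_acquire(self) -> bool:
        if self.active >= self.limit:
            return False
        self.active += 1
        return True


def pick_pool(headers: dict, pools: dict) -> Pool:
    """Route WebSocket upgrade requests to a dedicated pool."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    is_websocket = normalized.get("upgrade", "").lower() == "websocket"
    return pools["websocket"] if is_websocket else pools["api"]


if __name__ == "__main__":
    pools = {"websocket": Pool("websocket", 1000), "api": Pool("api", 200)}

    # A surge of delegate WebSocket connections saturates only its own pool.
    for _ in range(1200):
        pick_pool({"Upgrade": "websocket"}, pools).try_acquire()

    login_ok = pick_pool({}, pools).try_acquire()
    print("websocket alert:", pools["websocket"].should_alert())
    print("login accepted:", login_ok)
```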
Posted Mar 05, 2025 - 21:56 PST

Resolved

After continued monitoring and further investigation, the issue is considered resolved.
Posted Mar 02, 2025 - 08:57 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 02, 2025 - 08:18 PST

Update

We have identified the issue, and a mitigation has been applied.

The team is continuing to investigate the source of the issue.
Posted Mar 02, 2025 - 08:18 PST

Identified

An issue has been identified with our Global Gateway, affecting routing to our Prod4 environment.

The team is continuing to investigate the issue.
Posted Mar 02, 2025 - 08:07 PST

Investigating

We are seeing intermittent login issues on our Prod4 environment.
Posted Mar 02, 2025 - 07:42 PST
This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Code Repository).