Login failures in prod4

Incident Report for Harness

Postmortem

Summary

On Monday, March 31, 2025, at 6:53 PM UTC, some customers experienced authentication issues, including being logged out of Harness. The incident affected users with accounts hosted on our Prod-4 cluster. It was resolved by 7:06 PM UTC, for a total of approximately 13 minutes of downtime.

Impact

  • Duration: 13 minutes (6:53 PM - 7:06 PM UTC)
  • Affected Users: Customers with accounts hosted on the Prod-4 cluster
  • Symptoms: Authentication failures, unexpected logouts, and a drop in traffic

Resolution

Our engineering team identified the issue and took immediate action:

  1. Reverted the configuration change at 7:06 PM UTC
  2. Rolled back the deployment to the previous stable version (1.16.0) at 7:08 PM UTC
  3. Verified service restoration across all affected systems

Root Cause Analysis (RCA)

The incident was caused by a routing configuration error in our Global Gateway service. During a planned deployment, a change to our routing logic inadvertently prevented requests from being correctly directed to the Prod-4 cluster. As a result, authentication sessions for affected customers could not be maintained.

Action Items

To prevent similar incidents in the future, we are implementing the following improvements:

  1. Improved validation of routing configuration changes
  2. Additional monitoring to detect routing anomalies earlier
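
As an illustration only, the two improvements above could take a shape like the following Python sketch. It is not the Global Gateway implementation: the rule format (account prefix mapped to a cluster name), the cluster names other than Prod-4, the function names, and the 50% drop threshold are all assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical shape: account-ID prefix -> target cluster name.
# The real Global Gateway configuration format is not described in this report.
RoutingRules = Dict[str, str]


def find_unrouted_clusters(rules: RoutingRules, clusters: Set[str]) -> List[str]:
    """Return known clusters that no routing rule targets, sorted by name."""
    targeted = set(rules.values())
    return sorted(clusters - targeted)


def find_unknown_targets(rules: RoutingRules, clusters: Set[str]) -> List[str]:
    """Return rule targets that do not match any known cluster, sorted by name."""
    return sorted(set(rules.values()) - clusters)


def traffic_dropped(baseline_rps: float, current_rps: float,
                    max_drop: float = 0.5) -> bool:
    """Flag a cluster whose request rate fell by more than max_drop of its baseline."""
    if baseline_rps <= 0:
        # No meaningful baseline to compare against.
        return False
    return (baseline_rps - current_rps) / baseline_rps > max_drop


if __name__ == "__main__":
    # Assumed cluster names; only Prod-4 is named in this report.
    clusters = {"prod-1", "prod-2", "prod-3", "prod-4"}
    # A proposed change in which no rule points at prod-4.
    proposed = {"a": "prod-1", "b": "prod-2", "c": "prod-3", "d": "prod-3"}

    print(find_unrouted_clusters(proposed, clusters))
    print(find_unknown_targets(proposed, clusters))
    print(traffic_dropped(baseline_rps=1000.0, current_rps=100.0))
```

Under these assumptions, the first check would run before a deployment and block any change that leaves a cluster without a route, while the second would run continuously against per-cluster request rates so that a sudden drop raises an alert.
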
Posted Apr 03, 2025 - 14:34 PDT

Resolved

This incident has been resolved.
Posted Mar 31, 2025 - 12:09 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 31, 2025 - 12:05 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 31, 2025 - 12:00 PDT

Update

We are continuing to investigate this issue.
Posted Mar 31, 2025 - 11:56 PDT

Investigating

We are currently investigating login failures in prod4.
Posted Mar 31, 2025 - 11:54 PDT
This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG)).