We are experiencing Login failure in Prod1 and Prod2 clusters

Incident Report for Harness

Postmortem

Summary:

On September 3, 2025, 8:24 AM – 8:53 AM PDT, A subset of users were intermittently getting logged out.

Root Cause:
One authentication pod ran out of memory and failed to restart due to an issue in its token caching logic. This caused cache corruption and subsequent token validation failures.

Resolution:
We restarted the affected pod, which immediately restored service stability.

Next Steps:

  • The team has updated caching logic to prevent memory buildup under error conditions.
  • We have optimized configuration to exit cleanly on OutOfMemoryError for faster recovery.
  • Further strengthen monitoring and alerting to detect early signs of similar issues.
Posted Sep 08, 2025 - 22:10 PDT

Resolved

This incident has been resolved.
Posted Sep 04, 2025 - 02:52 PDT

Monitoring

We are monitoring the login and failures at this time
Posted Sep 03, 2025 - 08:59 PDT

Investigating

We are currently investigating this issue.
Posted Sep 03, 2025 - 08:47 PDT
This incident affected: Prod 1 (Platform) and Prod 2 (Platform).