We are experiencing login failures in the Prod1 and Prod2 clusters

Incident Report for Harness

Postmortem

Summary:

On September 3, 2025, from 8:24 AM to 8:53 AM PDT, a subset of users was intermittently logged out.

Root Cause:
One authentication pod ran out of memory and failed to restart due to an issue in its token caching logic. This caused cache corruption and subsequent token validation failures.

Resolution:
We restarted the affected pod, which immediately restored service stability.

Next Steps:

We have updated the caching logic to prevent memory buildup under error conditions.
We have updated the configuration so that the service exits cleanly on OutOfMemoryError, allowing faster recovery.
We will further strengthen monitoring and alerting to detect early signs of similar issues.
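
For readers who want a concrete picture of the first item, the following is a minimal sketch of a token cache that cannot grow under error conditions. It is not the actual fix: this report does not describe the service's code or language, so Python is used only for brevity, and the names BoundedTokenCache, get_or_validate, max_entries, and ttl_seconds are hypothetical. The sketch assumes two safeguards: a hard cap on the number of entries, and storing a result only after validation succeeds.

```python
import time
from collections import OrderedDict
from typing import Callable


class BoundedTokenCache:
    """Illustrative LRU cache with a hard size cap and a per-entry TTL.

    Hypothetical sketch only; not the production implementation.
    """

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        # Maps token -> (expiry time, validated claims), oldest use first.
        self._entries = OrderedDict()

    def __len__(self) -> int:
        return len(self._entries)

    def get_or_validate(self, token: str,
                        validate: Callable[[str], dict]) -> dict:
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None:
            expires_at, claims = cached
            if now < expires_at:
                self._entries.move_to_end(token)  # mark as most recently used
                return claims
            del self._entries[token]  # expired: drop before revalidating

        # If validate() raises, nothing has been stored yet, so failed
        # validations cannot accumulate in memory.
        claims = validate(token)
        self._entries[token] = (now + self._ttl, claims)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return claims
```

In this sketch the caller passes its own validation function, for example cache.get_or_validate(token, verify_signature), where verify_signature is likewise a hypothetical name. An exception from that function propagates to the caller and leaves the cache unchanged.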

Posted Sep 08, 2025 - 22:10 PDT

Resolved

This incident has been resolved.

Posted Sep 04, 2025 - 02:52 PDT

Monitoring

We are monitoring logins and login failures at this time.

Posted Sep 03, 2025 - 08:59 PDT

Investigating

We are currently investigating this issue.

Posted Sep 03, 2025 - 08:47 PDT

This incident affected: Prod 1 (Platform) and Prod 2 (Platform).