Users from Prod2 cluster faced internal error after logging in
Incident Report for Harness
Postmortem

Overview

There was an issue reported by multiple harness customers in Prod-2 cluster where 500 errors were seen while accessing or trying to run pipelines and licensing information was also inaccessible.

Timeline

Time Event
8 Jan 7:23 AM UTC Issue was reported internally along with Customer reporting.
8 Jan 7:23 AM UTC Internal Incident created.
8 Jan 7:23 AM UTC Rolled back system deployment which immediately resolved the issue.
8 Jan 7:28 AM UTC Internal Incident Resolved.

Resolution

We rolled back our latest system deployment which resolved the issue within 5 minutes of the issue being reported.

Root Cause Analysis

Post our manager service release, a change in licensing resource resulted in cache failures. License API is called to fetch license information to check entitlements of services. Addition of new fields in license resources caused failures which resulted in unhandled exceptions.

Action Item

  • We have implemented exception handling around api calls to handle cache failure that avoids service breakdown
  • Review Cache management during software releases to avoid such failures
Posted Jan 10, 2024 - 06:27 PST

Resolved
The issue was resolved once we did the rollback of the most recent deployment.
Posted Jan 07, 2024 - 23:28 PST
Investigating
We are currently investigating this issue.
Posted Jan 07, 2024 - 23:23 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Integration Enterprise(CIE) - Cloud Builds).