Harness Overview page showing 400 Error
Incident Report for Harness
Postmortem

Overview

Harness customers in the Prod-2 cluster reported that the Project Overview dashboard was down. The issue impacted only the Dashboard API for Prod-2 customers; all other critical functions remained unaffected.

Timeline

Time                    Event
Nov 27, 9:30 AM UTC     Issue reported on customer Slack channels.
Nov 27, 9:49 AM UTC     Incident acknowledged internally and an investigation started.
Nov 27, 9:57 AM UTC     Latest deployment rolled back, which immediately resolved the issue.
Nov 27, 10:08 AM UTC    Incident marked as resolved on the Harness Status Page.

Resolution

The latest deployment was rolled back, restoring the Project Overview dashboard within 8 minutes of the incident being acknowledged internally (27 minutes after the issue was first reported).

Root Cause Analysis (RCA)
The issue was observed after the manager service release on Nov 27 on Prod-2. The release included an enhancement to the dashboard service that was incompatible with the service from which the dashboard service consumes data. Regrettably, a coordination oversight in release orchestration allowed this API contract incompatibility between the two services to be introduced.
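
As an illustration of the kind of check that can catch an API contract mismatch before release, here is a minimal sketch of a pre-deployment compatibility check between a consuming service and the service it reads from. It is not a description of Harness's release tooling: FieldSpec, find_contract_breaks, and the field names (projectId, deployments, deploymentCount) are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in an API response contract (hypothetical model)."""
    name: str
    type_name: str
    required: bool = True


def find_contract_breaks(provider_fields, consumer_fields):
    """Return reasons why the consumer cannot rely on the provider's response."""
    provided = {field.name: field for field in provider_fields}
    problems = []
    for expected in consumer_fields:
        actual = provided.get(expected.name)
        if actual is None:
            # Optional fields may be absent; required ones may not.
            if expected.required:
                problems.append(f"missing field: {expected.name}")
            continue
        if actual.type_name != expected.type_name:
            problems.append(
                f"type mismatch for {expected.name}: "
                f"expected {expected.type_name}, got {actual.type_name}"
            )
    return problems


# Hypothetical contracts; these field names are illustrative only.
upstream_response = [
    FieldSpec("projectId", "string"),
    FieldSpec("deployments", "array"),
]
dashboard_expects = [
    FieldSpec("projectId", "string"),
    FieldSpec("deploymentCount", "integer"),
]

breaks = find_contract_breaks(upstream_response, dashboard_expects)
if breaks:
    print("Block release:", "; ".join(breaks))
```

A check of this shape would typically run in the release pipeline for both services, so that a change on either side fails before deployment instead of surfacing as an error on the dashboard.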

Action Items
Our Architecture board will review how deployments of interdependent services, and of services that share common libraries, are managed so that similar issues are avoided.

Posted Nov 28, 2023 - 04:31 PST

Resolved
The issue was resolved after the rollback.
Posted Nov 27, 2023 - 02:08 PST
Identified
We have identified the issue and rolled back the latest deployment to resolve it quickly.
Posted Nov 27, 2023 - 02:03 PST
Investigating
We are currently investigating this issue.
Posted Nov 27, 2023 - 02:02 PST
This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Cloud Builds, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)).