Prod1: Unified Dashboards may be experiencing delays

Incident Report for Harness

Postmortem

Summary

Starting on Jun 05, 2025 at 07:41 PDT, customers on Prod-1 observed stale data for 1 hour and 12 minutes on the following custom dashboards: pipeline executions, stage executions, and step executions.

What was the issue?

The root cause was a breaking change introduced during the upgrade of the ETL pipeline version. As a result, the ETL pipeline failed to run at its regular cadence, leading to stale data on the dashboards. No data was lost during this process.

Mitigation

The engineering team downgraded the ETL service to the previous working version after identifying the cause of the issue.

Root Cause Analysis

The issue was due to a breaking behavioral change in the ETL service, which caused the service to stop running at its regular cadence.

Action Items

  • The engineering team has added alerts for ETL service failures resulting from version upgrades.
  • We have also improved our internal processes, such as runbooks, for handling version changes for the ETL service.
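
As an illustration only, an alert of the kind described above could be sketched as a freshness check on the last successful ETL run. This is not the actual alerting configuration. The cadence, the threshold, and the function name is_etl_stale are hypothetical, since this report does not state how the alerts are implemented.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical values: the report does not state the real ETL cadence
# or the threshold used by the actual alerts.
EXPECTED_CADENCE = timedelta(minutes=5)
MISSED_RUNS_BEFORE_ALERT = 3


def is_etl_stale(last_success: datetime, now: datetime) -> bool:
    """Return True when the last successful run is older than the allowed gap."""
    allowed_gap = EXPECTED_CADENCE * MISSED_RUNS_BEFORE_ALERT
    return now - last_success > allowed_gap


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Simulated timestamp; a real check would read this from the ETL service.
    last_success = now - timedelta(minutes=20)
    if is_etl_stale(last_success, now):
        print("ALERT: ETL pipeline has not completed a run within the expected window")
```

A check like this fires on missed runs regardless of why they were missed, so it would also cover a version upgrade that silently stops the schedule.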
Posted Jun 17, 2025 - 11:27 PDT

Resolved

This incident has been resolved.
Posted Jun 05, 2025 - 08:53 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 05, 2025 - 08:47 PDT

Investigating

Some of our unified dashboards might be experiencing delays.
Posted Jun 05, 2025 - 07:41 PDT
This incident affected: Prod 1 (Custom Dashboards).