Stale Data Observed for Custom Dashboards in Prod1

Incident Report for Harness

Postmortem

Summary

On March 25, 2025 PDT (March 26 UTC), for 2 hours and 22 minutes, customers in the prod-1 production environment observed stale data on the following custom dashboards: pipeline executions, stage executions, and step executions.

What was the issue?

The metadata state tables that manage the ETL process were corrupted during a version upgrade and had to be repaired. No data was lost during this process.

Resolution

The metadata state was reset to trigger data mart updates.

Time (UTC) Event
26 Mar 2:04 AM We identified the ETL process that timed out after the upgrade.
26 Mar 3:18 AM Redeployed the ETL process, applied the plan, and recreated the views.
26 Mar 4:22 AM The metadata schema was rebuilt, and all data quality checks were confirmed to be passing.
26 Mar 4:25 AM The incident was resolved.

RCA

Plan application errors occurred because the ETL process upgrade timed out after running for two hours. The timeout left the metadata corrupted, and the metadata then had to be repaired. No data was lost, but dashboards showed stale data because the data marts were not updated with the latest ETL intervals while the metadata was being recreated.
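The staleness described above can be caught with a simple freshness check. The following is a minimal sketch only, not the monitoring used in production: the function name, the data mart names, the timestamps, and the 30-minute threshold are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stale_marts(last_updated, now, max_lag):
    """Return the names of data marts whose latest loaded interval
    is older than max_lag, sorted alphabetically."""
    return sorted(
        name for name, loaded_at in last_updated.items()
        if now - loaded_at > max_lag
    )


# Hypothetical example values (not taken from the incident).
now = datetime(2025, 3, 26, 4, 0, tzinfo=timezone.utc)
last_updated = {
    "pipeline_executions": datetime(2025, 3, 26, 1, 30, tzinfo=timezone.utc),
    "stage_executions": datetime(2025, 3, 26, 3, 55, tzinfo=timezone.utc),
}

stale = find_stale_marts(last_updated, now, timedelta(minutes=30))
print(stale)
```

A check of this kind compares the most recent ETL interval loaded into each data mart with the current time, so it would flag staleness regardless of whether the underlying cause is a failed upgrade or something else.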

Action Items

  • Update the ETL framework frequently to avoid significant version number jumps.
  • Set up a regular cadence for testing new updates and deploying them into production.
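One way to act on the first item is a pre-upgrade guard that flags large version jumps before they reach production. This is a rough sketch under assumptions: the "major.minor" version format, the function names, and the threshold of two minor versions are hypothetical and not part of the ETL framework.

```python
def version_gap(current, target):
    """Return (major_diff, minor_diff) between two 'major.minor[.patch]' strings."""
    cur_major, cur_minor = (int(p) for p in current.split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.split(".")[:2])
    return tgt_major - cur_major, tgt_minor - cur_minor


def is_risky_upgrade(current, target, max_minor_jump=2):
    """Flag upgrades that cross a major version or skip more than
    max_minor_jump minor versions (hypothetical policy)."""
    major_diff, minor_diff = version_gap(current, target)
    return major_diff > 0 or minor_diff > max_minor_jump


# Hypothetical version numbers for illustration.
print(is_risky_upgrade("0.90.1", "0.91.0"))   # small step
print(is_risky_upgrade("0.90.1", "0.120.0"))  # large jump
```

A guard like this could run in a deployment pipeline and require an extra test cycle in a non-production environment whenever it flags an upgrade.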
Posted Apr 01, 2025 - 03:38 PDT

Resolved

This incident has been resolved. Thanks for your patience.
Posted Mar 25, 2025 - 21:26 PDT

Update

We are working towards testing a fix in our dev environment.
Posted Mar 25, 2025 - 21:13 PDT

Update

We are continuing to work on a fix for the issue.
Posted Mar 25, 2025 - 20:28 PDT

Update

We are continuing to work on a fix for the issue.
Posted Mar 25, 2025 - 19:42 PDT

Identified

We are working on a fix. We have identified that only Unified Dashboards for pipeline, stage, and steps are currently impacted.
Posted Mar 25, 2025 - 19:05 PDT

Investigating

We are currently investigating this issue.
Posted Mar 25, 2025 - 19:04 PDT
This incident affected: Prod 1 (Custom Dashboards).