K8s Customer Billing Data will be outdated

Incident Report for Harness

Postmortem

Summary

On May 4, 2026, after a service deployment in Prod2, background schema migrations did not complete successfully. As a result, some cluster and perspective-related data appeared missing or stale for Elevance and a small number of other Prod2 customers.

The service itself remained available, but the database schema was not updated to match the new application code. This caused downstream billing and cluster data processing jobs to fail or skip expected data updates.

Impact

  • Affected customers saw missing or stale cluster and perspective data in the UI.
  • Data appeared to stop updating around April 30 for impacted accounts.
  • Elevance was the primary impacted customer, along with a few other Prod2 customers.
  • There was no full service downtime.
  • Once remediated, the missing data was backfilled.

Root Cause

During the Prod2 deployment, the background schema migration process attempted to acquire a Redis distributed lock before running Timescale database migrations.

The lock acquisition failed immediately due to a PersistentLockException, likely because another replica or overlapping deployment process was holding the lock. Since the migration used a zero-wait lock acquisition path, it did not retry, and the migration did not run.

The failure was logged only as a generic warning rather than a clear production error. As a result, the migration failure was not surfaced through alerts, and the service continued running against a database schema older than the one the new application code expected.
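
The failure mode described above can be illustrated with a minimal sketch. This is not the service's actual code: a local threading.Lock stands in for the Redis distributed lock, and the names migration_lock and run_migrations_zero_wait are hypothetical. It only shows how a single non-blocking acquisition attempt, combined with a warning-level log, lets a migration be skipped while the caller carries on.

```python
import logging
import threading

log = logging.getLogger("migrations")

# Assumption: a local lock stands in for the Redis distributed lock.
# Only the "try once, do not wait" semantics are being illustrated.
migration_lock = threading.Lock()


def run_migrations_zero_wait(migrate) -> bool:
    """Try the lock exactly once; skip the migration if it is already held."""
    if not migration_lock.acquire(blocking=False):
        # Failure mode: a warning is easy to miss, and startup continues
        # with the schema unchanged.
        log.warning("could not acquire migration lock; skipping migrations")
        return False
    try:
        migrate()
        return True
    finally:
        migration_lock.release()
```

If another holder has the lock at the moment of the call, the function returns False and the migration callback never runs, which matches the behavior seen in this incident.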

Remediation

The service was redeployed in Prod2. On restart, the background migrations completed successfully. The team then reran the required job to backfill the missing data and restore expected cluster and perspective data visibility.

Preventive Actions

  • Update background migration locking behavior to wait/retry when acquiring the migration lock.
  • Improve logging from generic warnings to explicit error logs when schema migrations are skipped.
  • Add alerting for failed or skipped production schema migrations.
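
One way the first two actions could fit together is sketched below. This is an illustration under stated assumptions, not the planned implementation: the lock is any object with a threading.Lock-style acquire/release (standing in for the Redis distributed lock), and MigrationSkippedError, run_migrations_with_retry, timeout_s, and poll_s are hypothetical names and defaults.

```python
import logging
import threading
import time

log = logging.getLogger("migrations")


class MigrationSkippedError(RuntimeError):
    """Hypothetical error raised when the migration lock is not acquired in time."""


def run_migrations_with_retry(lock, migrate, timeout_s=60.0, poll_s=1.0):
    """Retry lock acquisition until timeout_s elapses, then fail loudly."""
    deadline = time.monotonic() + timeout_s
    while True:
        if lock.acquire(blocking=False):
            try:
                migrate()
                return
            finally:
                lock.release()
        if time.monotonic() >= deadline:
            # Explicit error log plus an exception, instead of a generic warning.
            log.error(
                "schema migration skipped: lock not acquired within %.1fs",
                timeout_s,
            )
            raise MigrationSkippedError("migration lock unavailable")
        time.sleep(poll_s)
```

Raising an exception rather than returning quietly gives the caller a clear signal it can act on, for example by failing the deployment or triggering an alert on the error log line.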
Posted May 19, 2026 - 17:03 PDT

Resolved

This incident has been resolved.
Posted May 05, 2026 - 01:17 PDT

Investigating

We are currently experiencing an issue: K8s customer billing data will be stale.
Posted May 05, 2026 - 00:57 PDT
This incident affected: Prod 2 (Cloud Cost Management (CCM)).