Test Reports are impacted in Prod2

Incident Report for Harness

Postmortem

Summary

Between 07:28 UTC and 08:00 UTC on Tuesday, October 15, 2025, customers using Test Intelligence on the Prod2 cluster experienced service unavailability. The Tests tab returned 503 errors, preventing customers from viewing test reports and test execution results. All other services and clusters remained fully functional throughout the incident.

Root Cause

A database maintenance operation responsible for managing data partitions required an exclusive lock on database tables. This operation conflicted with a long-running account data cleanup task, causing database queries to queue up and time out. The Test Intelligence service health checks failed due to database connection timeouts, resulting in service unavailability.

Impact

Customers using Test Intelligence in the Prod2 environment experienced the following:

  • Tests tab returned 503 errors
  • Inability to view test reports - Test execution results and historical data were inaccessible

All other Harness services and production clusters operated normally without impact.

Remediation

Immediate: Monitored service recovery as database operations completed and connectivity was restored. Adjusted partition maintenance schedule to reduce frequency of conflicts.

Permanent: Implementing improved data cleanup mechanisms with batched operations to prevent database lock contention. Coordinating database maintenance windows to avoid operational conflicts.

Action Items

  • Coordinate maintenance schedules – Ensure database partition management and cleanup operations run during non-overlapping maintenance windows. Partition management operation frequency reduced
  • Improve service resilience – Extend health check timeouts and implement graceful degradation patterns to handle transient database delays
  • Update the priority on readiness probe alert to generate event immediately
  • Increase timeout for data deletion for an account
  • Being proactive and work on preventing such issues happening again by adding granular monitoring and high watermark metrics to make sure we can mitigate before we see any signs of degradation.
Posted Oct 27, 2025 - 09:41 PDT

Resolved

The incident has been resolved. We’ll share the RCA at the earliest.
Posted Oct 15, 2025 - 01:44 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Oct 15, 2025 - 01:10 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Oct 15, 2025 - 00:37 PDT
This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds).