Between 07:28 UTC and 08:00 UTC on Tuesday, October 15, 2025, customers using Test Intelligence on the Prod2 cluster experienced service unavailability. The Tests tab returned 503 errors, preventing customers from viewing test reports and test execution results. All other services and clusters remained fully functional throughout the incident.
A database maintenance operation responsible for managing data partitions required an exclusive lock on database tables. This operation conflicted with a long-running account data cleanup task, causing database queries to queue up and time out. The Test Intelligence service health checks failed due to database connection timeouts, resulting in service unavailability.
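Customers using Test Intelligence in the Prod2 environment received 503 errors on the Tests tab and could not view test reports or test execution results for the duration of the incident.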
All other Harness services and production clusters operated normally without impact.
Immediate: Monitored service recovery as database operations completed and connectivity was restored. Adjusted the partition maintenance schedule to reduce the frequency of conflicts.
Permanent: Implementing improved data cleanup mechanisms with batched operations to prevent database lock contention. Coordinating database maintenance windows to avoid operational conflicts.
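As an illustration of the batched cleanup described above, the following is a minimal sketch only, not the actual implementation. The report does not name the database engine or schema, so the table `test_results`, the column `account_id`, and the function `delete_account_rows_in_batches` are hypothetical, and Python's built-in `sqlite3` module is used purely to keep the example runnable. The idea is that each batch commits in its own short transaction, so locks are released between batches and a maintenance operation waiting for an exclusive lock is not blocked for the full duration of the cleanup.

```python
import sqlite3


def delete_account_rows_in_batches(conn, account_id, batch_size=1000):
    """Delete one account's rows in small transactions.

    Hypothetical schema: a table `test_results` with an `account_id` column.
    Returns the total number of rows deleted.
    """
    total = 0
    while True:
        # `with conn` commits the batch on success and rolls back on error,
        # so each batch holds its locks only briefly.
        with conn:
            cur = conn.execute(
                "DELETE FROM test_results WHERE rowid IN ("
                "SELECT rowid FROM test_results WHERE account_id = ? LIMIT ?)",
                (account_id, batch_size),
            )
        if cur.rowcount == 0:
            break  # nothing left to delete for this account
        total += cur.rowcount
    return total
```

In a production database the same pattern would typically add a short pause between batches and key the inner `SELECT` on an indexed column; both details depend on the engine in use and are left out here.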
Action Items