Pipeline executions experiencing degraded performance

Incident Report for Harness

Postmortem

Summary

On Wednesday, September 10, 2025, between 8:58 AM and 9:49 AM PDT, customers experienced slower pipeline runs in the Prod2 cluster. Pipelines that started during this window took longer than usual to finish. All pipelines eventually completed, and no other parts of the product or service were affected.

Root Cause

During this period, there was a sudden, sharp increase in the number of pipeline runs. The extra load slowed down the database, which in turn slowed the backend services. As a result, customers saw delays in their pipeline executions.

Impact

  • Pipelines that were started during the incident window ran slower than normal.
  • Individual stages and steps also took longer to begin, and this delay was visible in the Harness UI.

Remediation

  • Immediate: Scaled up the database server to handle the increased load

Action Items

  1. Optimize the Expression Engine to reduce database usage, so the databases have more headroom to absorb load spikes
  2. Reduce recovery time by detecting load spikes earlier and scaling up preventively
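The second action item (early detection with preventive scale-ups) can be sketched as a simple threshold check over recent database load samples. This is an illustrative sketch only; the metric name, threshold, and sample window are hypothetical, not Harness internals.

```python
# Hypothetical sketch: trigger a preventive scale-up when a database
# load metric (e.g. CPU utilization as a fraction of capacity) stays
# above a threshold for several consecutive samples, rather than
# waiting for pipeline executions to visibly slow down.

def should_scale_up(cpu_samples, threshold=0.75, sustained=3):
    """Return True when the last `sustained` samples all exceed
    `threshold`, signalling a sustained load spike."""
    if len(cpu_samples) < sustained:
        return False
    return all(s > threshold for s in cpu_samples[-sustained:])

# A spike that holds for three consecutive samples triggers a scale-up;
# a single transient spike does not.
print(should_scale_up([0.40, 0.55, 0.80, 0.82, 0.90]))  # True
print(should_scale_up([0.40, 0.90, 0.50, 0.85, 0.70]))  # False
```

Requiring the breach to be sustained avoids scaling on momentary blips while still reacting within a few sampling intervals of a genuine spike.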
Posted Sep 14, 2025 - 17:29 PDT

Resolved

This incident has been resolved.
Posted Sep 10, 2025 - 14:37 PDT

Update

We have identified and mitigated the issue, and are currently monitoring.
Posted Sep 10, 2025 - 09:47 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 10, 2025 - 09:46 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM)).