Pipeline executions experiencing degraded performance

Incident Report for Harness

Postmortem

Summary

On Wednesday, September 10, 2025, between 8:58 AM and 9:49 AM PDT, customers experienced slower pipeline runs in the Prod2 cluster. Pipelines that started during this window took longer than usual to finish. All pipelines eventually completed, and no other parts of the product or service were affected.

Root Cause

During this period, there was a sudden, sharp increase in the number of pipeline runs. The extra load slowed down the database, which in turn slowed the backend services. As a result, customers saw delays in their pipeline executions.

Impact

  • Pipelines that were started during the incident window ran slower than normal.
  • Individual stages and steps also took longer to begin, and this delay was visible in the Harness UI.

Remediation

  • Immediate: Scaled up the database server to handle the increased load

Action Items

  1. Optimize the Expression Engine to reduce database usage, so the databases have more headroom to absorb load spikes
  2. Reduce recovery time by detecting load spikes earlier and scaling up preventively
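The second action item (early detection with preventive scale-ups) can be sketched as a simple threshold check over recent database load samples. This is an illustrative sketch only; the metric name, threshold, and sample window are hypothetical, not Harness internals.

```python
# Hypothetical sketch: trigger a preventive scale-up when a database
# load metric (e.g. CPU utilization as a fraction of capacity) stays
# above a threshold for several consecutive samples, rather than
# waiting for pipeline executions to visibly slow down.

def should_scale_up(cpu_samples, threshold=0.75, sustained=3):
    """Return True when the last `sustained` samples all exceed
    `threshold`, signalling a sustained load spike."""
    if len(cpu_samples) < sustained:
        return False
    return all(s > threshold for s in cpu_samples[-sustained:])

# A spike that holds for three consecutive samples triggers a scale-up;
# a single transient spike does not.
print(should_scale_up([0.40, 0.55, 0.80, 0.82, 0.90]))  # True
print(should_scale_up([0.40, 0.90, 0.50, 0.85, 0.70]))  # False
```

Requiring the breach to be sustained avoids scaling on momentary blips while still reacting within a few sampling intervals of a genuine spike.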
Posted Sep 14, 2025 - 17:29 PDT

Resolved

This incident has been resolved.
Posted Sep 10, 2025 - 14:37 PDT

Update

We have identified and mitigated the issue, and are currently monitoring.
Posted Sep 10, 2025 - 09:47 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 10, 2025 - 09:46 PDT
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM)).