Pipeline Executions on PROD1 are running slow

Incident Report for Harness

Postmortem

Summary

On 2025-09-29, between 11:28 AM PST and 12:50 PM PST, some pipelines in the Prod1 environment ran more slowly than usual because of a temporary service performance issue. A configuration change mitigated the issue and restored normal performance.

Root Cause

A recent optimization intended to improve system efficiency unexpectedly created a resource imbalance, slowing down internal processing and leading to overall pipeline delays.

Impact

During the incident, customers experienced slower pipeline executions, but there were no functional errors or job failures.

Remediation

Immediate: Reverted the optimization configuration to stabilize performance.

Permanent: Applied system-level improvements to ensure balanced workload distribution across services.

Action Items

  • Improve system resilience: Adjust configuration handling to avoid similar resource imbalances.

  • Enhance monitoring: Strengthen internal metrics to detect early signs of performance degradation.
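As an illustration only, the monitoring action item could take the form of a check that compares recent pipeline durations against a known baseline and flags a sustained slowdown even when no job fails. The sketch below is not Harness's implementation. The class name, the baseline, the 1.5x ratio, and the window size are all hypothetical values chosen for the example.

```python
from collections import deque
from statistics import median


class SlowdownDetector:
    """Flags degradation when the median of recent durations exceeds
    a baseline by a configurable ratio (all values are hypothetical)."""

    def __init__(self, baseline_seconds, ratio=1.5, window=5):
        self.baseline_seconds = baseline_seconds
        self.ratio = ratio
        # Keep only the most recent `window` observations.
        self.recent = deque(maxlen=window)

    def observe(self, duration_seconds):
        """Record one pipeline duration; return True if degraded."""
        self.recent.append(duration_seconds)
        # Wait for a full window so one slow run does not trigger an alert.
        if len(self.recent) < self.recent.maxlen:
            return False
        return median(self.recent) > self.baseline_seconds * self.ratio


detector = SlowdownDetector(baseline_seconds=60, ratio=1.5, window=3)
results = [detector.observe(d) for d in (58, 62, 61, 120, 130)]
```

Using the median rather than the mean means a single slow execution does not raise an alert, while a sustained shift in durations does. That fits an incident like this one, where executions slowed down without producing errors.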

Posted Oct 08, 2025 - 14:51 PDT

Resolved

This incident has been resolved.
Posted Sep 29, 2025 - 16:54 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 29, 2025 - 12:37 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 29, 2025 - 12:29 PDT

Update

We are continuing to investigate this issue.
Posted Sep 29, 2025 - 12:28 PDT

Investigating

We are currently investigating this issue.
Posted Sep 29, 2025 - 12:19 PDT
This incident affected: Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).