Intermittent slowness while running pipelines

Incident Report for Harness

Postmortem

Summary

On April 27, 2026, customers running pipelines in the Prod3 environment experienced intermittent slowness in pipeline execution and delays in execution status updates in the UI.

It was caused by an unexpected load spike that created contention on a backend database supporting pipeline orchestration. The issue was mitigated and fully resolved.

Impact

Incident window: April 27, 2026, 1:00 PM – 3:12 PM PDT

  • Pipeline executions ran slower than normal, and some took longer than expected to complete; pipelines with strict timeouts may have failed
  • No widespread pipeline failures were observed
  • The execution view in the UI lagged behind real-time pipeline progress

There was no data loss. The majority of pipelines continued to execute successfully; the primary impact was increased latency and delayed UI updates.

Root Cause

Pipeline orchestration relies on a backend database to track execution state and power the execution view in the UI.
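For context, the execution-state record such a database holds might look like the following minimal sketch; the field names are illustrative assumptions, not Harness's actual schema:

```python
# Minimal sketch of an execution-state record; field names are assumptions,
# not Harness's actual schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ExecutionState:
    pipeline_id: str      # pipeline being tracked
    stage: str            # currently executing stage
    status: str           # e.g. "RUNNING", "SUCCEEDED", "FAILED"
    updated_at: datetime  # the UI's execution view renders from this field
```

Each status change a pipeline emits becomes a write against this store, which is why write latency on the database translates directly into staleness in the execution view.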

During the incident, an unexpected spike in load on this database caused lock contention and increased query latency across the orchestration layer. This created a backlog of execution-state updates, so the UI lagged behind actual pipeline execution until the system was scaled.
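The backlog dynamic is simple queueing arithmetic: while updates arrive faster than the contended database can commit them, the gap accumulates, and it only drains once capacity again exceeds the arrival rate. A small sketch with hypothetical numbers (none of these rates are measured values from this incident):

```python
# Illustrative queueing arithmetic; all rates are hypothetical, not
# measurements from this incident.
ARRIVAL_RATE = 200          # status updates arriving per second (assumed)
DEGRADED_RATE = 120         # commits per second under lock contention (assumed)
RECOVERED_RATE = 300        # commits per second after scaling up (assumed)
DEGRADED_SECONDS = 30 * 60  # length of the degraded window (assumed)

# While contended, updates arrive faster than they commit, so the
# execution view falls progressively further behind.
backlog = (ARRIVAL_RATE - DEGRADED_RATE) * DEGRADED_SECONDS

# After mitigation, the backlog drains at the surplus commit rate.
drain_seconds = backlog / (RECOVERED_RATE - ARRIVAL_RATE)

print(f"backlog at mitigation: {backlog} updates")         # 144000 updates
print(f"time to drain: {drain_seconds / 60:.0f} minutes")  # 24 minutes
```

This matches the observed pattern: no data loss, just a growing gap between real execution state and what the UI showed, which closed once the database was scaled.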

Remediation

Immediate Mitigation

  • Scaled up the affected database instance to increase CPU capacity
  • Reduced query latency and eliminated lock contention
  • Cleared the execution-view update backlog within ~30 minutes

These actions restored normal pipeline performance and UI responsiveness.

Action Items

To prevent similar issues from recurring, we have taken or are taking the following actions:

  • Capacity Improvements: Updated the Prod3 capacity baseline to prevent similar resource constraints
  • Proactive Detection: Enhancing monitoring and alerting for backend resource utilization, lock contention, and critical query latency (a sketch of such a check follows this list)
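As one example of what such detection could look like, the following sketch polls lock waits and slow queries. It assumes a PostgreSQL-backed orchestration store; the connection string, thresholds, and alerting path are all hypothetical, and Harness has not published its actual database technology:

```python
# Hypothetical monitoring check; assumes a PostgreSQL-backed orchestration
# store. Connection string and thresholds are placeholders, not Harness's.
import psycopg2

DSN = "postgresql://monitor@orchestration-db/harness"  # placeholder
LOCK_WAIT_THRESHOLD = 10   # ungranted locks before alerting (assumed)
SLOW_QUERY_SECONDS = 5     # runtime at which a query counts as slow (assumed)

def check_contention(dsn: str = DSN) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Sessions waiting on locks that have not been granted.
            cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted")
            waiting = cur.fetchone()[0]

            # Active queries running longer than the latency budget.
            cur.execute(
                """
                SELECT count(*) FROM pg_stat_activity
                WHERE state = 'active'
                  AND now() - query_start > %s * interval '1 second'
                """,
                (SLOW_QUERY_SECONDS,),
            )
            slow = cur.fetchone()[0]
    finally:
        conn.close()

    if waiting > LOCK_WAIT_THRESHOLD or slow > 0:
        # A real check would page on-call; printing stands in here.
        print(f"ALERT: {waiting} lock waits, {slow} slow queries")

if __name__ == "__main__":
    check_contention()
```
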
Posted Apr 29, 2026 - 12:58 PDT

Resolved

We were seeing slowness while executing pipelines.
Posted Apr 27, 2026 - 13:00 PDT