Slowness in the Pipeline Execution Graph UI

Incident Report for Harness

Postmortem

Summary

On February 27, 2026, customers experienced slowness when viewing running pipeline execution pages in the Harness UI. The issue was caused by delays in processing the orchestration events used to generate the pipeline execution graph.

The degradation began around 7:33 AM PT and resulted in delayed updates and slow loading of pipeline execution views. The engineering team identified the underlying performance bottleneck, applied mitigation measures, and restored normal system behavior after stabilizing the event processing pipeline.

Root Cause

The incident was caused by a temporary backlog in Kafka consumers responsible for processing orchestration log events, which are used to generate the execution graph for running pipelines.

The backlog was triggered by increased system load combined with performance degradation in a shared Elasticsearch cluster used by the pipeline processing services. During the incident window, Elasticsearch experienced a sudden spike in indexing activity, which caused resource contention and high CPU utilization on one of the cluster nodes.

This slowdown in Elasticsearch queries reduced the processing throughput of the Kafka consumers responsible for graph generation, resulting in accumulated consumer lag and delayed updates in the pipeline execution UI.
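To illustrate the failure mode (this is a generic sketch, not Harness's actual monitoring code): a consumer group's lag on each partition is the difference between the partition's log end offset and the group's last committed offset. When downstream work per event slows down, as the Elasticsearch queries did here, committed offsets advance more slowly than end offsets and lag accumulates.

```python
# Illustrative sketch only -- not Harness's actual tooling.
# Per-partition lag = log end offset - last committed offset.

def consumer_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Return per-partition lag for a consumer group.

    A partition with no committed offset is treated as fully unconsumed.
    """
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

def total_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Aggregate lag across all partitions of a topic."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())
```

In a real Kafka deployment the same quantities are visible per partition via `kafka-consumer-groups.sh --describe`, which reports CURRENT-OFFSET, LOG-END-OFFSET, and LAG columns.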

Impact

During the incident window:

  • Users experienced slow loading or delayed updates when viewing running pipeline execution pages.
  • The pipeline graph visualization and related execution details were slower to render.
  • Pipeline executions themselves continued to run normally, but the UI display of their progress was delayed.

Other Harness services and pipeline execution functionality were not impacted.

Mitigation

Engineering teams implemented several mitigation steps to restore system performance:

  • Scaled the Elasticsearch cluster to relieve resource pressure and improve query performance.
  • Scaled Kafka consumer capacity to accelerate backlog processing.

These actions improved consumer processing throughput and allowed the Kafka backlog to drain. Consumer lag steadily decreased, and the pipeline execution UI returned to normal responsiveness.
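As a back-of-the-envelope model of why scaling consumers drains a backlog (the numbers below are assumed for illustration, not measured values from this incident): the backlog shrinks only when aggregate consumption outpaces the ongoing produce rate.

```python
# Illustrative capacity model with assumed numbers, not incident data.

def drain_time_seconds(backlog: int, produce_rate: float,
                       per_consumer_rate: float, consumers: int) -> float:
    """Estimate seconds to drain a backlog of `backlog` pending events.

    produce_rate: events/sec still arriving on the topic
    per_consumer_rate: events/sec each consumer can process
    consumers: consumer count (in Kafka, effectively capped by the
               topic's partition count)
    """
    net_rate = per_consumer_rate * consumers - produce_rate
    if net_rate <= 0:
        return float("inf")  # backlog keeps growing
    return backlog / net_rate
```

For example, with events arriving at 500/s and each consumer handling 100/s, four consumers fall behind (net -100/s), while eight consumers drain a 30,000-event backlog at a net 300/s in about 100 seconds.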

Prevention and Improvements

To reduce the likelihood of similar incidents in the future, the following improvements are being implemented:

  • Capacity planning improvements for shared Elasticsearch clusters supporting orchestration workloads.
  • Additional safeguards to prevent conditions that can amplify indexing activity.

These measures will help ensure better isolation of workloads and faster detection of resource contention scenarios.
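The "faster detection" goal can be sketched as a sustained-lag alert: fire only when consumer lag stays above a threshold for several consecutive samples, so transient spikes do not page anyone. The threshold and sample count below are illustrative assumptions, not Harness's actual alerting configuration.

```python
# Illustrative alerting sketch; thresholds are assumptions.

def sustained_lag_alert(lag_samples: list, threshold: int,
                        consecutive: int) -> bool:
    """Return True if lag exceeds `threshold` for `consecutive`
    samples in a row anywhere in the series."""
    streak = 0
    for lag in lag_samples:
        streak = streak + 1 if lag > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

Requiring a streak rather than a single breach trades a little detection latency for far fewer false alarms on bursty workloads.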

Posted Mar 04, 2026 - 11:50 PST

Resolved

This incident has been resolved.
Posted Feb 27, 2026 - 08:07 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Feb 27, 2026 - 08:05 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 27, 2026 - 07:50 PST

Investigating

We are currently investigating the issue. The impact is currently limited to the UI; pipeline executions continue to run as expected.
Posted Feb 27, 2026 - 06:44 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG)).