On 2/27/2026, customers experienced slowness when viewing running pipeline execution pages in the Harness UI. The issue was caused by delays in processing the events used to generate the pipeline execution graph.
The degradation began around 7:33 AM PT and resulted in delayed updates and slow loading of pipeline execution views. The engineering team identified the underlying performance bottleneck, applied mitigation measures, and restored normal system behavior after stabilizing the event processing pipeline.
The incident was caused by a temporary backlog in Kafka consumers responsible for processing orchestration log events, which are used to generate the execution graph for running pipelines.
The backlog was triggered by increased system load combined with performance degradation in a shared Elasticsearch cluster used by the pipeline processing services. During the incident window, Elasticsearch experienced a sudden spike in indexing activity, which caused resource contention and high CPU utilization on one of the cluster's nodes.
The resulting slowdown in Elasticsearch queries reduced the throughput of the Kafka consumers responsible for graph generation, causing consumer lag to accumulate and delaying updates in the pipeline execution UI.
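Consumer lag of this kind is simply the gap between each partition's latest (end) offset and the consumer group's last committed offset. A minimal sketch of that computation (the topic name and offset values below are illustrative placeholders, not data from the incident):

```python
# Consumer lag per partition = latest (end) offset - last committed offset.
# Topic name and offsets are illustrative only.
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag; a partition with no commit counts as lagging from offset 0."""
    return {
        tp: end - committed_offsets.get(tp, 0)
        for tp, end in end_offsets.items()
    }

end = {("orchestration-logs", 0): 1200, ("orchestration-logs", 1): 980}
committed = {("orchestration-logs", 0): 1150, ("orchestration-logs", 1): 980}

print(consumer_lag(end, committed))
# {('orchestration-logs', 0): 50, ('orchestration-logs', 1): 0}
```

When the consumers process events more slowly than producers write them (as happened here while Elasticsearch queries were slow), these per-partition numbers grow, and the UI falls correspondingly behind the live execution state.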
During the incident window, impact was limited to delayed updates and slow loading of pipeline execution views; other Harness services and pipeline execution functionality were not impacted.
Engineering teams implemented several mitigation steps to restore system performance:
These actions improved consumer processing throughput and allowed the Kafka backlog to drain. As consumer lag decreased, the pipeline execution UI returned to normal responsiveness.
To reduce the likelihood of similar incidents in the future, the following improvements are being implemented:
These measures will help ensure better isolation of workloads and faster detection of resource contention scenarios.
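As an illustration of the kind of faster detection described above, sustained consumer lag can be alerted on directly. The sketch below is a hypothetical Prometheus alerting rule; the `kafka_consumergroup_lag` metric name assumes the commonly used open-source kafka_exporter, and the consumer group name and threshold are placeholders, not Harness's actual configuration:

```yaml
groups:
  - name: kafka-consumer-lag
    rules:
      - alert: GraphGenerationConsumerLagHigh
        # kafka_consumergroup_lag is exported by the open-source kafka_exporter;
        # the consumergroup label and threshold are illustrative placeholders.
        expr: sum(kafka_consumergroup_lag{consumergroup="graph-generation"}) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Graph generation Kafka consumer lag is high"
```

Alerting on lag sustained over a window (the `for: 5m` clause) rather than on instantaneous spikes avoids paging on brief, self-recovering bursts while still catching the kind of accumulating backlog seen in this incident.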