Intermittent error while loading pipeline execution history page

Incident Report for Harness

Postmortem

Summary

On January 9, 2026, some customers experienced intermittent errors and slow responses while accessing pipeline execution details and execution lists. The engineering team identified the issue promptly and mitigated it, and service functionality was restored shortly afterward.

Impact

During the incident window, a subset of users may have encountered:

  • Intermittent failures or delays when loading pipeline execution details.
  • Occasional issues viewing execution history or execution lists.

Pipeline executions themselves continued to run as expected. There was no data loss, and no long-term impact to customer environments.

Root Cause

The issue was caused by elevated memory usage in a subset of service instances under load. When available memory dropped below required thresholds, certain requests related to loading execution data could not be processed successfully. Because the service instances remained partially healthy, they were not immediately recycled, resulting in intermittent request failures until mitigation was applied.
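As an illustration of the failure mode described above, here is a minimal sketch of a health check that reports an instance as unhealthy once its free memory falls below a floor, so that an orchestrator could recycle it instead of leaving it partially healthy. This is not the service's actual implementation. The names `MemorySample`, `has_enough_headroom`, and `liveness_status`, and the 10% floor, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemorySample:
    """One reading of process memory (hypothetical type for this sketch)."""
    used_bytes: int   # memory currently in use by the process
    limit_bytes: int  # configured ceiling, for example a heap or container limit


def has_enough_headroom(sample: MemorySample, min_free_ratio: float = 0.10) -> bool:
    """Return True when free memory is at least min_free_ratio of the limit."""
    if sample.limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    free_bytes = sample.limit_bytes - sample.used_bytes
    return free_bytes >= min_free_ratio * sample.limit_bytes


def liveness_status(sample: MemorySample, min_free_ratio: float = 0.10) -> int:
    """Map the headroom check to an HTTP status that a health probe could consume."""
    # 503 signals "unhealthy" so the instance can be restarted rather than
    # continuing to serve requests it cannot complete.
    return 200 if has_enough_headroom(sample, min_free_ratio) else 503
```

With these assumed values, an instance using 950 of 1000 bytes would report 503, while one using 500 of 1000 bytes would report 200.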

Mitigation

We performed a rolling restart of all pods, which immediately resolved the issue. As a preventive measure, we also increased the pod heap size.

Action Items

To prevent recurrence and improve resiliency, the following actions are being implemented:

  • Increase memory allocation to affected services to better handle peak load conditions.
  • Improve automatic recovery behavior for services encountering unrecoverable memory conditions.
  • Enhance monitoring and alerting for application-level memory usage to enable earlier detection.
  • Add safeguards to ensure degraded instances are identified and remediated more quickly.
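
The monitoring and alerting item above could take many forms. One common pattern, sketched here under stated assumptions, is to alert only when memory usage stays at or above a threshold for several consecutive samples, which avoids paging on brief spikes. The function name `should_alert`, the 85% threshold, and the three-sample window are illustrative and are not taken from the actual alerting configuration.

```python
from typing import Iterable


def should_alert(
    usage_ratios: Iterable[float],
    threshold: float = 0.85,
    sustained_samples: int = 3,
) -> bool:
    """Return True if usage is at or above threshold for
    sustained_samples consecutive readings."""
    streak = 0
    for ratio in usage_ratios:
        # Extend the streak on a high reading; reset it on a normal one.
        streak = streak + 1 if ratio >= threshold else 0
        if streak >= sustained_samples:
            return True
    return False
```

For example, the readings 0.5, 0.9, 0.9, 0.9 would trigger an alert, while 0.9, 0.5, 0.9, 0.9 would not, because the streak is reset by the low reading.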
Posted Jan 28, 2026 - 11:25 PST

Resolved

This incident has been resolved.
Posted Jan 09, 2026 - 10:30 PST

Update

We are continuing to monitor for any further issues.
Posted Jan 09, 2026 - 10:15 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 09, 2026 - 10:00 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).