The Pipeline Executions list page experienced intermittent 500 errors

Incident Report for Harness

Postmortem

Summary

On January 26, 2026, the Pipeline service in the production environment (Prod2) experienced intermittent failures affecting certain pipeline-related views. The issue was triggered by elevated memory usage in a subset of service instances, which caused specific API requests to fail. The issue was identified quickly through automated monitoring and resolved the same day. 

Impact

During the incident window, some customers may have experienced:

  • Intermittent failures when loading the Execution List or Execution Details pages.
  • Occasional issues accessing Retry History.
  • Rare intermittent failures loading the Pipeline List page.

The issue did not impact pipeline execution itself. Pipelines continued to run successfully, and there was no data loss. The impact was limited to UI/API visibility of execution metadata for a subset of requests.

Root Cause

The issue was caused by elevated heap memory pressure on a subset of Pipeline service instances, triggered by certain heavy backend operations. Under this pressure, some API requests served by those instances failed intermittently.

Because the affected instances were not fully unhealthy and continued responding to basic health checks, automated readiness checks did not trigger a restart. As a result, the impacted instances remained in a partially degraded state until manual mitigation was performed.

Mitigation

Immediate mitigation steps included:

  • Performing a rolling restart of the affected service instances, which restored normal operation.
  • Enabling configuration to automatically terminate service instances upon OutOfMemoryError, ensuring degraded instances do not remain in a bad state.

Service functionality was fully restored shortly after mitigation was applied.
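As an illustration of the auto-termination mitigation described above, on a HotSpot-based JVM this behavior is typically enabled with standard startup flags. The launch command below is a hedged sketch: the jar name and heap size are placeholders, not the actual service configuration; only the -XX flags are standard HotSpot options.

```shell
# Placeholder launch command illustrating the mitigation; the jar name and
# heap size are hypothetical, but the -XX flags are standard HotSpot options.
# -XX:+ExitOnOutOfMemoryError     -> terminate the JVM on the first OutOfMemoryError,
#                                    so the orchestrator replaces the instance instead
#                                    of leaving it partially degraded
# -XX:+HeapDumpOnOutOfMemoryError -> capture a heap dump for later analysis
java -XX:+ExitOnOutOfMemoryError \
     -XX:+HeapDumpOnOutOfMemoryError \
     -Xmx4g \
     -jar pipeline-service.jar
```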

Action Items

To reduce the risk of recurrence and improve early detection, the following actions are being implemented:

  • Enforce automatic JVM termination on OutOfMemoryError to ensure faster self-recovery.
  • Implement enhanced and granular monitoring and alerting for application-level heap memory utilization.
  • Improve health check logic to better detect partially degraded service states.
Posted Feb 10, 2026 - 23:01 PST

Resolved

This incident has been resolved.
Posted Jan 26, 2026 - 14:13 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 26, 2026 - 13:55 PST

Investigating

We are investigating intermittent failures loading pipeline-related pages. Pipeline executions themselves are not impacted; this is a UI/API visibility issue.
Posted Jan 26, 2026 - 13:20 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).