Pipeline logging streams experiencing degraded delivery to console.

Incident Report for Harness

Postmortem

Summary

On May 27, 2026, customers using Harness Pipelines on the legacy log-storage backend experienced intermittent failures and missing pipeline logs in the UI. No log data was permanently lost

Root cause

The root cause was CPU saturation of the log service's underlying cache infrastructure, triggered by an automation script from one enterprise customer that opened a large number of long-lived log-streaming connections without closing them.

Mitigation

Immediate actions taken during the incident:

  1. Rate limiting applied Traffic from the automation generating the runaway connections was rate-limited at the network layer, reducing new connection creation and allowing the CPU to begin recovering.
  2. Streaming connection timeout reduced: The maximum lifetime for legacy streaming connections was reduced from 1 hour to 10 minutes across production environments, limiting how long any single connection can remain open and reducing the steady-state connection count.
  3. Affected customer migrated to newer storage infra: The enterprise customer most affected by the incident was migrated to the new log storage path infrastructure. Once migrated, their logs were immediately accessible, confirming that no log data had been lost. The cache CPU returned to normal levels (~90% sustained CPU dropped) following this migration.

Next Steps

The following actions are in progress or planned to prevent recurrence:

  • Rate limiting applied for the specific traffic pattern that triggered this incident. (Done)
  • Streaming connection timeout reduced to 10 minutes across primary production environments. (Done)
  • Complete migration to new log storage infra : The primary long-term fix is finishing the migration of all remaining accounts

  • Enhance monitoring alerts: A broader audit of all monitoring alert rules is underway to identify any other critical detection paths that may be silently disabled.

  • Extend streaming connection timeout reduction globally: The 10-minute maximum streaming connection lifetime is being rolled out to all remaining environments (including development and free-tier clusters) to ensure consistent protection.

  • Per-account rate limiting on streaming connections: We are adding an application-level cap on concurrent streaming connections per account. The initial implementation will be observe-only (logging warnings when thresholds are approached) to gather data on real-world usage before converting to hard enforcement.

  • Improved error messaging: When log operations fail due to underlying infrastructure issues (cache timeouts, I/O errors), the error surfaces to users and logs will be updated to accurately reflect the infrastructure cause, rather than showing a generic "stream not found" message that obscures the true reason for failure.

Posted Jun 23, 2026 - 20:08 PDT

Resolved

This incident has been resolved.
Posted May 27, 2026 - 16:20 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 27, 2026 - 07:41 PDT

Investigating

We are currently investigating this issue.
Posted May 27, 2026 - 07:20 PDT
This incident affected: Prod 3 (Platform).