On January 26, 2026, the Pipeline service in the production environment (Prod2) experienced intermittent failures affecting certain pipeline-related views. The issue was triggered by elevated memory usage in a subset of service instances, which caused specific API requests to fail. The issue was identified quickly through automated monitoring and resolved the same day.
During the incident window, some customers may have experienced:

- Intermittent errors when loading certain pipeline-related views
- Failed API requests when retrieving pipeline execution metadata
The issue did not impact pipeline execution itself. Pipelines continued to run successfully, and there was no data loss. The impact was limited to UI/API visibility of execution metadata for a subset of requests.
The issue was caused by memory pressure in a subset of service instances, triggered by certain memory-heavy backend operations.
Because the affected instances were not fully unhealthy and continued responding to basic health checks, automated readiness checks did not trigger a restart. As a result, the impacted instances remained in a partially degraded state until manual mitigation was performed.
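The gap described above, where an instance passes a basic health check while being too memory-pressured to serve requests, can be closed by making the readiness check memory-aware. The following is a minimal sketch of that idea; the threshold value, function names, and Linux `/proc`-based measurement are illustrative assumptions, not the Pipeline service's actual implementation.

```python
# Hypothetical memory-aware readiness check: report "not ready" when
# resident memory exceeds a configured threshold, so the orchestrator
# can stop routing traffic to (or restart) a degraded instance.

RSS_THRESHOLD_BYTES = 2 * 1024**3  # illustrative limit: 2 GiB per instance


def read_rss_bytes() -> int:
    """Read the current resident set size from /proc (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # reported in kB
    raise RuntimeError("VmRSS not found in /proc/self/status")


def readiness(rss_bytes: int, threshold: int = RSS_THRESHOLD_BYTES) -> tuple[int, str]:
    """Return an HTTP-style (status, body) pair for a readiness probe.

    Unlike a basic liveness ping, this fails (503) under memory
    pressure even though the process is still responding.
    """
    if rss_bytes > threshold:
        return 503, "degraded: memory pressure"
    return 200, "ok"
```

Wiring `readiness(read_rss_bytes())` into the probe endpoint would have let the platform's automated checks distinguish "alive" from "able to serve", which is exactly the distinction the basic health checks missed during this incident.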
Immediate mitigation steps included:

- Identifying the impacted service instances from monitoring data
- Manually restarting the impacted instances to clear the degraded state
Service functionality was fully restored shortly after mitigation was applied.
To reduce the risk of recurrence and improve early detection, the following actions are being implemented:

- Adding alerting on sustained elevated memory usage, so that memory pressure is detected before requests begin to fail
- Extending readiness checks to account for partial degradation, so that affected instances are restarted automatically rather than requiring manual mitigation
- Reviewing and optimizing the memory-heavy backend operations that triggered the pressure
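One detection improvement is alerting on sustained memory pressure rather than single spikes, which avoids paging on transient allocations while still catching the slow degradation seen in this incident. A minimal sketch of that alert condition follows; the window size and threshold are illustrative assumptions, not production alerting configuration.

```python
from collections import deque


def sustained_breach(samples: list[float], threshold: float, window: int = 3) -> bool:
    """Return True when the most recent `window` samples all exceed `threshold`.

    A single high sample (e.g. a transient allocation) does not fire;
    only sustained memory pressure across the whole window does.
    """
    recent = deque(samples, maxlen=window)  # keep only the last `window` samples
    return len(recent) == window and all(s > threshold for s in recent)
```

In practice the same condition is usually expressed in the monitoring system itself (e.g. "memory above X% for N consecutive evaluation periods"), but the logic is the same: require the breach to persist before alerting.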