CI/STO Stage Failures

Incident Report for Harness

Postmortem

Summary
On June 6th, our SCM service experienced degraded performance that briefly impacted infrastructure pipeline operations. This was triggered by a high-volume onboarding of modules using SSH-based connectors via our Terraform provider, which caused an unexpected resource spike in core services. This effected all services that use the SCM service including CI, CD & STO.

Root Cause
The primary cause was a surge in SSH metadata validation commands executed concurrently by the one of the services. This overwhelmed the system’s ability to process requests, leading to service disruption.

Mitigation

We implemented rate limiting on module onboarding API calls (create, update, sync), reducing request frequency per pod, to ensure service availability.
Increased CPU allocation for IaC server pods to mitigate auto-scaling cascades.
Customers may notice that not all modules are created in a single run; reapplying the plan will complete the remaining modules.

Preventive and Long-Term Improvements

We are evaluating a queued onboarding system to smooth out load spikes and avoid manual scaling coordination.
Improvements in error handling and metadata synchronization logic are being addressed.

Posted Jul 24, 2025 - 22:47 PDT

Resolved

The issue has been resolved and we will post our findings in the post mortem.

Posted Jun 06, 2025 - 10:38 PDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 06, 2025 - 10:25 PDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jun 06, 2025 - 09:50 PDT

Investigating

We are currently investigating an issue for CI/STO stages getting stuck or aborted.

Posted Jun 06, 2025 - 09:30 PDT

This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).