CI/STO Stage Failures

Incident Report for Harness

Postmortem

Summary
On June 6th, our SCM service experienced degraded performance that briefly impacted infrastructure pipeline operations. The degradation was triggered by a high-volume onboarding of modules using SSH-based connectors via our Terraform provider, which caused an unexpected resource spike in core services. This affected all services that depend on the SCM service, including CI, CD, and STO.

Root Cause
The primary cause was a surge in SSH metadata validation commands executed concurrently by one of the services. The surge exceeded the system’s capacity to process requests, leading to service disruption.

Mitigation

  • We implemented rate limiting on module onboarding API calls (create, update, sync), reducing the request frequency per pod to preserve service availability.
  • We increased CPU allocation for IaC server pods to mitigate auto-scaling cascades.
  • Customers may notice that not all modules are created in a single run; reapplying the plan will complete the remaining modules.
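
The rate-limiting behavior described above can be illustrated with a minimal token-bucket sketch. This is not the actual Harness implementation: the class `TokenBucket`, the function `onboard_modules`, and the `rate` and `capacity` parameters are hypothetical names chosen for illustration. It shows why a single run may create only some modules and why a second run completes the rest.

```python
import time


class TokenBucket:
    """Allow roughly `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        # Refill tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def onboard_modules(modules, bucket):
    """Hypothetical onboarding loop: split modules into created now vs. deferred."""
    created, deferred = [], []
    for name in modules:
        (created if bucket.try_acquire() else deferred).append(name)
    return created, deferred
```

In this sketch, the deferred list plays the role of the modules that are left over after a rate-limited run; passing it to `onboard_modules` again once tokens have refilled corresponds to reapplying the plan.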

Preventive and Long-Term Improvements

  • We are evaluating a queued onboarding system to smooth out load spikes and avoid manual scaling coordination.
  • We are improving error handling and metadata synchronization logic.
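
As a rough illustration of the queued onboarding idea, the sketch below drains a queue of requests with a fixed number of workers, so the number of concurrent validation calls stays bounded regardless of how many modules are submitted at once. The design under evaluation may differ; `drain_onboarding_queue`, `handler`, and `workers` are hypothetical names, and the handler stands in for whatever validation step a real system would run.

```python
import queue
import threading


def drain_onboarding_queue(requests, handler, workers=2):
    """Process queued requests with at most `workers` handlers running at once."""
    pending = queue.Queue()
    for request in requests:
        pending.put(request)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                request = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            try:
                outcome = ("ok", handler(request))
            except Exception as exc:
                # Record the failure per request instead of crashing the worker.
                outcome = ("error", str(exc))
            with lock:
                results[request] = outcome

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Capping the worker count is the point of the sketch: a burst of submissions lengthens the queue rather than multiplying concurrent work, and per-request error capture keeps one failed validation from affecting the others.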
Posted Jul 24, 2025 - 22:47 PDT

Resolved

The issue has been resolved, and we will post our findings in the postmortem.
Posted Jun 06, 2025 - 10:38 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 06, 2025 - 10:25 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 06, 2025 - 09:50 PDT

Investigating

We are currently investigating an issue causing CI/STO stages to get stuck or be aborted.
Posted Jun 06, 2025 - 09:30 PDT
This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).