Prod2: CI/STO Pipeline Execution failing for Customers

Incident Report for Harness

Postmortem

Summary:

Customers encountered an issue with pipeline execution. Executions failed with the exception “Error Creating Plan: Could not create plan for node”. This impacted the execution of CI and STO stages.

Timeline:

Time (UTC) Event
March 6, 2025, 4:02 AM UTC The team reviewed the series of events for a previous incident and, because the load on pipeline runs was lower, decided to roll back the release to 1.66.1.
March 6, 2025, 9:03 AM UTC Customers reported that they were intermittently unable to run CI pipelines because of a plan creation error.
March 6, 2025, 9:17 AM UTC Status page was updated
March 6, 2025, 9:25 AM UTC Identified a gap in the licensing API that led to cache corruption.
March 6, 2025, 10:00 AM UTC CI Manager was deployed to version 1.67.3 (Prod 2) and the errors stopped.
March 6, 2025, 10:31 AM UTC Received customer confirmation that CI was operational again.
March 6, 2025, 11:45 AM UTC STO errors were still occurring because of the rollback.
March 6, 2025, 2:14 PM UTC STO was rolled forward to version 1.54.

Resolution:

The STO service was rolled forward to version 1.54 to resolve the issue.

RCA:

When a pipeline execution is triggered, we check the license details for the module and verify that a valid license exists. During this check, an unknown license type triggered an exception, which caused the pipeline execution to fail. The license details API had a gap in its license details fetch call; when that gap was hit, it corrupted the cache for subsequent executions with non-onboarded license types.
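
As an illustration only, the failure mode described above (an unknown license type raising an exception and leaving bad state in a cache) can be guarded against roughly as follows. This is a minimal sketch, not the actual Harness implementation: `LicenseType`, `LicenseCache`, `parse_license_type`, `has_valid_license`, and the shape of the fetch callback are all hypothetical names chosen for the example.

```python
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class LicenseType(Enum):
    """Hypothetical set of onboarded module license types."""
    CI = "CI"
    STO = "STO"
    UNKNOWN = "UNKNOWN"


def parse_license_type(raw: str) -> LicenseType:
    """Map a raw module name to a LicenseType without raising on unknown values."""
    try:
        return LicenseType(raw)
    except ValueError:
        return LicenseType.UNKNOWN


class LicenseCache:
    """Caches license details per (account, module); never stores failed lookups."""

    def __init__(self, fetch: Callable[[str, str], Optional[dict]]):
        # `fetch` stands in for the license details API call (assumed signature).
        self._fetch = fetch
        self._entries: Dict[Tuple[str, LicenseType], dict] = {}

    def get(self, account_id: str, raw_module: str) -> Optional[dict]:
        module = parse_license_type(raw_module)
        if module is LicenseType.UNKNOWN:
            # Non-onboarded type: skip the cache entirely so it cannot be polluted.
            return None
        key = (account_id, module)
        if key in self._entries:
            return self._entries[key]
        details = self._fetch(account_id, module.value)
        if details is not None:
            # Only successful lookups are cached.
            self._entries[key] = details
        return details


def has_valid_license(cache: LicenseCache, account_id: str, raw_module: str) -> bool:
    """Return True only when license details exist and are marked active."""
    details = cache.get(account_id, raw_module)
    return bool(details and details.get("active"))
```

The design choice in this sketch is that an unknown or failed lookup produces a plain "no valid license" answer for that one request and writes nothing to the cache, so later executions are unaffected.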

Action Items:

  • Improve alerting for plan creation errors
  • Improve automation tests to cover advanced filtering scenarios for the licensing API
Posted Mar 11, 2025 - 13:51 PDT

Resolved

This incident has been resolved.
Posted Mar 06, 2025 - 03:05 PST

Monitoring

We have applied the fix (internal test passed) and services are restored.
Posted Mar 06, 2025 - 02:17 PST

Identified

We are working on the fix.
Posted Mar 06, 2025 - 02:00 PST

Update

We are also seeing intermittent failures when creating new pipelines. We are currently investigating.
Posted Mar 06, 2025 - 01:31 PST

Investigating

Pipeline Execution failing for Customers in Prod2.
Posted Mar 06, 2025 - 00:50 PST
This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA)).