CI Pipelines with Step Groups failing to start with "Error fetching plan creation response from service"
Incident Report for Harness
Postmortem

Impact

CI pipelines that contained a step group started failing for customers with a plan creation failure error.

RCA

Each module must register with the pipeline service the steps, step groups, and stages it can execute. If a registration is missing, the pipeline service broadcasts an event to identify a service that can execute the step, step group, or stage. The first service to acknowledge the event is given the handle to execute it. During the incident, a new module, IACM, was deployed to production. When the next step group execution came in, the pipeline service broadcast an event to identify the service that should execute the step group. Because IACM responded first, it was given control of the step group. IACM did not have the context to execute a CI step group, which caused the failure.
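
The lookup described above can be sketched in a few lines of Python. This is an illustrative model only, not Harness code: the `PipelineService` class, its `register` and `resolve` methods, the step type names, and the ordered `responders` list (standing in for the order in which services acknowledge a broadcast) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineService:
    # Hypothetical registry: step type -> name of the module that registered it.
    registry: dict = field(default_factory=dict)

    def register(self, module, step_types):
        """Record that `module` can execute each of `step_types`."""
        for step_type in step_types:
            self.registry[step_type] = module

    def resolve(self, step_type, responders):
        """Return the module that will be asked to execute `step_type`.

        `responders` models a broadcast: module names in the order in which
        they acknowledge the event.
        """
        if step_type in self.registry:
            return self.registry[step_type]
        if not responders:
            raise LookupError(f"no service acknowledged {step_type!r}")
        # Fallback: the first acknowledgement wins, whether or not that
        # module actually has the context to run the step.
        return responders[0]


service = PipelineService()
service.register("ci", ["Run"])  # assumed: "StepGroup" was never registered

# A newly deployed module answers the broadcast before CI does.
owner = service.resolve("StepGroup", responders=["iacm", "ci"])
print(owner)
```

In this model a registered step type always goes to its registered module, while an unregistered one goes to whichever service answers first, which is the race the RCA describes.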

Incident timeline:

All times are in PDT on March 14, 2023.

10:14 - We received a P1 issue from a customer regarding pipeline failures and updated the Harness status page accordingly.

10:16 - The issue was identified and work on a fix began.

10:25 - A fix was implemented and we began monitoring the results.

10:29 - The incident was marked resolved on the Harness status page after we validated that the issue was fixed.

Remediation:

  • Shut down the iacm-manager service
  • Refreshed the pipeline service cache

Action items:

  • The pipeline service should honor module affinity based on the stage's module details. It should broadcast to find other services that can execute a step only if that module cannot execute it.
  • Add controls to ensure that all modules register their steps, step groups, and stages with the pipeline service.
  • All dependent services should ensure that automation runs for dependent services are executed before release.
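
As a rough illustration of the first action item, the sketch below checks the stage's own module before falling back to a broadcast, and validates each responder's capability instead of accepting the first acknowledgement. The function name `resolve_with_affinity`, the `capabilities` map, and the step type names are hypothetical and chosen only for this example; they are not Harness APIs.

```python
def resolve_with_affinity(stage_module, step_type, capabilities, responders):
    """Pick the module that should execute `step_type`.

    stage_module: module that owns the stage (the affinity hint).
    capabilities: hypothetical map of module name -> set of step types it
                  can execute.
    responders:   module names in the order they acknowledge a broadcast.
    """
    # 1. Honor module affinity: prefer the module that owns the stage.
    if step_type in capabilities.get(stage_module, set()):
        return stage_module

    # 2. Only then broadcast, and accept a responder only if it can
    #    actually execute the step.
    for module in responders:
        if step_type in capabilities.get(module, set()):
            return module

    raise LookupError(f"no module can execute {step_type!r}")


capabilities = {
    "ci": {"StepGroup", "Run"},
    "iacm": {"IacmOnlyStep"},
}

# IACM acknowledges first, but the CI stage keeps its own step group.
chosen = resolve_with_affinity("ci", "StepGroup", capabilities, ["iacm", "ci"])
print(chosen)
```

With this ordering, a faster acknowledgement from another module no longer changes which module runs a step that the stage's own module can handle.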
Posted Mar 17, 2023 - 05:42 PDT

Resolved
This incident has been resolved.
Posted Mar 14, 2023 - 10:29 PDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 14, 2023 - 10:25 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 14, 2023 - 10:16 PDT
Investigating
We are currently investigating this issue.
Posted Mar 14, 2023 - 10:14 PDT
This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Cloud Builds).