Pipeline initialization failures observed while resolving environment variable expressions.
Incident Report for Harness
Postmortem

Summary

On 21 December 2023 at 1:10 PM PST, two customers reported failures in their CI pipeline executions on our Prod-2 cluster.

A FireHydrant incident was opened some time afterward to track the issue.

Timeline (PST)

Time Event
1:24 PM Confirmed that no pipeline-service deployment had been made and that the issue affected only a few Prod2 and Prod1 CI customers.
1:28 PM Verified that CI automation was running fine, but we were able to reproduce the issue.
2:04 PM Rolled back the Prod2 CI service to the previous version and confirmed with customers that the issue was mitigated.

Resolution

We rolled back the CI build in the Prod2 cluster to unblock the affected customers.

Total Downtime

  • Downtime taken: None
  • Resolution time*: 1 hour 46 minutes

* Resolution time = time from the report to restoration, through either a rollback or a hotfix (HF).

RCA

A change to a common deserialiser added handling for string values that contain a JSON list, for example "[1,2]". Such a value was converted to a List of String regardless of whether the target field expected type String, so fields declared as String received a list and an exception was thrown during execution.
This was mainly observed for a customer who had a value of this form set in envVariables in a RunStep in the CI stage.
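
The failure mode described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the actual deserialiser code; the function names `coerce_untyped`, `coerce_typed`, and `bind_env_variable` are hypothetical and exist only for this illustration. The first function converts any string that parses as a JSON list, the second does so only when the target field expects a list.

```python
import json
from typing import Any


def coerce_untyped(raw: str) -> Any:
    """Illustrates the faulty behaviour: a string that parses as a JSON
    list is always converted to a list of strings, whatever the field type."""
    text = raw.strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            return raw  # not valid JSON, keep the original string
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    return raw


def coerce_typed(raw: str, expected_type: type) -> Any:
    """Type-aware variant: convert only when the target field expects a list."""
    if expected_type is list:
        return coerce_untyped(raw)
    return raw


def bind_env_variable(value: Any) -> str:
    """Hypothetical stand-in for a field declared as String."""
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    return value


if __name__ == "__main__":
    # The type-aware path leaves the environment variable value untouched.
    print(bind_env_variable(coerce_typed("[1,2]", str)))
    # The untyped path hands a list to a String field and fails.
    try:
        bind_env_variable(coerce_untyped("[1,2]"))
    except TypeError as exc:
        print(f"execution failed: {exc}")
```

In this sketch the conversion is gated on the declared type of the target field, so a String field such as an environment variable value keeps "[1,2]" as a plain string.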

Action Items

  1. Update our customer setup automation to include this setup, along with any others that are missing, so that the suite stays current and feature development does not impact existing customer setups.
  2. Add failover to code paths when changing existing flows, to minimize the impact of new feature and bug-fix development on existing running setups.
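
One possible shape for the failover in the second action item is sketched below in Python. It is an assumption for illustration only, not the implemented design: the helper `with_fallback` and the functions `new_resolve` and `legacy_resolve` are hypothetical names. The idea is that a newly introduced code path runs first and, if it raises, the previous behaviour is used and the failure is logged.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("failover-sketch")


def with_fallback(new_path: Callable[[], T], legacy_path: Callable[[], T]) -> T:
    """Run the new code path; on any exception, log it and use the legacy path."""
    try:
        return new_path()
    except Exception:
        logger.warning("new code path failed; falling back to legacy path",
                       exc_info=True)
        return legacy_path()


def legacy_resolve(value: str) -> str:
    """Hypothetical previous behaviour: pass the value through unchanged."""
    return value


def new_resolve(value: str) -> str:
    """Hypothetical new behaviour that fails for list-like input."""
    if value.startswith("["):
        raise ValueError("list-like value not supported for a String field")
    return value.strip()


if __name__ == "__main__":
    raw = "[1,2]"
    # The new path raises for this input, so the legacy result is returned.
    print(with_fallback(lambda: new_resolve(raw), lambda: legacy_resolve(raw)))
```

A fallback of this kind only helps when the new path fails at the point where it is wrapped, so the wrapper would need to cover the step where the exception actually surfaces.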
Posted Dec 25, 2023 - 22:15 PST

Resolved
The incident has been resolved. We plan to publish the Postmortem early next week. Our pipeline execution failure rate was less than 1% during this incident. As a result, no downtime was taken.
Posted Dec 21, 2023 - 14:37 PST
Monitoring
We have reverted the services back to the previous version and we are monitoring the results.
Posted Dec 21, 2023 - 14:11 PST
Update
We are continuing to investigate this issue.
Posted Dec 21, 2023 - 14:06 PST
Investigating
We are currently investigating this issue.
Posted Dec 21, 2023 - 14:05 PST
This incident affected: Prod 1 (Continuous Integration Enterprise(CIE) - Cloud Builds, Continuous Integration Enterprise(CIE) - Self Hosted Runners) and Prod 2 (Continuous Integration Enterprise(CIE) - Cloud Builds, Continuous Integration Enterprise(CIE) - Self Hosted Runners).