Intermittent failures for hosted CI build
Incident Report for Harness
Postmortem

Overview

Our hosted builds run on virtual machines (VMs) in a multicloud environment. We have a fallback in place for cases where a VM fails to initialize in a specific cloud or region. We were notified that a customer pipeline was experiencing initialization failures. The failures occurred in the specific cases where VM initialization failed in the primary and the fallback to the secondary failed as well.

Time (PST)            Event
12/07/2023 11:40 AM   Notified of initialization failures
12/07/2023 2:40 PM    Added additional fallback to another region to fix initialization failures

Resolution

We added multiple levels of fallback to mitigate the VM initialization issue.
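
The multi-level fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual Harness implementation: the names `Pool`, `VMInitError`, and `init_vm_with_fallback` are hypothetical, and the `init_vm` callable stands in for whatever provisions a VM in a given cloud and region.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class VMInitError(Exception):
    """Hypothetical error raised when a VM cannot be initialized in a pool."""


@dataclass(frozen=True)
class Pool:
    """Hypothetical description of one place a build VM can be started."""
    cloud: str
    region: str


def init_vm_with_fallback(pools: Sequence[Pool],
                          init_vm: Callable[[Pool], str]) -> str:
    """Try each pool in order and return the first VM that initializes.

    `pools` is ordered by preference: primary first, then each fallback.
    `init_vm` is an assumed provisioning function that returns a VM id
    or raises VMInitError.
    """
    errors: List[Tuple[Pool, VMInitError]] = []
    for pool in pools:
        try:
            return init_vm(pool)
        except VMInitError as exc:
            # Remember the failure and move on to the next fallback level.
            errors.append((pool, exc))
    # Every level failed: surface all collected errors to the caller.
    raise VMInitError(f"all {len(pools)} pools failed: {errors}")
```

With only a primary and one secondary in the list, a simultaneous failure of both reaches the final `raise`, which matches the failure mode in this incident. Appending another region to `pools` adds one more level before the build fails.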

Affected Accounts

A total of 15 customers were affected. Only a few pipeline executions failed as a result, so we are recording a partial outage of 3 hours for Prod 1 and 2 hours 38 minutes for Prod 2.

RCA

Every fallback used to initialize VMs for Harness CI build pipelines was failing. We added further fallbacks to mitigate the VM initialization issue and worked on fixing the causes of the higher failure rate in the existing fallbacks.

Action Items

  • Review fallbacks for VM initialization and make them fail-safe.
  • Improve alerting for VM initialization failures. The previous alerting triggered on a percentage failure rate. It has been updated to alert on even a single VM initialization failure.
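
The difference between the two alerting rules can be sketched as below. This is only an illustration: the function names and the 5% threshold are assumptions, since the report does not state the percentage that was previously used.

```python
def rate_alert(failures: int, attempts: int, threshold: float = 0.05) -> bool:
    """Previous style of rule: alert when the failure rate reaches a threshold.

    The 0.05 default is a made-up value for illustration only.
    """
    return attempts > 0 and failures / attempts >= threshold


def any_failure_alert(failures: int) -> bool:
    """Updated style of rule: alert on even a single initialization failure."""
    return failures >= 1


# A handful of failed initializations among many successful ones stays
# under a percentage threshold but still trips the single-failure rule.
few_failures, many_attempts = 2, 1000
rate_fires = rate_alert(few_failures, many_attempts)
any_fires = any_failure_alert(few_failures)
```

This mirrors the incident: only a few pipeline executions failed, so a rate-based rule can stay silent while affected customers still see failed builds.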
Posted Dec 12, 2023 - 09:59 PST

Resolved
We have added an additional fallback region to Prod 1 and Prod 3, and the issue is now resolved.
Posted Dec 07, 2023 - 14:41 PST
Update
The issue is resolved for Prod 2, where we have added an additional fallback region.
We are actively working on rolling this out to Prod 1 and Prod 3.
Posted Dec 07, 2023 - 14:19 PST
Update
We are continuing to work on a fix for this issue.
Posted Dec 07, 2023 - 14:17 PST
Identified
We are currently experiencing availability issues with our fallback region for cloud builds. The issue occurs intermittently when we switch from the primary to the fallback because of errors in the primary. The team is actively working on adding an additional fallback option in another region to address this issue.
Posted Dec 07, 2023 - 13:27 PST
Investigating
We are experiencing intermittent failures on cloud builds due to a resource issue. We are currently investigating.
Posted Dec 07, 2023 - 12:08 PST
This incident affected: Prod 3 (Continuous Integration Enterprise(CIE) - Cloud Builds), Prod 1 (Continuous Integration Enterprise(CIE) - Cloud Builds), and Prod 2 (Continuous Integration Enterprise(CIE) - Cloud Builds).