Postmortem
Summary
On February 25th, Google experienced a Compute Engine incident in the us-west1-a zone in which some nodes, specifically E2 and N1 machine types, rebooted unexpectedly. The reboots caused containers on the affected nodes to restart ungracefully.
Resolution
Our monitoring systems alerted us to the issue. In response, we proactively used nodeAffinity to move core service workloads out of the us-west1-a zone in the affected environments until Google resolved the issue, mitigating potential customer impact.
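As an illustration of the nodeAffinity approach described above, the following is a minimal sketch of a patch body that keeps a workload's pods off nodes in a given zone. It is not the exact rule used during the incident. It assumes the workloads are Kubernetes Deployments and that nodes carry the standard topology.kubernetes.io/zone label; the function name zone_exclusion_patch is hypothetical.

```python
import json


def zone_exclusion_patch(excluded_zones):
    """Build a Deployment patch body that keeps pods off nodes in the given zones.

    Assumption: nodes are labeled with the well-known
    topology.kubernetes.io/zone label (illustrative, not taken from the incident).
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "affinity": {
                        "nodeAffinity": {
                            # Hard scheduling requirement; running pods are not evicted by it.
                            "requiredDuringSchedulingIgnoredDuringExecution": {
                                "nodeSelectorTerms": [
                                    {
                                        "matchExpressions": [
                                            {
                                                "key": "topology.kubernetes.io/zone",
                                                "operator": "NotIn",
                                                "values": list(excluded_zones),
                                            }
                                        ]
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    }


# Exclude the affected zone and print the patch as JSON.
patch = zone_exclusion_patch(["us-west1-a"])
print(json.dumps(patch, indent=2))
```

Because the rule only applies at scheduling time, existing pods move only when they are recreated. Applying such a patch to a Deployment changes its pod template, which triggers a rollout, so replacement pods would land on nodes outside the excluded zone. Removing the rule after the provider incident is resolved would allow pods to schedule in that zone again.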
RCA
Google has yet to post an RCA for their incident, but a small blurb from the resolved incident page states, “From preliminary analysis, the issue was due to a latent bug that manifested under specific conditions, which resulted in unexpected VM reboots in the us-west1-a zone.”
Action Items
There was no known customer impact due to this incident because our workloads are multi-zonal, and our actions were entirely proactive to prevent possible impact.
Posted Apr 23, 2025 - 13:48 PDT
Resolved
Google has marked their incident as resolved and stated that VMs utilizing the us-west1-a zone are fully operational again.
Posted Feb 25, 2025 - 11:50 PST
Update
Google has marked their incident as resolved and stated that VMs utilizing the us-west1-a zone are fully operational again.
Posted Feb 25, 2025 - 11:48 PST
Update
GCP is experiencing an issue with VMs in the us-west1-a zone. We have migrated our critical workloads out of this zone to avoid any customer impact, and we are fully operational. We will continue to monitor the GCP incident in case its scope changes.
Posted Feb 25, 2025 - 11:09 PST
Monitoring
We are experiencing an issue with Google Compute Engine beginning at 2025-02-25 01:41 UTC.
This is causing some services to restart intermittently, resulting in some workloads terminating unexpectedly.
Our engineering team is working with GCP to investigate the issue and will post updates as we receive them from Google.
Posted Feb 25, 2025 - 06:54 PST
This incident affected:
Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository),
Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository),
Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository),
Prod 4 (Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Code Repository),
Software Engineering Insights FirstGen (fka Propelo) (Software Engineering Insights FirstGen (fka Propelo) - EU, Software Engineering Insights FirstGen (fka Propelo) - US),
and Service Reliability Management - Error Tracking FirstGen (fka OverOps).