A deployment pipeline run on Prod4 removed a few workload identities from the DR cluster. Because these identities are shared between the Primary and DR clusters, pods in both clusters that depend on workload identity failed, affecting service availability.
Customers' CI/CD pipelines started failing in the Prod4 environment.
Time (UTC) | Event |
---|---|
January 24 12:40 AM UTC | During the Prod4 deployment, some services failed to come up healthy, and a FireHydrant incident was triggered. |
January 24 12:57 AM UTC | The issue was identified as missing workload identity bindings in the DR cluster. The team decided to redeploy to restore the configuration. |
January 24 1:30 AM UTC | The redeployment fixed the issue by syncing the Terraform state, which resolved the mismatch in the DR cluster configuration. |
The redeployment resolved the issue by aligning the Terraform state with the intended configuration, which restored the missing workload identity bindings in the DR cluster.
During a recent Disaster Recovery (DR) exercise, manual changes were made to the DR cluster but were not captured in the Terraform configuration. When the deployment pipeline ran, Terraform applied its last known state and inadvertently removed the workload identity bindings in the DR cluster. Because these bindings are shared with the Primary cluster, pods failed in both the Primary and DR clusters, which caused CI/CD pipelines to fail in Prod4.
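One way to catch this class of failure before it reaches a cluster is to inspect the Terraform plan for deletions before applying it. The sketch below is illustrative only and is not the pipeline's actual implementation. It assumes the plan has been exported with `terraform show -json <planfile>` and relies on the `resource_changes` array of that JSON format. The function name `find_deletions`, the sample plan, and the resource addresses in it are all hypothetical.

```python
from typing import Any


def find_deletions(plan: dict[str, Any]) -> list[str]:
    """Return the addresses of resources the plan would delete or replace.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    A replacement appears as a two-element action list that includes
    "delete", so checking for membership covers both cases.
    """
    deleted = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if "delete" in actions:
            deleted.append(resource_change["address"])
    return deleted


# Hypothetical plan excerpt; the addresses are made up for illustration.
sample_plan = {
    "resource_changes": [
        {
            "address": "module.dr.workload_identity_binding.service_a",
            "change": {"actions": ["delete"]},
        },
        {
            "address": "module.dr.workload_identity_binding.service_b",
            "change": {"actions": ["delete", "create"]},
        },
        {
            "address": "module.dr.node_pool.default",
            "change": {"actions": ["no-op"]},
        },
    ]
}

if __name__ == "__main__":
    doomed = find_deletions(sample_plan)
    if doomed:
        # A real pipeline gate might fail the job here and require approval.
        print("Plan would delete or replace:")
        for address in doomed:
            print(f"  {address}")
```

A pipeline step could fail the job, or require manual approval, whenever the returned list is non-empty. That would surface drift introduced by manual changes before Terraform removes anything.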
Action Item