CI/CD Pipeline failure on Prod4

Incident Report for Harness

Postmortem

Summary:

A deployment pipeline execution on Prod4 removed a few workload identities from the DR cluster. Because these identities are shared between the Primary and DR clusters, pods in both clusters that depend on workload identity failed, affecting service availability.

What was the issue?

Customers experienced CI/CD pipeline failures in the Prod4 environment.

Timeline:

Time (UTC) Event
January 24 12:40 AM UTC During the Prod4 deployment, some services failed to come up healthy, and a FireHydrant incident was triggered.
January 24 12:57 AM UTC The issue was identified as missing workload identity bindings in the DR cluster. The team decided to redeploy to restore the configuration.
January 24 1:30 AM UTC The redeployment fixed the issue by syncing the Terraform state, which resolved the mismatch in the DR cluster configuration.

Resolution:

Re-deployment resolved the issue by ensuring that the Terraform state was aligned with the intended configuration, which restored the missing workload identity bindings in the DR cluster.

RCA

During a recent Disaster Recovery (DR) exercise, manual changes were made to the DR cluster but were not captured in the Terraform configuration. When the deployment pipeline ran, Terraform applied its last known state and inadvertently removed the workload identity bindings in the DR cluster. Because those identities are shared between the Primary and DR clusters, pods failed in both clusters, causing CI/CD pipelines to fail in Prod4.

Action Items

  1. Ensure no manual changes are made to the system. In case of unforeseen manual changes, document them and incorporate them into Terraform.
  2. Automate Drift Detection: Implement automated drift detection to identify discrepancies between the live infrastructure and Terraform state.
  3. Pre-Deployment Validations: Introduce additional pre-deployment checks to verify workload identity bindings before applying changes.
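
As an illustration of items 2 and 3, the following is a minimal Python sketch, not the tooling actually used. It assumes drift is detected with `terraform plan -detailed-exitcode` (which exits with 0 for no changes, 1 for an error, and 2 when changes are present) and that the pre-deployment check reads the plan in the JSON form produced by `terraform show -json`. The function names, the `type_keyword` parameter, and the resource type `example_workload_identity_binding` are hypothetical; the report does not name the real resource types.

```python
import json

# Exit codes of `terraform plan -detailed-exitcode`.
PLAN_EXIT_MEANINGS = {0: "no_changes", 1: "error", 2: "changes_present"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a drift status (hypothetical helper)."""
    return PLAN_EXIT_MEANINGS.get(code, "unknown")


def find_deletions(plan_json: str, type_keyword: str) -> list:
    """Return addresses of resources the plan would delete.

    Only resources whose type contains `type_keyword` are reported.
    The keyword is an assumption; substitute the real resource type.
    """
    plan = json.loads(plan_json)
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create",
        # so it is flagged as well.
        if "delete" in actions and type_keyword in change.get("type", ""):
            flagged.append(change["address"])
    return flagged


if __name__ == "__main__":
    # Hand-written sample in the shape of a JSON plan, for illustration only.
    sample = json.dumps({
        "resource_changes": [
            {
                "address": "example_workload_identity_binding.dr",
                "type": "example_workload_identity_binding",
                "change": {"actions": ["delete"]},
            },
            {
                "address": "example_bucket.logs",
                "type": "example_bucket",
                "change": {"actions": ["update"]},
            },
        ]
    })
    print(classify_plan_exit_code(2))
    print(find_deletions(sample, "workload_identity"))
```

In a pipeline, a scheduled job could alert whenever the status is `changes_present`, and the deployment stage could stop before the apply step if `find_deletions` returns any address.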
Posted Mar 06, 2025 - 11:53 PST

Resolved

This incident has been resolved.
Posted Jan 23, 2025 - 17:32 PST

Investigating

The Harness service is experiencing performance issues. We are working to identify the cause and restore normal operations as soon as possible.
Posted Jan 23, 2025 - 16:40 PST
This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO)).