GitOps service was impacted in Prod-3
Incident Report for Harness
Postmortem

Summary

The GitOps project overview page failed to load for some customers following an upgrade.

Root Cause

One customer with a large number of applications triggered an API call that initiated application synchronisation, which placed a heavy load on our database. The load resulted in locks in the database, and those locks interfered with the rollback of the changes introduced by the upgrade.

Timeline

TIME                          EVENT
1st July 2024 10:00 AM UTC    Received a customer escalation.
1st July 2024 10:20 AM UTC    Rolled back the upgrade, but GitOps remained down.
1st July 2024 10:31 AM UTC    Identified the issue and initiated a new deployment.
1st July 2024 10:31 AM UTC    The GitOps service was restored.

Resolution

We extended the synchronisation duration and manually removed the database locks, which allowed the pods to start and restored the GitOps service.

Follow-up action items

  1. Increase the duration of AppSyncs.
  2. Add a control to disable AppSync when necessary.
  3. Implement the capability to stop traffic from a specific customer, if required.
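
As an illustration of items 2 and 3, the following is a minimal sketch of what such controls could look like, not a description of the actual implementation. The class name TrafficControls, its fields, and the account IDs are hypothetical, and the in-memory state is an assumption made to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficControls:
    """Hypothetical operator switches, kept in memory for illustration only."""

    # Accounts whose traffic should be rejected (item 3).
    blocked_accounts: set[str] = field(default_factory=set)
    # Global switch that disables AppSync when turned off (item 2).
    app_sync_enabled: bool = True

    def allow_request(self, account_id: str) -> bool:
        """Return True unless traffic from this account has been stopped."""
        return account_id not in self.blocked_accounts

    def allow_app_sync(self, account_id: str) -> bool:
        """Return True only if AppSync is enabled and the account is not blocked."""
        return self.app_sync_enabled and self.allow_request(account_id)


controls = TrafficControls()
controls.blocked_accounts.add("acct-heavy")  # hypothetical account ID
```

With this sketch, a request handler would call allow_request before doing any work, and the sync path would call allow_app_sync before querying the database, so that one account's load can be shed without a redeployment.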
Posted Sep 05, 2024 - 03:03 PDT

Resolved
The GitOps service was impacted during an internal release and was unavailable for a short while. The issue was resolved on rollback.
Posted Jul 01, 2024 - 03:43 PDT
This incident affected: Prod 3 (Continuous Delivery - Next Generation (CDNG)).