We want to share the details about the incident where the pipelines could not advance after a step was completed. This impacted the deployments and builds in Prod-1 and Prod-2 clusters between 8:28 AM — 9:48 AM PT on Nov 30th, 2022. Next Gen Continuous Delivery, Continuous Integration, Service Reliability Management, Feature Flags, and Security Testing Orchestration were the modules that got impacted. Harness Current Gen Modules were not affected.
Harness pipeline service relies on a third-party in-memory database provider. A rollout of the wrong configuration due to human error by third-party personnel caused the harness pipeline service failure. The vendor initiated a project to replace the self-signed server certificate with a signed certificate by GlobalSign across their fleet. They executed the first step for some of the non-TLS-enabled database clusters. By mistake, Harness clusters got added to the batch resulting in an outage since the client didn’t trust the new certificate.
The vendor reverted their incorrect config changes by rolling back the server certificate across the Harness clusters.