On December 11, starting at 4:25 PM (all times UTC), Harness had a service outage that affected pipelines in our Prod2 environment. Specifically, CI and CD pipeline executions in NextGen that used secrets failed. The incident was resolved at 4:50 PM the same day.
This incident is related to the incident of the previous week (2023-12-07).
Time | Event |
---|---|
Dec 11, 4:25 PM | Harness detected pipelines were failing to resolve secrets. |
Dec 11, 4:28 PM | The incident was acknowledged and a P0 incident was declared. |
Dec 11, 4:35 PM | Root cause identified |
Dec 11, 4:50 PM | Incident resolved |
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve and store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager.
On 2023-12-07, there was an incident in which a bad secret manager configuration led to thread exhaustion. To mitigate that incident, we updated the faulty configuration in the database and restarted the affected services. The current incident was a downstream effect of that earlier incident.
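To illustrate the thread-exhaustion failure mode in general terms, the following minimal sketch shows how calls that hang on a misconfigured connector can occupy every worker in a bounded thread pool, so that a lookup against a healthy connector has to wait. This is not Harness code: `resolve_secret`, the connector dictionaries, the pool size, and `HANG_SECONDS` are all hypothetical, and the hang is simulated with a sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumption: how long a call against the bad configuration blocks its thread.
HANG_SECONDS = 0.5


def resolve_secret(connector: dict, name: str) -> str:
    """Hypothetical stand-in for a request to an external secret manager."""
    if connector["misconfigured"]:
        # Simulate a call that blocks a worker thread before failing.
        time.sleep(HANG_SECONDS)
        raise ConnectionError(f"cannot authenticate connector {connector['id']}")
    return f"resolved:{name}"


def time_healthy_lookup(pool_size: int, bad_calls: int) -> float:
    """Return the seconds a healthy lookup takes when submitted after
    `bad_calls` hanging lookups on a pool of `pool_size` workers."""
    bad = {"id": "bad-connector", "misconfigured": True}
    good = {"id": "good-connector", "misconfigured": False}
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Each hanging call ties up one worker for HANG_SECONDS.
        for i in range(bad_calls):
            pool.submit(resolve_secret, bad, f"secret-{i}")
        start = time.monotonic()
        healthy = pool.submit(resolve_secret, good, "db-password")
        value = healthy.result()  # waits until a worker is free to run it
        elapsed = time.monotonic() - start
    assert value == "resolved:db-password"
    return elapsed


if __name__ == "__main__":
    print("all workers blocked:", round(time_healthy_lookup(2, 2), 2), "s")
    print("one worker free:   ", round(time_healthy_lookup(2, 1), 2), "s")
```

When every worker is blocked, the healthy lookup should take roughly `HANG_SECONDS`; when one worker is free, it should return almost immediately. In a real service the blocking calls can last much longer and keep arriving, which is how a single faulty configuration can starve unrelated secret lookups.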
Mitigation and Remediation