CD and CI pipelines that reference secrets are experiencing failures.
Incident Report for Harness
Postmortem

Incident

On December 11, 2023, starting at 4:25 PM (all times UTC), Harness had a service outage that affected pipelines in our Prod2 environment. Specifically, NextGen CI and CD pipeline executions that used secrets failed. The incident was resolved at 4:50 PM the same day.

This incident is related to the incident of December 7, 2023, described under Root Cause.

Timeline

Time             Event
Dec 11, 4:25 PM  Harness detected that pipelines were failing to resolve secrets.
Dec 11, 4:28 PM  Incident was acknowledged and a P0 incident was declared.
Dec 11, 4:35 PM  Root cause identified.
Dec 11, 4:50 PM  Incident resolved.

Root Cause

Background

Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve and store secrets used by pipelines and elsewhere in the Harness platform. Each external secret manager connector requires configuration, including a means of authenticating to the external secret manager.

On 2023-12-07, there was an incident in which a bad secret manager configuration led to thread exhaustion. To mitigate that incident, we updated the faulty configuration in the database and restarted the affected services. The current incident was a downstream effect of that earlier incident.

Mitigation and Remediation

  • In the prior incident, we manually updated the configuration that controlled the broken secret manager connector. During that cleanup, we unintentionally left a dangling database entry. Had we updated the connector through the API, this entry would have been cleaned up correctly.
  • After discovering the dangling entry, we deleted the secret through the API and restarted the affected services.

Followup/Action Items

  • On Friday, December 8, we rolled out a hotfix to prevent the creation of such faulty configurations. However, it did not help in this case because the faulty configuration already existed.
  • Additional runtime validation was already in the works. It detects the self-reference when a secret is used in a pipeline execution, and it has since been rolled out.
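
The self-reference check mentioned in the last item can be illustrated with a minimal sketch. This is not the Harness implementation. It assumes a hypothetical data model in which each secret manager connector authenticates with a credential secret and each secret is stored in exactly one secret manager. The names find_reference_cycle, connector_auth, and secret_store are invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical model (an assumption, not Harness's schema):
#   connector_auth: connector id -> id of the secret used to authenticate to it
#                   (None means the connector needs no externally stored credential)
#   secret_store:   secret id -> id of the connector (secret manager) storing it


def find_reference_cycle(
    start: str,
    connector_auth: Dict[str, Optional[str]],
    secret_store: Dict[str, str],
) -> Optional[List[str]]:
    """Follow connector -> credential secret -> storing connector.

    Returns the list of connector ids forming a cycle, or None if the
    chain terminates. A direct self-reference is a cycle of length one.
    """
    path: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            # We came back to a connector already on the path: report the loop.
            return path[path.index(current):] + [current]
        seen.add(current)
        path.append(current)
        secret_id = connector_auth.get(current)
        if secret_id is None:
            return None  # chain ends at a connector with no stored credential
        current = secret_store.get(secret_id)
    return None  # credential secret is not stored in any known connector


# Example: "vault" authenticates with a token that is stored in "vault" itself.
connector_auth = {"vault": "vault_token", "gsm": "gsm_key", "builtin": None}
secret_store = {"vault_token": "vault", "gsm_key": "builtin"}

self_ref = find_reference_cycle("vault", connector_auth, secret_store)
healthy = find_reference_cycle("gsm", connector_auth, secret_store)
```

Running a check like this before resolving a secret allows the execution to fail fast with a clear error, rather than resolving the same connector repeatedly.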
Posted Dec 13, 2023 - 11:02 PST

Resolved
This incident has been resolved.
Posted Dec 11, 2023 - 09:04 PST
Monitoring
The issue has been resolved. We are continuing to monitor the incident.
Posted Dec 11, 2023 - 09:01 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Dec 11, 2023 - 08:42 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Self Hosted Runners).