On December 7th, starting around 9 PM (all times UTC), Harness experienced an outage that affected pipelines in our Prod2 environment. Specifically, NextGen CI and CD pipelines that used secrets failed during execution. FirstGen pipelines also experienced intermittent downtime during service restart events. The incident was resolved on December 8th at 3:01 AM.
Time | Event |
---|---|
Dec 7, 9:13 PM | First customer reported issue. Triaged as likely a result of a separate ongoing incident. |
Dec 7, 10:34 PM | Incident acknowledged as independent of the separate incident, and a new incident declared. |
Dec 8, 2:13 AM | Root cause identified |
Dec 8, 3:01 AM | Incident resolved |
Performance degradation and execution failures were reported across Continuous Integration (CI) and Continuous Deployment (CD) pipelines starting at 9:13 PM on Dec 7th. A high-severity incident was declared at 10:30 PM.
Harness uses connectors to external secret managers (e.g. Google Secret Manager or HashiCorp Vault) to resolve and store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager.
Mitigation consisted of:
A hotfix has been released to ensure that configuration validation checks for self-reference.
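To illustrate what a self-reference check on connector configuration might look like, here is a minimal sketch. It is not the Harness implementation: the `Connector` model, the `auth_secret_manager` field, and both function names are hypothetical, and the sketch assumes each connector names at most one other connector as the store for its own authentication credential. Under those assumptions, validation follows the chain of references and rejects any configuration that loops back on itself, whether directly or through intermediate connectors.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Connector:
    """Hypothetical model of a secret manager connector (not the Harness schema)."""
    name: str
    # Assumption: name of the connector that stores this connector's own
    # authentication credential, or None if it is supplied some other way.
    auth_secret_manager: Optional[str] = None


def find_reference_cycle(start: str, connectors: Dict[str, Connector]) -> Optional[List[str]]:
    """Follow the auth chain from `start`; return the looping path if one exists."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = start
    while current is not None:
        if current in seen:
            # Report only the portion of the path that forms the loop.
            return path[path.index(current):] + [current]
        seen.add(current)
        path.append(current)
        connector = connectors.get(current)
        if connector is None:
            # Dangling reference: not a cycle, left to other validation.
            return None
        current = connector.auth_secret_manager
    return None


def validate_connector(candidate: Connector, connectors: Dict[str, Connector]) -> None:
    """Raise ValueError if saving `candidate` would create a circular reference."""
    # Check against the configuration as it would look after the save.
    proposed = {**connectors, candidate.name: candidate}
    cycle = find_reference_cycle(candidate.name, proposed)
    if cycle is not None:
        raise ValueError("circular secret manager reference: " + " -> ".join(cycle))
```

In this sketch, a connector that names itself as the store for its own credential is rejected at save time, as is a longer loop such as `a -> b -> a`. Running the check against the proposed configuration rather than the saved one means an invalid connector is never persisted.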
Our observability systems were operational and functioning normally, but they were not configured to alert on this type of issue. We will be implementing two classes of fixes across the platform:
Our incident response playbooks include triage steps for individual modules, and steps for fault isolation at the platform level, but didn’t fully cover the scope of actions needed to isolate this issue. We will enhance our playbooks to provide additional depth for platform-level triage.
We understand that the Harness platform is mission critical for our customers. We are committed to living up to our promise of reliability and availability. We are determined to learn from this incident and make the necessary improvements to meet our shared world-class standards. Your trust is of utmost importance, and we appreciate your understanding.