Pipelines that reference secrets are experiencing failures in the Prod-2 cluster.
Impact - Every pipeline step that refers to a secret will cause the pipeline execution to fail.
Incident Report for Harness
Postmortem

Incident

On December 7th, starting around 9 PM (all times UTC), Harness experienced an outage that affected pipelines in our Prod-2 environment. Specifically, NextGen CI and CD pipelines that used secrets were failing during execution. FirstGen pipelines also experienced intermittent downtime during service restart events. The incident was resolved on December 8th at 3:01 AM.

Timeline

Time | Event
Dec 7, 9:13 PM | First customer reported the issue. Triaged as likely a result of a separate ongoing incident.
Dec 7, 10:34 PM | Issue acknowledged as independent of the separate incident, and a new incident was declared.
Dec 8, 2:13 AM | Root cause identified.
Dec 8, 3:01 AM | Incident resolved.

Response

Performance degradation and execution failure issues were reported across Continuous Integration (CI) and Continuous Delivery (CD) pipelines starting at 9:13 PM on Dec 7th. A high severity incident was declared at 10:34 PM.

Root Cause

Background:

Harness uses connectors to external secret managers (e.g., Google Secret Manager or HashiCorp Vault) to resolve and store secrets used by pipelines and elsewhere in the Harness platform. External secret manager connectors require configuration, including a means to authenticate to the external secret manager.

Sequence of Events

  • A customer configured their Secret Manager connector to authenticate using a secret stored in that same Secret Manager. The problem was not apparent to either Harness or the customer at the time of the change, because validation rules did not catch the self-reference.
  • Several hours later, a pipeline was run that referenced a secret stored in that Secret Manager.
  • The pipeline execution tried to resolve the secret. Resolving it required first resolving the connector's own credential, which created a recursive loop that filled the thread pool devoted to secret resolution.
  • Thread pool exhaustion stalled secret resolution across the environment. End users experienced this stall as pipeline failures, because failed secret resolution fails a pipeline (see the sketch after this list).
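
The sketch below is illustrative only and is not Harness code: it shows why a self-referential secret chain starves a bounded resolution pool. Each resolution task blocks a worker thread while it waits for a child task submitted to the same pool, and because of the cycle the chain never terminates. The pool size and secret name are invented for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only (not Harness code): a self-referential secret
// forces each resolution task to wait on another resolution task running
// on the same bounded pool, so every worker thread ends up blocked.
public class PoolStarvationSketch {

    // Hypothetical bounded pool dedicated to secret resolution.
    private static final ExecutorService RESOLUTION_POOL = Executors.newFixedThreadPool(4);

    // Resolving the secret requires resolving it again (the cycle), so the
    // chain of blocked tasks grows until no worker threads remain.
    static Future<String> resolve(String secretRef) {
        return RESOLUTION_POOL.submit(() -> {
            // A real resolver would call the external secret manager here.
            // The self-reference means we recurse instead of returning a value.
            Future<String> inner = resolve(secretRef);
            return inner.get(); // blocks this worker thread indefinitely
        });
    }

    public static void main(String[] args) throws Exception {
        try {
            resolve("secret-manager-auth-token").get(2, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            System.out.println("Resolution pool exhausted: all workers blocked on each other.");
        } finally {
            RESOLUTION_POOL.shutdownNow(); // interrupt the stuck tasks so the demo exits
        }
    }
}
```

Once the pool is in this state, any unrelated pipeline that needs a secret also queues behind the blocked tasks, which matches the platform-wide failures described above.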

Mitigation and Remediation

Mitigation consisted of:

  1. Updating the faulty configuration to break the self-dependency.
  2. Aborting the affected in-flight pipeline executions.
  3. Scaling all replicas of the service that manages secret resolution to zero to stop the job from being picked back up by the scheduler. Note that redeploying or restarting the service did not fix the issue, because any surviving replica would instantly poison the others.

A hotfix has been released to ensure that connector configuration validation checks for self-references. A sketch of this kind of check is shown below.
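
The following is a hedged sketch of such a validation, using an assumed connector data model rather than the actual Harness hotfix. It walks the chain of "which connector resolves my credential?" and rejects the configuration if that chain loops back to the connector being saved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hedged sketch (assumed data model, not the actual Harness hotfix):
// reject a secret manager connector whose authentication credential can
// only be resolved by that same connector, directly or through a cycle.
final class ConnectorConfig {
    final String connectorId;
    final String credentialSecretRef;       // secret holding the auth credential
    final String credentialSecretManagerId; // connector that stores that secret

    ConnectorConfig(String connectorId, String credentialSecretRef, String credentialSecretManagerId) {
        this.connectorId = connectorId;
        this.credentialSecretRef = credentialSecretRef;
        this.credentialSecretManagerId = credentialSecretManagerId;
    }
}

final class ConnectorValidator {

    // Follows the chain of "which connector resolves my credential?" and
    // fails if it revisits a connector, which means the candidate depends
    // on itself to resolve its own credential.
    static void validate(ConnectorConfig candidate, Map<String, ConnectorConfig> existingConnectors) {
        Map<String, ConnectorConfig> all = new HashMap<>(existingConnectors);
        all.put(candidate.connectorId, candidate);

        Set<String> seen = new HashSet<>();
        ConnectorConfig current = candidate;
        while (current != null) {
            if (!seen.add(current.connectorId)) {
                throw new IllegalArgumentException(
                        "Connector " + candidate.connectorId
                                + " depends on itself (directly or transitively) to resolve its own credential.");
            }
            // Chains ending at a connector outside the map (e.g. a built-in
            // secret manager) terminate normally with current == null.
            current = all.get(current.credentialSecretManagerId);
        }
    }
}
```

For example, saving a connector whose credential is stored in that same connector would fail this check at configuration time, before any pipeline tries to resolve a secret through it.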

Follow-up/Action Items

  • Improve fault isolation and layering between services so that the root cause of cross-service failures is easier to detect.
  • Our observability systems were operational and functioning normally; however, they were not configured to alert on this type of issue. We will be implementing two classes of fixes across the platform:

    • 1) Log-volume-based alerting. Although this would not have identified the specific issue sooner, it would have decreased time to detection.
    • 2) Closing the loop between observability metrics and the thresholds used to alert on them. As metrics are added, alerting thresholds need to be configured at the same time and adjusted as needed, rather than creating metrics and configuring alerting in a separate workstream. An alert on thread pool saturation would have greatly reduced the incident resolution time (see the sketch after this list).
  • Our incident response playbooks include triage steps for individual modules, and steps for fault isolation at the platform level, but didn’t fully cover the scope of actions needed to isolate this issue. We will enhance our playbooks to provide additional depth for platform-level triage.
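
As a concrete illustration of the thread pool alerting mentioned above, the sketch below is illustrative only and not Harness's monitoring stack; the pool size and sampling interval are arbitrary. It samples a resolution pool and flags sustained saturation, which is the signal that would have surfaced this incident earlier.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hedged sketch (not Harness's monitoring stack): periodically sample a
// resolution thread pool and flag sustained saturation.
public class ThreadPoolSaturationCheck {
    public static void main(String[] args) {
        // Hypothetical pool dedicated to secret resolution.
        ThreadPoolExecutor resolutionPool =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(8);
        ScheduledExecutorService sampler = Executors.newSingleThreadScheduledExecutor();

        sampler.scheduleAtFixedRate(() -> {
            int active = resolutionPool.getActiveCount();
            int max = resolutionPool.getMaximumPoolSize();
            int queued = resolutionPool.getQueue().size();
            // In production these numbers would be emitted as metrics; alert when
            // the pool stays fully busy with a growing queue across samples.
            if (active >= max && queued > 0) {
                System.out.printf("ALERT: resolution pool saturated (%d/%d busy, %d queued)%n",
                        active, max, queued);
            }
        }, 0, 10, TimeUnit.SECONDS);
    }
}
```

Requiring several consecutive saturated samples before paging keeps the alert quiet during normal bursts while still catching a pool that never drains, as happened here.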

We understand that the Harness platform is mission critical for our customers. We are committed to living up to our promise of reliability and availability. We are determined to learn from this incident and make the necessary improvements to meet our shared world-class standards. Your trust is of utmost importance, and we appreciate your understanding.

Posted Dec 08, 2023 - 16:21 PST

Resolved
This incident has been resolved. The impacted components were Prod 2 - Continuous Delivery - Next Generation (CDNG) and Continuous Integration Enterprise (CIE) - Self Hosted Runners.
Posted Dec 07, 2023 - 19:18 PST
Update
We are continuing to monitor for any further issues.
Posted Dec 07, 2023 - 19:03 PST
Monitoring
The incident is now resolved. Detailed RCA to follow.
Posted Dec 07, 2023 - 19:01 PST
Update
We have identified the root cause and we are in the process of recovering.
Posted Dec 07, 2023 - 18:13 PST
Update
The secret decryption task is failing and we are looking into a recovery.
Posted Dec 07, 2023 - 18:08 PST
Update
We are rolling back the services to the previously deployed version. We will keep you updated on the progress.
Posted Dec 07, 2023 - 17:44 PST
Update
We are currently working on debugging the issue. We have identified that there may be a problem with the gRPC calls between services. We will keep you updated on the progress.
Posted Dec 07, 2023 - 16:23 PST
Update
We are currently in the process of identifying the incident. As soon as it is identified, we will provide an update.
Posted Dec 07, 2023 - 15:44 PST
Update
We are continuing to work on a fix for this issue.
Posted Dec 07, 2023 - 14:36 PST
Identified
We continue to look into the issue and are considering rolling back the latest deployment.
Posted Dec 07, 2023 - 14:35 PST
Investigating
Pipelines that reference secrets are experiencing failures in the Prod-2 cluster, starting around 9:13 PM UTC. The Harness team started looking into the issue, and a high severity incident was declared at 10:34 PM UTC.
Posted Dec 07, 2023 - 14:34 PST
This incident affected: Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Cloud Builds).