Code Module is not accessible on prod1/2/3

Incident Report for Harness

Postmortem

Summary

Between 4:20 PM and 5:16 PM EST on Thursday, March 5, 2026, the Harness Code module was unavailable in production clusters Prod1, Prod2, and Prod3. Git repositories were unreachable for the duration of the outage.

Root Cause

A surge in metric volume overwhelmed the metric collectors on the Kubernetes pods, which in turn impacted the Git pods. The StatefulSet's pods became unschedulable, and the metric collectors had to be resized to remedy the situation.

Impact

All code repositories were offline during the event across all three production clusters.

Remediation

Engineering increased the memory allocated to the metric collectors and redeployed the configuration. After redeployment, the Git pods were rescheduled and service was restored.

Action Items

To prevent a recurrence, we are implementing the following:

  • Enhance monitoring and alerting – Add health monitors for metric-gathering collectors and rebalance metric growth across the cluster.
  • Review capacity planning – Proactively monitor metric collector usage and scale them appropriately with sufficient headroom to handle spikes.
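
As an illustration of the two action items above, the following is a minimal sketch of a collector health check with a headroom-based sizing suggestion. It is not the tooling Harness uses. The CollectorSample type, the collector names, the 80% alert threshold, and the 50% headroom factor are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class CollectorSample:
    """Hypothetical memory reading for one metric collector."""
    name: str
    peak_memory_mib: float   # highest memory use observed in the sampling window
    limit_memory_mib: float  # memory limit currently configured for the collector


def utilization(sample: CollectorSample) -> float:
    """Fraction of the configured memory limit used at peak."""
    return sample.peak_memory_mib / sample.limit_memory_mib


def needs_alert(sample: CollectorSample, threshold: float = 0.8) -> bool:
    """Flag a collector whose peak usage reaches the alert threshold (assumed 80%)."""
    return utilization(sample) >= threshold


def recommended_limit_mib(sample: CollectorSample, headroom: float = 0.5) -> float:
    """Suggest a limit that leaves `headroom` above the observed peak.

    The suggestion never goes below the current limit.
    """
    return max(sample.limit_memory_mib, sample.peak_memory_mib * (1 + headroom))


# Made-up readings for two hypothetical collectors.
samples = [
    CollectorSample("collector-a", peak_memory_mib=900, limit_memory_mib=1024),
    CollectorSample("collector-b", peak_memory_mib=300, limit_memory_mib=1024),
]

for s in samples:
    if needs_alert(s):
        print(f"{s.name}: {utilization(s):.0%} of limit, "
              f"suggest raising limit to {recommended_limit_mib(s):.0f} MiB")
```

With these made-up readings, only collector-a crosses the threshold. In practice the threshold and headroom would be tuned against observed metric spikes rather than fixed in code.
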
Posted Mar 09, 2026 - 13:34 PDT

Resolved

This incident has been resolved.
Posted Mar 05, 2026 - 14:05 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 05, 2026 - 14:01 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 05, 2026 - 13:22 PST

Investigating

We are currently investigating this issue.
Posted Mar 05, 2026 - 13:21 PST
This incident affected: Prod 1 (Code Repository), Prod 2 (Code Repository), and Prod 3 (Code Repository).