All clusters experiencing feature loss or degraded functionality due to a degradation at our sub-provider

Incident Report for Harness

Postmortem

Summary

On June 12, 2025, at approximately 18:04 UTC, our internal systems triggered automatic availability alerts due to a widespread Google Cloud Platform (GCP) outage. The outage affected multiple services both regionally and globally, causing our uptime monitoring checks to fail and paging our on-call engineers. As a result, Harness SaaS offerings were degraded (please refer to our status page for detailed uptime per cluster).

Root Cause

Harness relies on Google Cloud for hosting services. On June 12, GCP experienced a major outage impacting many of its services globally, including:

  • Identity and Access Management (IAM)
  • Cloud Logging
  • Cloud Monitoring
  • BigQuery
  • Google Cloud Console
  • Cloud SQL
  • Cloud Storage
  • Compute Engine
  • Identity Platform

More details: GCP Incident Report

Additionally, one of our services (Feature Flags) took longer to recover because some clients' retry logic generated a high volume of requests that overwhelmed the service while it was recovering.
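
As an illustration of how client retries can be made gentler on a recovering service, here is a minimal sketch of capped exponential backoff with full jitter. It is not Harness SDK code and does not describe the retry logic the affected clients actually used. The function name `call_with_retries` and its parameters (`max_attempts`, `base`, `cap`, `sleep`) are hypothetical and chosen only for this example.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call operation(); on failure, wait a randomized delay and try again.

    Hypothetical helper for illustration only. The delay ceiling doubles
    after each failed attempt (base, 2*base, 4*base, ...) up to `cap`,
    and the actual wait is drawn uniformly from [0, ceiling] so that many
    clients do not retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            ceiling = min(cap, base * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

Spreading retries over a growing, randomized window is one common way to keep a burst of reconnecting clients from arriving all at once. The specific defaults above are arbitrary and would need tuning for a real service.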

Mitigation

  • Paused all internal deployment pipelines to prevent further load during the outage.
  • Implemented rate limiting to throttle client retries and protect downstream services.
  • All Harness services resumed normal operation once the GCP outage was resolved.
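
The report does not say which rate-limiting algorithm was used. As one possible illustration of throttling client requests on the server side, the sketch below shows a simple token bucket. The class name `TokenBucket` and its parameters (`rate`, `capacity`, `clock`) are assumptions made for this example and are not taken from Harness's implementation.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter, for illustration only.

    Permits `rate` requests per second on average and bursts of up to
    `capacity` requests.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed now, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A service using a limiter like this would typically reject requests for which `allow()` returns False, for example with an HTTP 429 response, so that excess retries are shed cheaply instead of reaching downstream services.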

Action Items

  1. Implement Multi-Cloud Strategy: To improve resilience and ensure continuity of service during provider-specific outages, Harness is actively working toward a multi-cloud architecture.
  2. Service-Level Rate Limiting: Rate limiting mechanisms have been fully implemented across services to ensure faster recovery and avoid overload during future incidents.
Posted Jul 11, 2025 - 12:25 PDT

Resolved

All services are fully recovered.
Posted Jun 12, 2025 - 19:54 PDT

Update

We are back online and monitoring.
Posted Jun 12, 2025 - 14:39 PDT

Monitoring

We are back online and monitoring. Feature Flags in Prod 2 is in a degraded state.
Posted Jun 12, 2025 - 14:36 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 12, 2025 - 14:20 PDT

Update

We are back online and monitoring. Feature Flags in Prod 2 is in a degraded state.
Posted Jun 12, 2025 - 13:39 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 12, 2025 - 13:06 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 12, 2025 - 12:33 PDT

Identified

The issue has been identified as originating with our sub-provider.
Posted Jun 12, 2025 - 11:56 PDT

Investigating

We are currently investigating this issue.
Posted Jun 12, 2025 - 11:51 PDT

This incident affected:

  • Prod 1: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository
  • Prod 2: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository
  • Prod 3: Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository
  • Prod 4: Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Code Repository
  • Prod Eu1: Continuous Delivery - Next Generation (CDNG), Chaos Engineering, Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Engineering Insights (SEI)