Users from Prod 1 and Prod 2 clusters were unable to access Harness Platform
Incident Report for Harness
Postmortem

Summary:

  • On December 22nd, a load test triggered by the Harness Performance Team caused a full outage of app.harness.io for approximately 9 minutes (12:36 PM UTC to 12:45 PM UTC).
  • This impacted customers in the Prod 1 and Prod 2 clusters.
  • The Prod 3 cluster was unaffected because it uses a separate instance of the affected component.

Impact:

  • Customers in the Prod 1 and Prod 2 clusters were unable to access app.harness.io for approximately 9 minutes.

Root Cause:

  • Under the high volume of traffic generated by the load test, the Kubernetes Ingress Controller, the component responsible for managing incoming requests and routing them to the correct internal services, became overloaded.
  • This caused the ingress controller to become unhealthy, leading to the outage.

Resolution:

  • The system automatically recovered without manual intervention.

Action Items:

  • Resource Scaling: We are exploring options to automatically scale the ingress controller based on demand to handle high traffic volumes more effectively.
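
As an illustration of what demand-based scaling could look like, the sketch below reproduces the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler. It does not describe the solution Harness will adopt: the metric (requests per second per replica), the target value, and the replica bounds are all assumptions chosen for the example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count following the documented Horizontal Pod Autoscaler rule:
    desired = ceil(current_replicas * current_metric / target_metric).

    All parameter values used with this function are hypothetical.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Skip scaling when the ratio is close to 1.0 (0.1 is the default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical load-test spike: 3 replicas, each seeing 900 requests per
# second against an assumed target of 300 requests per second per replica.
print(desired_replicas(3, 900, 300, max_replicas=20))
```

Under these assumed numbers the rule asks for 9 replicas. The upper bound matters as much as the formula: without a sufficiently high `max_replicas`, a traffic spike would still saturate the controller.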

We understand the importance of a reliable platform for your operations and sincerely apologize for any inconvenience caused by this incident. Our team is dedicated to ensuring the continued improvement of the Harness platform’s performance and reliability. We appreciate your trust and remain committed to providing you with a seamless experience.

Posted Dec 25, 2023 - 22:44 PST

Resolved
Downtime was observed on the Harness platform for the Prod 1 and Prod 2 clusters. The login page was inaccessible and requests returned 502 errors.
We are investigating the root cause of the issue and will post the RCA here. All functionality has now been restored.
Posted Dec 22, 2023 - 04:36 PST
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Cloud Builds, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)) and Prod 2 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Chaos Engineering, Continuous Integration Enterprise(CIE) - Cloud Builds, Continuous Integration Enterprise(CIE) - Self Hosted Runners, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM)).