Prod-3 was intermittently unavailable

Incident Report for Harness

Postmortem

Summary

On 16th April, between 10:42 AM and 11:12 AM UTC, customers experienced intermittent errors when trying to access app3.harness.io on our Prod-3 cluster. The issue was caused by a configuration change on a failover cluster in the backend ingress-controller service setup associated with app3.harness.io.

Resolution

Our monitoring system alerted us to the issue. We identified and reverted the change, which restored all functionality in the Prod-3 cluster.

RCA

As part of the preparation work for a planned Disaster Recovery (DR) activity, we introduced a new configuration in the Prod-3 cluster. This change unintentionally made the Prod-3 DR environment eligible to receive live customer traffic. Since this environment was not fully operational, some requests were returned with 503 errors.

Action Items

Enhanced monitoring of traffic going to inactive environments.
Additional safeguards in the deployment process to avoid unintentional traffic routing changes.
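
As an illustration of the kind of safeguard described in the action items, here is a minimal sketch of a pre-deployment check that rejects a routing configuration in which an inactive environment would receive live traffic. It is not the safeguard Harness implemented. The Backend type, its fields, and the backend names are hypothetical and chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    """One routing target. Hypothetical schema, for illustration only."""
    name: str
    active: bool  # True if the environment is meant to serve customers
    weight: int   # share of live traffic the proposed config assigns


def routing_violations(backends: list[Backend]) -> list[str]:
    """Return the names of inactive backends that would receive live traffic."""
    return [b.name for b in backends if not b.active and b.weight > 0]


def check_before_deploy(backends: list[Backend]) -> None:
    """Raise ValueError if the proposed config routes traffic to an inactive backend."""
    bad = routing_violations(backends)
    if bad:
        raise ValueError("inactive backends eligible for traffic: " + ", ".join(bad))


if __name__ == "__main__":
    # Hypothetical proposed config: the DR environment is inactive but weighted.
    proposed = [
        Backend("prod-3-primary", active=True, weight=100),
        Backend("prod-3-dr", active=False, weight=10),
    ]
    print(routing_violations(proposed))
```

The same routing_violations function could back an alert on live traffic counters, which would correspond to the monitoring action item.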

Posted Apr 17, 2025 - 13:45 PDT

Resolved

This incident has been resolved.

Posted Apr 16, 2025 - 04:12 PDT

Investigating

We noticed intermittent failures in our Prod-3 clusters, where requests to app3.harness.io were returning 5xx errors. The issue has been identified and is now resolved. Please monitor this incident for the postmortem report. Thanks for your patience.

Posted Apr 16, 2025 - 03:42 PDT

This incident affected: Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository).