Login issues on our Prod3 environment

Incident Report for Harness

Postmortem

Summary

The Prod3 cluster experienced downtime that prevented users from accessing the Harness UI. The impact was limited to access to Prod3; pipeline executions were not affected.

Resolution

To mitigate the issue, Harness services were auto-scaled. Additionally, rate limiting and timeouts were implemented for specific API endpoints to regulate the load. These measures effectively reduced system strain, allowing the platform to recover and resume normal operations.
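As an illustration of the rate limiting mentioned above, the following is a minimal token-bucket sketch in Python. It is not the Harness implementation: the class name TokenBucket, its parameters, and the injectable clock are all assumptions made for this example. The idea is that each protected endpoint gets a bucket that permits short bursts up to its capacity and then admits requests only at the refill rate.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical per-endpoint limiter: bursts up to `capacity`, refills at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start full so an initial burst is allowed
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # a caller would typically answer with HTTP 429 here
```

A request handler would call allow() before doing any work and reject the request when it returns False, so excess traffic is turned away cheaply instead of reaching the backend.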

Timeline

Time (UTC) | Event
March 4, 2025, 7:25 AM UTC | Investigating login issue in Prod3 environment. Prod3 cluster was under pressure and rejecting requests.
March 4, 2025, 7:30 AM UTC | Reverted system release.
March 4, 2025, 7:38 AM UTC | Changed status to monitoring. System is operating normally.

The Root Cause Analysis (RCA)

One of the core microservices in the Harness platform received a high volume of external traffic. The API endpoint under load executed a long-running analytical query, which slowed down during this period. The slowdown triggered a cascading effect across the infrastructure and made the underlying services unavailable.

As the load increased, new requests began to fail. Since the Harness UI depends on responses from backend APIs, the pages failed to load. 
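The cascading failure described above is the kind of problem that a bulkhead with a deadline is meant to contain. The sketch below is a hypothetical Python illustration, not the fix Harness deployed: AnalyticsBulkhead, Overloaded, max_in_flight, and timeout_s are names invented for this example. It caps how many slow analytical queries can run at once, rejects new ones immediately when the cap is reached, and stops waiting on a query that exceeds its deadline.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


class Overloaded(Exception):
    """Raised when analytical capacity is exhausted or a query misses its deadline."""


class AnalyticsBulkhead:
    """Hypothetical isolation wrapper for slow analytical queries."""

    def __init__(self, max_in_flight: int, timeout_s: float):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._timeout_s = timeout_s

    def run(self, query: Callable[[], T]) -> T:
        # Shed load right away instead of queueing behind slow queries.
        if not self._slots.acquire(blocking=False):
            raise Overloaded("analytical capacity exhausted")
        try:
            future = self._pool.submit(self._guarded, query)
        except BaseException:
            self._slots.release()  # the query never started, so free its slot
            raise
        try:
            return future.result(timeout=self._timeout_s)
        except FutureTimeout:
            # The worker thread keeps running; its slot stays held until it finishes.
            raise Overloaded("analytical query timed out") from None

    def _guarded(self, query: Callable[[], T]) -> T:
        try:
            return query()
        finally:
            self._slots.release()
```

Note that a thread-based timeout only stops the caller from waiting; the query itself continues until it returns, which is why the slot is released by the worker and not by the caller. A real service would also want a timeout enforced at the database or HTTP client level.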

Action Items

  1. Move analytical services to a separate endpoint to prevent such issues from impacting critical workflows.
Posted Mar 11, 2025 - 17:03 PDT

Resolved

This incident has been resolved.
Posted Mar 04, 2025 - 03:59 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 03, 2025 - 23:38 PST

Investigating

We are currently investigating a login issue in the Prod3 environment.
Posted Mar 03, 2025 - 23:25 PST
This incident affected: Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository).