The Prod3 cluster experienced downtime, preventing users from accessing the Harness UI. Only UI access to Prod3 was affected; pipeline executions were not impacted.
To mitigate the issue, Harness services were auto-scaled. Additionally, rate limiting and timeouts were implemented for specific API endpoints to regulate the load. These measures effectively reduced system strain, allowing the platform to recover and resume normal operations.
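The shape of these mitigations can be pictured with a minimal sketch. This is illustrative only, not actual Harness code: the guarded endpoint, the 50 requests-per-second limit, and the 5-second timeout are assumed values.

```java
// Illustrative sketch of per-endpoint rate limiting plus a request timeout.
// All names and limits are assumptions, not Harness internals.
import java.util.concurrent.*;

public class EndpointGuard {
    // Simple token bucket: refills to `capacity` tokens once per second.
    static final class TokenBucket {
        private final int capacity;
        private int tokens;
        private long lastRefillMillis = System.currentTimeMillis();

        TokenBucket(int permitsPerSecond) {
            this.capacity = permitsPerSecond;
            this.tokens = permitsPerSecond;
        }

        synchronized boolean tryAcquire() {
            long now = System.currentTimeMillis();
            if (now - lastRefillMillis >= 1000) {
                tokens = capacity;                 // refill the bucket every second
                lastRefillMillis = now;
            }
            if (tokens > 0) {
                tokens--;
                return true;
            }
            return false;                          // over the limit: reject the request
        }
    }

    private final TokenBucket limiter = new TokenBucket(50);      // 50 req/s (assumed)
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Runs the slow analytical query with a hard timeout so one endpoint
    // cannot hold request threads indefinitely.
    String handleAnalyticsRequest(Callable<String> slowQuery) throws Exception {
        if (!limiter.tryAcquire()) {
            return "429 Too Many Requests";        // shed load instead of queueing
        }
        Future<String> result = pool.submit(slowQuery);
        try {
            return result.get(5, TimeUnit.SECONDS); // 5 s timeout (assumed)
        } catch (TimeoutException e) {
            result.cancel(true);                    // free the worker thread
            return "504 Gateway Timeout";
        }
    }
}
```

Rejecting excess requests early and capping how long a slow query can occupy a worker are what relieve pressure on the rest of the cluster.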
| Time (UTC) | Event |
|---|---|
| March 4, 2025, 7:25 AM | Investigating login issue in the Prod3 environment. The Prod3 cluster was under pressure and rejecting requests. |
| March 4, 2025, 7:30 AM | Reverted the system release. |
| March 4, 2025, 7:38 AM | Changed status to monitoring. The system is operating normally. |
One of the core microservices in the Harness platform was receiving a high volume of external traffic. The API endpoint under load executed a long-running analytical query, which slowed down during this period. The slowdown triggered a cascading effect across the infrastructure, making underlying services unavailable.
As the load increased, new requests began to fail. Since the Harness UI depends on responses from backend APIs, the pages failed to load.
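A minimal simulation of this cascading effect, assuming a small shared worker pool and a 30-second analytical query (both hypothetical values, not Harness internals):

```java
// Illustrative only: once every worker thread is blocked on the slow
// analytical query, even fast calls that back the UI stop getting responses.
import java.util.concurrent.*;

public class CascadeDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService apiWorkers = Executors.newFixedThreadPool(4); // small shared pool

        // Flood the pool with slow analytical queries (assumed 30 s each).
        for (int i = 0; i < 4; i++) {
            apiWorkers.submit(() -> {
                Thread.sleep(30_000);              // long-running analytical query
                return "analytics result";
            });
        }

        // A normally fast UI-backing call now has no free worker to run on.
        Future<String> uiCall = apiWorkers.submit(() -> "dashboard data");
        try {
            System.out.println(uiCall.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // This is what the user experiences: the UI page fails to load.
            System.out.println("UI request timed out: backend workers exhausted");
        }
        apiWorkers.shutdownNow();
    }
}
```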