Intermittent connection errors on Prod4

Incident Report for Harness

Postmortem

Summary

On Friday 7 Mar 2025, the Prod4 cluster experienced a disruption when the Global Gateway service stopped serving incoming requests. The incident was caused by a configuration mismatch during a planned version upgrade. Service was fully restored after approximately 12 minutes of disruption, of which 7 minutes were full downtime and 5 minutes were partial service disruption.

Resolution

The team identified the configuration mismatch and reverted to the previous configuration settings. After the Global Gateway pods were restarted, the system recovered and normal service was restored.

RCA

During a planned upgrade from version 1.16.0 to version 1.17.2 of the Global Gateway service, a procedural error caused the new configuration intended for version 1.17.2 to be deployed while the older version 1.16.0 was still running in production. The older version was incompatible with the new configuration parameters, causing the service to stop responding to requests.

Action Items

  1. Enhanced Deployment Oversight and Controls: Implement additional validation checks in the deployment pipeline to verify version compatibility with configuration changes.
  2. Improved Architecture Resilience: Accelerate our planned architecture improvements to make the system more resilient to configuration changes and prevent similar failures in the future.
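As an illustration of the validation check described in the first action item, here is a minimal sketch of a pre-deployment gate. The report does not describe how the pipeline will actually implement this, so everything below is an assumption: the `ConfigBundle` type, its `min_service_version` field, and the `check_compatibility` function are hypothetical names, and the version comparison assumes plain dotted numeric versions such as 1.16.0 and 1.17.2.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '1.17.2' into (1, 17, 2) for comparison."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class ConfigBundle:
    # Hypothetical metadata: the oldest service version able to read this config.
    name: str
    min_service_version: str


def check_compatibility(running_version: str, config: ConfigBundle) -> None:
    """Raise if the config targets a newer service version than the one running."""
    if parse_version(running_version) < parse_version(config.min_service_version):
        raise RuntimeError(
            f"{config.name} requires service >= {config.min_service_version}, "
            f"but {running_version} is running; refusing to deploy"
        )


if __name__ == "__main__":
    new_config = ConfigBundle(name="gateway-config", min_service_version="1.17.2")

    # Config and service versions match: the check passes silently.
    check_compatibility("1.17.2", new_config)

    # The scenario from the RCA: new config, old service still running.
    try:
        check_compatibility("1.16.0", new_config)
    except RuntimeError as err:
        print(f"blocked: {err}")
```

A gate of this kind would run before the configuration is applied, so a mismatch like the one in the RCA fails the pipeline step instead of reaching production.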

Our team is committed to implementing these improvements.

Posted Mar 11, 2025 - 00:15 PDT

Resolved

This incident has been resolved.
Posted Mar 07, 2025 - 17:00 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 07, 2025 - 16:59 PST

Identified

Global Gateway is intermittently encountering connection errors in Prod 4.
Posted Mar 07, 2025 - 16:54 PST

Investigating

Global Gateway is intermittently encountering connection errors in Prod 4.
Posted Mar 07, 2025 - 16:53 PST
This incident affected: Prod 4 (Continuous Delivery - Next Generation (CDNG)).