On Friday 7 Mar 2025, the Global Gateway service on the Prod4 cluster stopped serving incoming requests. The incident was caused by a configuration mismatch during a planned version upgrade. Service was fully restored after approximately 12 minutes, of which 7 minutes were full downtime and 5 minutes were partial service disruption.
The team identified the configuration mismatch and reverted to the previous configuration settings. After the Global Gateway pods were restarted, the system recovered and normal service was restored.
During a planned upgrade from version 1.16.0 to version 1.17.2 of the Global Gateway service, a procedural error caused the new configuration intended for version 1.17.2 to be deployed while the older version 1.16.0 was still running in production. The older version was incompatible with the new configuration parameters, causing the service to stop responding to requests.
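To make the failure mode concrete, the following is a minimal sketch of a pre-deployment guard that refuses to apply a configuration when its target version differs from the version currently running. It is an illustration only, not the team's actual tooling: the function names `config_matches_running` and `apply_config_guard` are hypothetical, and it assumes that both the configuration and the running service expose a plain `major.minor.patch` version string.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.strip().split("."))
    return (major, minor, patch)


def config_matches_running(config_target: str, running: str) -> bool:
    """Return True only if the config targets exactly the running version."""
    return parse_version(config_target) == parse_version(running)


def apply_config_guard(config_target: str, running: str) -> None:
    """Raise before deployment if the config and running version disagree.

    Hypothetical guard: a real pipeline would call this before pushing
    configuration and abort the rollout on failure.
    """
    if not config_matches_running(config_target, running):
        raise RuntimeError(
            f"Configuration targets version {config_target}, "
            f"but version {running} is running; refusing to apply."
        )


if __name__ == "__main__":
    # The scenario from this incident: 1.17.2 config, 1.16.0 still running.
    try:
        apply_config_guard("1.17.2", "1.16.0")
    except RuntimeError as err:
        print(f"Blocked: {err}")
```

Under these assumptions, a check of this kind would have rejected the 1.17.2 configuration while 1.16.0 was still serving traffic. An exact-match rule is deliberately strict; a looser compatibility rule would need to be derived from the service's own compatibility guarantees.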
Our team is committed to implementing these improvements to prevent similar incidents in the future.