Pipelines in Prod1 were experiencing intermittent failures caused by gRPC connection issues between Harness services. Most of the failed gRPC requests occurred between the CI Manager and the Harness Manager (CG), so the impact was primarily on CI pipelines.
| Time | Event |
| --- | --- |
| May 28, 9:50 AM PDT | A customer reported intermittent pipeline failures. The team began an investigation but did not identify any infrastructure issues, and the failures appeared to be isolated to this one customer. Teams were promptly alerted to monitor this customer's pipelines. |
| May 28, 3:05 PM PDT | The engineering team observed additional occurrences of the issue across other pipelines. |
| May 28, 3:30 PM PDT | The team decided to roll back a recent deployment to investigate any potential correlation. |
| May 28, 5:15 PM PDT | The Prod1 environment status was updated to "degraded performance" due to the intermittent issues. |
| May 28, 6:20 PM PDT | The issue was suspected to be related to kube-dns resolution, causing some gRPC requests to fail randomly. GCP support was engaged for further investigation, and service thread dumps were captured for internal debugging; the dumps revealed nothing suspicious. |
| May 28, 7:00 PM PDT | Status was changed to "monitoring". |
Pipelines failed because internal service-to-service gRPC calls continued to fail despite multiple retry attempts. The engineering team initially suspected kube-dns problems; however, after consulting with GCP support, this was ruled out. It was observed that some service pod replicas were receiving an uneven share of requests. To address this, the following corrective actions are currently underway:
Furthermore, we have set up gRPC-related alerts to detect similar situations in the future.
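A common cause of the kind of imbalance noted above is that gRPC clients hold long-lived HTTP/2 connections, so when calls go through a single ClusterIP, all traffic from a client can stay pinned to whichever replica it first connected to. As a purely illustrative sketch (assuming grpc-java clients and a Kubernetes headless service; the class, target name, and port below are hypothetical and not Harness's actual configuration), client-side round-robin load balancing can be enabled like this:

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public final class ManagerChannelFactory {

    // Hypothetical target: a headless Kubernetes Service for the manager,
    // so DNS resolution returns every pod IP rather than a single ClusterIP.
    private static final String MANAGER_TARGET =
            "dns:///harness-manager-headless.harness.svc.cluster.local:9879";

    public static ManagedChannel newChannel() {
        return ManagedChannelBuilder.forTarget(MANAGER_TARGET)
                // Spread RPCs across all resolved backends instead of pinning
                // every call to the first connection that was established.
                .defaultLoadBalancingPolicy("round_robin")
                // Drop idle connections periodically so newly scheduled pods
                // are picked up on the next name resolution.
                .idleTimeout(5, TimeUnit.MINUTES)
                .usePlaintext()
                .build();
    }

    private ManagerChannelFactory() {
    }
}
```

With a headless service, DNS returns the addresses of every pod, and the round_robin policy distributes RPCs across those connections rather than concentrating them on a single replica.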