Intermittent pipeline timeouts
Incident Report for Harness
Postmortem

What was the issue?

Pipelines in Prod1 were experiencing intermittent failures caused by gRPC connection issues between Harness services. The majority of the failed gRPC requests occurred between the CI Manager and Harness Manager (CG), so CI pipelines were the most heavily impacted.

Timeline

Time / Event
May 28, 9:50 AM PDT - A customer reported intermittent pipeline failures. The team initiated an investigation but did not identify any issues with the infrastructure. The failures appeared to be isolated to a single customer, and teams were promptly alerted to monitor that customer's pipelines.
May 28, 3:05 PM PDT - The engineering team observed additional occurrences of the issue across other pipelines.
May 28, 3:30 PM PDT - The team decided to roll back a recent deployment to investigate any potential correlation.
May 28, 5:15 PM PDT - The status of the Prod1 environment was updated to "degraded performance" due to the intermittent issues.
May 28, 6:20 PM PDT - The issue was suspected to be related to kube-dns resolution, causing some gRPC requests to fail randomly. GCP support was engaged for further investigation. Service thread dumps were also captured for internal debugging; they revealed nothing suspicious.
May 28, 7:00 PM PDT - Status changed to "monitoring".

RCA and Action Items:

Pipelines failed because gRPC calls between internal services did not succeed, even after multiple retry attempts. The Engineering team initially suspected kube-dns problems; however, after consulting with GCP support, this was ruled out. It was noted that certain service pod replicas were receiving an uneven share of requests. To address this, the following corrective actions are currently underway:

  1. Enhancing load balancing for gRPC calls across service pods (a sketch of this approach follows the list).
  2. Incorporating a traceId in delegate task submissions.
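
For context on action item 1: with Kubernetes' default Service routing, a gRPC client keeps one long-lived HTTP/2 connection, so calls tend to land on a single pod and replicas receive uneven load. A common remedy is per-call, client-side balancing over a headless Service. The grpc-java sketch below illustrates that approach, along with an illustrative trace-ID header for action item 2. It is a minimal sketch, not Harness's actual implementation; the "ci-manager-headless" service name, port, and "x-trace-id" header are hypothetical placeholders.

    // Hypothetical sketch, not Harness's actual code. Assumes a headless
    // Kubernetes Service (e.g. "ci-manager-headless") so the DNS resolver
    // returns every pod IP and the channel can balance per call.
    import io.grpc.Channel;
    import io.grpc.ClientInterceptors;
    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import io.grpc.Metadata;
    import io.grpc.stub.MetadataUtils;

    public final class GrpcChannelFactory {

        // Builds a channel that round-robins each RPC across all resolved pod IPs,
        // instead of pinning every call to one long-lived HTTP/2 connection.
        public static ManagedChannel newBalancedChannel(String host, int port) {
            return ManagedChannelBuilder
                    .forTarget("dns:///" + host + ":" + port)
                    .defaultLoadBalancingPolicy("round_robin")
                    .usePlaintext() // assumes in-cluster traffic; use TLS otherwise
                    .build();
        }

        // Attaches a trace ID header to every RPC made through the returned channel,
        // so a delegate task submission can be followed across services.
        public static Channel withTraceId(ManagedChannel channel, String traceId) {
            Metadata headers = new Metadata();
            Metadata.Key<String> traceKey =
                    Metadata.Key.of("x-trace-id", Metadata.ASCII_STRING_MARSHALLER);
            headers.put(traceKey, traceId);
            return ClientInterceptors.intercept(
                    channel, MetadataUtils.newAttachHeadersInterceptor(headers));
        }
    }

A generated stub built on the returned Channel would then spread requests evenly across the target service's replicas and carry the trace ID on every call.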

Furthermore, we have set up gRPC-related alerts to help us detect and address similar situations in the future.

Posted Jun 04, 2024 - 16:36 PDT

Resolved
We can confirm normal operation. Get Ship Done!
We will continue to monitor and ensure stability.
Posted May 28, 2024 - 23:06 PDT
Monitoring
Harness service issues have been addressed and normal operations have been resumed. We are monitoring the service to ensure normal performance continues.
Posted May 28, 2024 - 19:01 PDT
Identified
We have identified that the intermittent service disruption is limited to a small number of customers. Our team is actively engaged in investigating the root cause of the issue and is working diligently to restore full functionality as quickly as possible.
We understand the importance of service reliability and the inconvenience this may cause for our affected customers.
We apologize for any disruption this may have caused and appreciate your patience during this time. We will provide further updates as soon as we have more information or when the issue has been fully resolved.
Posted May 28, 2024 - 18:21 PDT
Investigating
Current Status: The team is actively investigating intermittent timeouts occurring in our pipeline services. Engineers are working to identify the root cause and implement a fix.
User Impact: Some users may experience delays or failures when trying to access or use pipeline-related functionality. Not all users are impacted.
We apologize for any inconvenience caused by these pipeline service timeouts. The team is treating this as a high priority issue and is working diligently to restore full performance and stability as quickly as possible. Your patience is appreciated as we work through this matter.
Posted May 28, 2024 - 17:15 PDT
This incident affected: Prod 1 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Mac Cloud Builds).