Queue-Service is impacted for Prod3 customers

Incident Report for Harness

Postmortem

Summary

We encountered issues with the Queue Service: bidirectional GitX webhooks remained in the queued status, and Git changes were not reflected in Harness.

Timeline

Time (UTC) | Event
Nov 18, 2024 - 12:48 PM | The customer reported that bidirectional GitX webhooks were stuck in the queued status.
Nov 18, 2024 - 12:49 PM | The team analysed the monitoring data and service logs and observed Redis connectivity issues after the deployment.
Nov 18, 2024 - 01:05 PM | DBRE was engaged, and the credentials were rotated.

Immediate Resolution

We updated the production Redis configuration and performed a configuration deployment.

RCA

Redis errors affected the Queue Service, which caused bidirectional GitX webhooks to remain queued instead of being processed. The errors began at redeployment, when an incorrect configuration was pushed during a Redis credential rotation, temporarily disrupting the Queue Service. Connectivity remained intact until the alert was received; the alert then prompted an update to the credentials.

Action Items

We have implemented monitoring for the queued status of customer webhooks to help prevent similar issues in the future.
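
As an illustration only, a check for webhooks that stay queued too long could look like the sketch below. It is not the monitoring that was actually deployed: the WebhookEvent record, its field names, the "QUEUED" status value, and the 5-minute threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class WebhookEvent:
    """Hypothetical webhook record; field names are illustrative only."""
    event_id: str
    status: str            # assumed values, e.g. "QUEUED" or "PROCESSED"
    enqueued_at: datetime  # timezone-aware UTC timestamp


def find_stuck_webhooks(events, now, max_queued=timedelta(minutes=5)):
    """Return events that have been in QUEUED for longer than max_queued."""
    return [
        e for e in events
        if e.status == "QUEUED" and now - e.enqueued_at > max_queued
    ]


# Example data: one webhook queued for 12 minutes, one for 1 minute,
# and one that has already been processed.
now = datetime(2024, 11, 18, 12, 48, tzinfo=timezone.utc)
events = [
    WebhookEvent("a", "QUEUED", now - timedelta(minutes=12)),
    WebhookEvent("b", "QUEUED", now - timedelta(minutes=1)),
    WebhookEvent("c", "PROCESSED", now - timedelta(minutes=30)),
]

stuck = find_stuck_webhooks(events, now)
print([e.event_id for e in stuck])  # ['a']
```

An alert raised whenever this list is non-empty would flag a stalled queue without waiting for a customer report. The threshold is a trade-off: a shorter window detects stalls sooner but is more likely to fire during normal bursts.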

Posted Jan 13, 2025 - 02:03 PST

Resolved

The issue has been resolved. We apologize for the inconvenience and will share the root cause analysis (RCA) shortly.
Posted Nov 18, 2024 - 05:05 PST

Identified

Git-backed entities will have stale data, which will impact pipeline executions. As a workaround, we recommend disabling the webhook until further notice.
Posted Nov 18, 2024 - 04:48 PST
This incident affected: Prod 3 (Continuous Delivery - Next Generation (CDNG)).