Chaos Engineering users in Prod2 cluster may be experiencing issues

Incident Report for Harness

Postmortem

Incident Summary:

The Chaos dashboard on the Prod2 cluster was inaccessible to users, returning a 404 error code when attempting to access it.

Timeline

Time (UTC) Event
November 5, 6:58 AM An alert was triggered.
November 5, 7:08 AM We identified that the issue was related to ingress.
November 5, 7:16 AM The ingress class issue caused by the helm migration was fixed and started monitoring.
November 5, 7:19 AM The incident has been successfully resolved.

Root Cause Analysis:

During the Helm switchover, all module ingresses were duplicated under a new Ingress Class. Automation was created to ensure that each service's ingress rule was duplicated. However, the Chaos ingress rules followed an older method of implementing the Ingress Class which uses annotations, which was missed by the automation.

Immediate Resolution:

The Chaos ingress rules were recreated with valid Ingress Class.

Action Items:

  • Update the Chaos ingress rules to use the new method of associating the Ingress Class using Key Value Pair.
  • Enhance the post-migration validation framework to include automated checks for data accuracy and integrity.
Posted Jan 09, 2025 - 20:25 PST

Resolved

This incident has been resolved. We regret the inconvenience that this might have caused.
Posted Nov 04, 2024 - 23:19 PST

Monitoring

The fix has been put in and we are monitoring the system at the moment.
Posted Nov 04, 2024 - 23:16 PST

Identified

The issue has been identified and the team is actively working to fix it
Posted Nov 04, 2024 - 23:08 PST

Investigating

We are currently investigating an issue where Chaos Engineering customers in Prod2 may be experiencing issues.
Posted Nov 04, 2024 - 23:06 PST
This incident affected: Prod 2 (Chaos Engineering).