Feature Flags metrics service down on Prod1
Incident Report for Harness
Postmortem

Summary

Requests to https://events.ff.harness.io where failing for all customers in Prod1 accounts, preventing metrics data from being updated.

What was the issue?

Testing being preformed on the cluster resulted in unexpected behaviour on one of the load balancing backend instances that is used to route traffic for the Feature Flag metrics service.

Resolution

Load balancer backend was re-synced with the running applications, and the traffic resumed.

Timeline

Time(UTC) Event
13 Nov 14:55 Testing service was brought down, causing a cascading tear down of the backend instance
13 Nov 14:56 On call engineer alerted to the traffic errors on Prod1
13 Nov 15:02 Issue identified, and team started to determine the cause
13 Nov 15:10 Issue was fixed, and system sync started
13 Nov 15:14 Traffic resumed

RCA
On Nov 13, 2024, between 14:55 and 15:14, traffic going into the Feature Flag metrics service on Prod1 received a 500 error.

Action Items

  • Review policy checks on services, to ensure no load balancer backends have more than one label associated
Posted Nov 13, 2024 - 10:10 PST

Resolved
Between 14:55 UTC and 15:14 UTC the Feature Flag service on Prod1 experienced an outage on the metrics service.

Customers sending metrics data during this window will have received an 500 error

We have identified the issue and a fix has been applied.

An RCA will follow shortly
Posted Nov 13, 2024 - 07:30 PST