FME SDK is experiencing elevated error rates for impressions and events

Incident Report for Harness

Postmortem

Summary

On March 17, 2026, FME event and impression ingestion experienced significant degradation, resulting in elevated latency and error rates. The impact was traced to degraded performance in the underlying shared infrastructure used for event processing.

Root Cause

An unexpected surge in traffic put stress on the shared infrastructure used for event processing, which degraded ingestion performance.

Impact

SDKs sending impressions and events experienced elevated error logging and continued to retry. Retry behavior differed by SDK, because each SDK’s retry policy is designed and tailored to its runtime environment to avoid any application impact. In some scenarios, events and impressions may have been lost if they were not successfully delivered within the SDK’s specific retry policy. There was no impact to our control plane services, and feature flag delivery and evaluations continued to work without disruption.
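
To illustrate how a client-side retry policy can lead to lost events during a prolonged ingestion outage, here is a minimal sketch. It is not the behavior of any particular FME SDK: the class name `RetryingSender`, the `post` callback, the buffer size, the attempt count, and the backoff values are all assumptions chosen for illustration.

```python
import random
import time
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]


class RetryingSender:
    """Hypothetical sender: bounded buffer plus bounded retries with backoff."""

    def __init__(
        self,
        post: Callable[[List[Event]], bool],  # assumed transport; True on success
        max_buffered: int = 1000,
        max_attempts: int = 3,
        base_delay_s: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._post = post
        # A deque with maxlen silently evicts the oldest item when full,
        # so memory use stays bounded even if ingestion is unavailable.
        self._buffer: Deque[Event] = deque(maxlen=max_buffered)
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._sleep = sleep
        self.dropped = 0  # events lost to eviction or exhausted retries

    def track(self, event: Event) -> None:
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1  # the oldest event is about to be evicted
        self._buffer.append(event)

    def flush(self) -> bool:
        """Try to deliver the buffered batch; give up after max_attempts."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        for attempt in range(self._max_attempts):
            if self._post(batch):
                self._buffer.clear()
                return True
            if attempt < self._max_attempts - 1:
                # Exponential backoff with jitter to avoid synchronized retries.
                delay = self._base_delay_s * (2 ** attempt)
                self._sleep(delay + random.uniform(0, delay))
        # Retries exhausted: drop the batch rather than block the application.
        self.dropped += len(batch)
        self._buffer.clear()
        return False
```

In this sketch the application is never blocked indefinitely, at the cost of dropping data once the buffer fills or the retry budget runs out. Real SDKs may make different trade-offs depending on their runtime environment.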

Mitigation

To mitigate, we immediately increased capacity to handle the bursty traffic.

Action Items

To prevent such issues from happening again, we are working on the following items:

  1. Evaluate and enforce per-customer rate limits.
  2. Improve auto-scaling and on-demand scale-up of network infrastructure.
  3. Improve the resiliency of the ingestion layer.
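
As an illustration of the first action item, a per-customer rate limit is often implemented as a token bucket keyed by customer. The sketch below is one possible approach under stated assumptions, not a description of the planned implementation: the class name `PerCustomerRateLimiter`, the `rate_per_s` and `burst` parameters, and the in-memory bucket storage are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Bucket:
    tokens: float
    updated_at: float


class PerCustomerRateLimiter:
    """Hypothetical token-bucket limiter with one bucket per customer."""

    def __init__(
        self,
        rate_per_s: float,   # sustained requests per second per customer
        burst: float,        # maximum tokens a bucket can hold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._rate = rate_per_s
        self._burst = burst
        self._clock = clock
        self._buckets: Dict[str, _Bucket] = {}

    def allow(self, customer_id: str, cost: float = 1.0) -> bool:
        """Return True if the request fits within the customer's budget."""
        now = self._clock()
        bucket = self._buckets.get(customer_id)
        if bucket is None:
            # New customers start with a full bucket.
            bucket = _Bucket(tokens=self._burst, updated_at=now)
            self._buckets[customer_id] = bucket
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - bucket.updated_at
        bucket.tokens = min(self._burst, bucket.tokens + elapsed * self._rate)
        bucket.updated_at = now
        if bucket.tokens >= cost:
            bucket.tokens -= cost
            return True
        return False
```

With a scheme like this, a traffic surge from one customer exhausts only that customer's bucket, so other customers' requests continue to be admitted.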
Posted Mar 18, 2026 - 15:59 PDT

Resolved

This incident has been resolved.
Posted Mar 17, 2026 - 12:33 PDT

Monitoring

We are now monitoring the results.
Posted Mar 17, 2026 - 10:05 PDT

Investigating

The issue started around 6:45 AM PT, and the team is currently investigating.
Posted Mar 17, 2026 - 09:36 PDT
This incident affected: Prod 3 (FME), Prod 1 (FME), and Prod 2 (FME).