Split FME outbound impression integrations are delayed.

Incident Report for Harness

Postmortem

Summary

  • Between Dec 28, 2025 00:04 UTC and Jan 8, 2026 11:48 UTC, outbound impressions integration data was delayed to varying degrees.
  • Amazon S3 integrations were impacted from Dec 28 through Jan 8, with delays reaching up to 36 hours at peak.
  • Amplitude, Segment, and custom webhook integrations were impacted from Jan 2 through Jan 7, with delays reaching up to 14 hours at peak.
  • A small number of customers experienced data loss due to rate limiting at their destination during recovery; the vast majority of customers received all their impressions data.

Root Cause

Significant increases in impressions volume caused our integration pipelines to reach their maximum throughput capacity. The S3 integration encountered volume growth that exceeded its processing capacity, while the Amplitude, Segment, and webhook integrations faced similar throughput constraints as traffic continued to increase.
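
To illustrate why delays grow once a pipeline is at its throughput ceiling, the relationship can be sketched as simple arithmetic: while arrivals exceed capacity, the backlog grows by the difference, and the newest data waits roughly the backlog divided by capacity. The function name, parameters, and numbers below are hypothetical and chosen only for illustration; they are not measurements from this incident.

```python
def backlog_delay_hours(arrival_rate: float, capacity: float, hours_overloaded: float) -> float:
    """Approximate delay (in hours) for the newest item after sustained overload.

    arrival_rate and capacity are in items per hour (hypothetical units).
    Assumes constant rates and first-in, first-out processing.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # The backlog only accumulates while arrivals outpace capacity.
    backlog = max(0.0, (arrival_rate - capacity) * hours_overloaded)
    # Time needed to work through the backlog at full capacity.
    return backlog / capacity


# Illustrative numbers only: 20% over capacity for 10 hours.
print(backlog_delay_hours(arrival_rate=120, capacity=100, hours_overloaded=10))
```

Under these simplifying assumptions, even a modest sustained overload produces a delay that keeps growing until capacity is raised or volume drops.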

Impact

  • Outbound impressions data to S3, Amplitude, Segment, and custom webhook destinations was delayed.
  • Customers using these integrations would have seen data arrive later than expected.

What was not impacted?

  • SDK feature flag evaluations and targeting
  • FME flag delivery network
  • Events integrations
  • Admin API and UI access
  • Customer flag configuration data

Remediation

For S3 integrations, we reordered and regrouped jobs to prioritize larger integrations, allowing them more time to complete. For Amplitude, Segment, and webhook integrations, we increased throughput through configuration changes within the data pipeline.
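
The general idea of scheduling larger jobs first can be sketched as follows. This is a minimal illustration of largest-first greedy assignment, not the production scheduler; the `ExportJob` type, its fields, and `plan_batches` are hypothetical names introduced for this example.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportJob:
    integration_id: str  # hypothetical identifier for a customer integration
    estimated_rows: int  # hypothetical size estimate used for ordering


def plan_batches(jobs: list[ExportJob], worker_count: int) -> list[list[ExportJob]]:
    """Assign jobs largest-first, each to the currently least-loaded worker."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    # Min-heap of (assigned_rows, worker_index), so the lightest worker pops first.
    loads = [(0, i) for i in range(worker_count)]
    heapq.heapify(loads)
    plan: list[list[ExportJob]] = [[] for _ in range(worker_count)]
    # Sorting by size, descending, starts the biggest jobs earliest in the window.
    for job in sorted(jobs, key=lambda j: j.estimated_rows, reverse=True):
        load, idx = heapq.heappop(loads)
        plan[idx].append(job)
        heapq.heappush(loads, (load + job.estimated_rows, idx))
    return plan


jobs = [ExportJob("a", 10), ExportJob("b", 500), ExportJob("c", 40), ExportJob("d", 300)]
for worker, batch in enumerate(plan_batches(jobs, worker_count=2)):
    print(worker, [job.integration_id for job in batch])
```

Starting the largest jobs first gives them the most time to finish, while smaller jobs fill in behind them on whichever worker has the least assigned work.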

Action Items

  • Rebuild webhook integration architecture: We are implementing a new architecture for Amplitude, Segment, and webhook integrations that provides better isolation from noisy neighbors and higher maximum throughput.
  • Improve S3 batch processing: We are separating batch workloads to prevent a single slow job from delaying others, with prioritization now in place for larger jobs.
  • Enhanced monitoring and alerting: New alerts have been deployed for both systems to ensure engineering teams engage with delays earlier, enabling faster recovery.
Posted Jan 12, 2026 - 11:10 PST

Resolved

This incident has been resolved.
Posted Jan 08, 2026 - 08:47 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 08, 2026 - 08:40 PST

Update

Issues with Amplitude, Segment, and custom webhooks are fully resolved as of 9:15 PM PT.
Posted Jan 07, 2026 - 11:57 PST

Identified

For customers who use webhooks and other third-party integrations (Amplitude, Segment, et al.), we are now in recovery mode as well.
Posted Jan 06, 2026 - 14:07 PST

Update

For customers who use S3 for impressions integrations, we are seeing a partial data delay. The system is currently in recovery mode, and data is being processed and backfilled. No data loss is expected, and we expect full recovery within 24 hours.

For customers who use webhooks and other third-party integrations (Amplitude, Segment, et al.), we are testing mitigations and will follow up with an ETA.
Posted Jan 06, 2026 - 13:26 PST

Investigating

We are investigating the issue.
Posted Jan 06, 2026 - 13:18 PST
This incident affected: Feature Management & Experimentation (FME) (Integrations).