Traceable - Delay in protection events in APAC region

Incident Report for Harness

Postmortem

Summary

Some customers using the Protection module in the APAC cluster experienced significant delays in anomaly detection. Trace processing was delayed by 4–5 hours, so anomaly events arrived late. The issue was isolated to the APAC cluster; all other product areas and clusters continued to function normally.

Root Cause

The configuration service that the Anomaly Detector depends on repeatedly restarted after running out of memory (OOM), which prevented the detector from fetching the configuration it needs to run. As a result, the detector stopped processing new data and a backlog built up. Once the configuration service was stabilized, processing resumed, but extra capacity was needed to clear the backlog.

Impact

A subset of customers using the Protection module in the APAC cluster experienced delayed anomaly detections, with results appearing 4–5 hours later than expected. No data loss occurred, and no other clusters or product areas were affected.

Remediation

  • Increased memory allocation for the configuration service to stop OOM restarts, and redeployed the dependent service to dedicated, higher-capacity servers to accelerate processing and clear the backlog.
  • Updated resource provisioning for the configuration service to ensure sufficient capacity and prevent future OOM conditions.

Action Items

To prevent this from happening again, we are:

  • Improving system behavior so processing can continue using the last known configuration when the configuration service is unavailable.
  • Adjusting resource provisioning to ensure the configuration service has enough capacity to avoid future failures.
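
As an illustration of the last-known-configuration fallback described above, here is a minimal Python sketch. It is not the actual implementation: the class name `CachedConfigProvider`, the `fetch` callable, and the `max_stale_seconds` limit are all hypothetical, and the staleness cap is an added assumption so that a very old configuration is not used indefinitely.

```python
import time
from typing import Callable, Optional


class CachedConfigProvider:
    """Hypothetical wrapper: serve fresh config when possible, else the last good copy."""

    def __init__(
        self,
        fetch: Callable[[], dict],          # assumed call to the configuration service
        max_stale_seconds: float = 3600.0,  # assumed limit on how old a fallback may be
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last_good: Optional[dict] = None
        self._last_good_at: float = 0.0

    def get(self) -> dict:
        try:
            config = self._fetch()
        except Exception:
            # Fetch failed: fall back only if a cached copy exists and is recent enough.
            if self._last_good is None:
                raise
            if self._clock() - self._last_good_at > self._max_stale:
                raise
            return self._last_good
        # Fetch succeeded: remember this copy for future fallbacks.
        self._last_good = config
        self._last_good_at = self._clock()
        return config
```

With a wrapper like this, a processing loop would call `get()` on each cycle and keep working through short configuration-service outages, failing only when no usable copy exists.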

To be proactive and react faster, we are:

  • Adding monitoring to detect when configuration fetches repeatedly fail or when processing begins to stall.
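
The two conditions named above (repeated configuration fetch failures and stalled processing) could be tracked along these lines. This is a rough Python sketch under assumed names and thresholds: `PipelineHealthMonitor`, the alert strings, and the default limits are hypothetical and do not reflect the monitoring actually deployed.

```python
import time
from typing import Callable, List


class PipelineHealthMonitor:
    """Hypothetical in-process monitor for fetch failures and processing stalls."""

    def __init__(
        self,
        max_consecutive_failures: int = 3,  # assumed alert threshold
        max_idle_seconds: float = 300.0,    # assumed stall threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_consecutive_failures
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._failures = 0
        self._last_processed_at = clock()

    def record_fetch(self, ok: bool) -> None:
        # A successful fetch resets the consecutive-failure streak.
        self._failures = 0 if ok else self._failures + 1

    def record_processed(self) -> None:
        # Called whenever a batch of data is processed.
        self._last_processed_at = self._clock()

    def alerts(self) -> List[str]:
        active = []
        if self._failures >= self._max_failures:
            active.append("config_fetch_failing")
        if self._clock() - self._last_processed_at > self._max_idle:
            active.append("processing_stalled")
        return active
```

Counting consecutive failures rather than total failures keeps a single transient error from alerting, while the idle timer catches a stall even when fetches appear healthy.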
Posted Dec 09, 2025 - 22:46 PST

Resolved

This incident has been resolved.
Posted Dec 02, 2025 - 01:13 PST

Monitoring

The fix has been deployed and we’re waiting for the lag to come down.
Posted Dec 02, 2025 - 01:04 PST

Update

The fix has been deployed and we’re waiting for the lag to come down.
Posted Dec 02, 2025 - 00:50 PST

Update

We are still working on resolving the issue.
Posted Dec 02, 2025 - 00:38 PST

Identified

We have identified the issue and are working toward resolving it.
Posted Dec 02, 2025 - 00:06 PST

Investigating

We are currently investigating the issue.
Posted Dec 01, 2025 - 23:55 PST
This incident affected: Traceable (APAC - app.apac.traceable.ai / api.apac.traceable.ai).