Security Testing Orchestration Outage – EU Region

Incident Report for Harness

Postmortem

Summary

Between 19:15 PST and 23:15 PST on 23 November 2025, customers on the EU1 cluster experienced an STO service outage. The outage was caused by a spike in memory usage in the refid-cache sidecar, which pushed the sto-core pods into an unhealthy state.

Root Cause

The refid-cache sidecar was configured with a 256 MiB memory limit. During an unusually large CVE/EPSS data sync, memory usage exceeded this limit, resulting in OOMKills. This caused the pod to be marked unhealthy and enter a CrashLoopBackOff state, rendering sto-core unavailable in the EU1 cluster.
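
For context, repeated OOMKills surface on a pod's container statuses as a last termination with reason "OOMKilled", alongside a climbing restart count. The following is a minimal sketch of how that condition can be spotted with the Kubernetes Python client; the namespace ("sto") and label selector ("app=sto-core") are illustrative placeholders, not values taken from this incident.

```python
from kubernetes import client, config

# Flag containers whose last termination was an OOMKill.
# Namespace and label selector below are illustrative assumptions.
config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(namespace="sto", label_selector="app=sto-core")
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        last = status.last_state.terminated
        if last and last.reason == "OOMKilled":
            print(
                f"{pod.metadata.name}/{status.name}: OOMKilled at "
                f"{last.finished_at}, restarts={status.restart_count}"
            )
```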

Impact

  • Customer Impact: All customers on the EU1 cluster were unable to use STO.

    • STO scans failed
    • STO API endpoints were unreachable
    • No STO functionality was available while sto-core was down
  • Other Environments: No impact on non-EU clusters.

  • Duration: Approximately four hours (19:15 PST – 23:15 PST, 23 November 2025)

Remediation

Immediate Fix

  • Increased the refid-cache sidecar memory limit from 256 MiB to 1 GiB.
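
As an illustration only, a limit change of this kind can be applied as a strategic-merge patch on the owning Deployment. This is a sketch, not the exact change Harness made: the Deployment name ("sto-core") and namespace ("sto") are assumptions, and only the 1 GiB limit comes from this report.

```python
from kubernetes import client, config

# Sketch of the memory-limit bump as a strategic-merge patch.
# Deployment name and namespace are illustrative assumptions.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "refid-cache",
                        "resources": {"limits": {"memory": "1Gi"}},
                    }
                ]
            }
        }
    }
}

config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment(
    name="sto-core", namespace="sto", body=patch
)
```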

Action Items

  • Implement enhanced monitoring and alerting on memory utilization and its downstream impact on STO availability (a rough sketch of such a check follows this list).
  • Update the service design so that failure of an individual component, such as the refid-cache sidecar, degrades gracefully instead of causing a full STO outage.
  • Continue memory profiling and load testing of CVE/EPSS sync workloads to proactively validate and optimize memory limits and prevent similar issues.
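
As a starting point for the monitoring action item, memory usage can be compared against the configured limit via the metrics.k8s.io API. This sketch assumes metrics-server is available; the namespace, container filter, and 80% threshold are illustrative assumptions rather than Harness's actual alerting configuration.

```python
from kubernetes import client, config

# Rough memory-utilization check against the metrics.k8s.io API.
LIMIT_BYTES = 1 * 1024**3   # current refid-cache memory limit (1 GiB)
THRESHOLD = 0.80            # flag usage above 80% of the limit

UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '900Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)

config.load_kube_config()
metrics = client.CustomObjectsApi().list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1", namespace="sto", plural="pods"
)

for pod in metrics["items"]:
    for container in pod["containers"]:
        if container["name"] != "refid-cache":
            continue
        usage = to_bytes(container["usage"]["memory"])
        if usage > THRESHOLD * LIMIT_BYTES:
            print(f"ALERT {pod['metadata']['name']}: memory at "
                  f"{usage / LIMIT_BYTES:.0%} of limit")
```
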
Posted Nov 30, 2025 - 21:48 PST

Resolved

This incident has been resolved.
Posted Nov 24, 2025 - 00:00 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 23, 2025 - 23:15 PST

Update

We are continuing to work on a fix for this issue.
Posted Nov 23, 2025 - 23:05 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 23, 2025 - 19:15 PST
This incident affected: Prod Eu1 (Security Testing Orchestration (STO)).