Security Testing Orchestration Outage – EU Region

Incident Report for Harness

Postmortem

Summary

Between 19:15 PST and 23:15 PST on 23 November 2025, customers on the EU1 cluster experienced an STO service outage. The outage was caused by a spike in memory usage in the refid-cache sidecar, which pushed the sto-core pods into an unhealthy state.

Root Cause

The refid-cache sidecar was configured with a 256 MiB memory limit. During an unusually large CVE/EPSS data sync, memory usage exceeded this limit, resulting in OOMKills. This caused the pod to be marked unhealthy and enter a CrashLoopBackOff state, rendering sto-core unavailable in the EU1 cluster.
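
For context, repeated OOMKills surface on a pod's container statuses as a last termination with reason "OOMKilled", alongside a climbing restart count. The following is a minimal sketch of how that condition can be spotted with the Kubernetes Python client; the namespace ("sto") and label selector ("app=sto-core") are illustrative placeholders, not values taken from this incident.

```python
from kubernetes import client, config

# Flag containers whose last termination was an OOMKill.
# Namespace and label selector below are illustrative assumptions.
config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(namespace="sto", label_selector="app=sto-core")
for pod in pods.items:
    for status in pod.status.container_statuses or []:
        last = status.last_state.terminated
        if last and last.reason == "OOMKilled":
            print(
                f"{pod.metadata.name}/{status.name}: OOMKilled at "
                f"{last.finished_at}, restarts={status.restart_count}"
            )
```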

Impact

  • Customer Impact: All customers on the EU1 cluster were unable to use STO.

    • STO scans failed
    • STO API endpoints were unreachable
    • No STO functionality was available while sto-core was down
  • Other Environments: No impact on non-EU clusters.

  • Duration: Approximately four hours (19:15 PST – 23:15 PST, 23 November 2025)

Remediation

Immediate Fix

  • Increased the refid-cache sidecar memory limit from 256 MiB to 1 GiB.
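
As an illustration only, a limit change of this kind can be applied as a strategic-merge patch on the owning Deployment. This is a sketch, not the exact change Harness made: the Deployment name ("sto-core") and namespace ("sto") are assumptions, and only the 1 GiB limit comes from this report.

```python
from kubernetes import client, config

# Sketch of the memory-limit bump as a strategic-merge patch.
# Deployment name and namespace are illustrative assumptions.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "refid-cache",
                        "resources": {"limits": {"memory": "1Gi"}},
                    }
                ]
            }
        }
    }
}

config.load_kube_config()
client.AppsV1Api().patch_namespaced_deployment(
    name="sto-core", namespace="sto", body=patch
)
```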

Action Items

  • Implement enhanced monitoring and alerting on memory utilization and its downstream impact on STO availability (a rough sketch of such a check follows this list).
  • Update the service design so that failure of an individual component, such as the refid-cache sidecar, degrades gracefully instead of causing a full STO outage.
  • Continue memory profiling and load testing of CVE/EPSS sync workloads to proactively validate and optimize memory limits and prevent similar issues.
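
As a starting point for the monitoring action item, memory usage can be compared against the configured limit via the metrics.k8s.io API. This sketch assumes metrics-server is available; the namespace, container filter, and 80% threshold are illustrative assumptions rather than Harness's actual alerting configuration.

```python
from kubernetes import client, config

# Rough memory-utilization check against the metrics.k8s.io API.
LIMIT_BYTES = 1 * 1024**3   # current refid-cache memory limit (1 GiB)
THRESHOLD = 0.80            # flag usage above 80% of the limit

UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}

def to_bytes(quantity: str) -> int:
    """Convert a Kubernetes memory quantity such as '900Mi' to bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)

config.load_kube_config()
metrics = client.CustomObjectsApi().list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1", namespace="sto", plural="pods"
)

for pod in metrics["items"]:
    for container in pod["containers"]:
        if container["name"] != "refid-cache":
            continue
        usage = to_bytes(container["usage"]["memory"])
        if usage > THRESHOLD * LIMIT_BYTES:
            print(f"ALERT {pod['metadata']['name']}: memory at "
                  f"{usage / LIMIT_BYTES:.0%} of limit")
```
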
Posted Nov 30, 2025 - 21:48 PST

Resolved

This incident has been resolved.
Posted Nov 24, 2025 - 00:00 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 23, 2025 - 23:15 PST

Update

We are continuing to work on a fix for this issue.
Posted Nov 23, 2025 - 23:05 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 23, 2025 - 19:15 PST
This incident affected: Prod Eu1 (Security Testing Orchestration (STO)).