Intermittent STO step failure with 500 error

Incident Report for Harness

Postmortem

Summary

On March 25, 2026, between approximately 3:30 PM and 6:00 PM IST, the STO service in the Prod1 environment experienced intermittent failures while processing scan uploads. This resulted in step failures for some pipeline executions during the incident window.

Root Cause

During a scheduled internal data backfill activity, the STO service experienced increased database load. Concurrently, a recent change in the scan upload processing path introduced additional latency under these conditions.

The combination of elevated load and increased query execution time caused some scan upload requests to exceed processing thresholds and fail. Retry attempts further amplified system load, leading to intermittent failures.

Impact

  • Intermittent scan upload failures (500 errors) during pipeline execution
  • Some pipelines experienced step failures or delays due to retries
  • No impact to previously uploaded scan results or other STO functionality

Mitigation/Remediation

Immediate

  • Stopped the internal backfill activity to reduce database load
  • Optimized the scan upload processing query

Permanent

  • Introduced safeguards for background jobs to prevent impact on production workloads
  • Improved performance of critical database paths
  • Enhanced monitoring to detect abnormal load and retry amplification earlier

Action Items

To prevent such issues from happening again:

  • Implement throttling and isolation for background/backfill jobs
  • Add protections for critical request paths under load
  • Improve alerting on database latency and retry patterns
  • Strengthen validation for production-like load conditions
Posted Mar 26, 2026 - 12:55 PDT

Resolved

This incident has been resolved.
Posted Mar 25, 2026 - 05:30 PDT

Investigating

We are currently investigating this issue.
Posted Mar 25, 2026 - 03:00 PDT
This incident affected: Prod 1 (Security Testing Orchestration (STO)).