Policy evaluation (OPA) failure in Prod2

Incident Report for Harness

Postmortem

Summary

On January 26, 2026, the Policy Management service in the Prod2 environment experienced a partial outage during a deployment. A database migration attempted to create a new index on a large production table containing more than 70 million records. The index creation caused elevated database load and temporary service instability, which led to failed policy evaluations. The issue resolved on its own once the index creation completed, and service stability was restored.

Impact

During the incident window (approximately 11:15 AM – 11:28 AM PST):

  • Policy evaluations in Prod2 failed intermittently.
  • Pipelines and deployments dependent on OPA-based policy evaluations failed during this period.
  • Other database operations unrelated to the affected table continued to function normally.

This was a partial outage limited to policy evaluation functionality in Prod2.

There was no data loss, and the impact was confined to the duration of the index creation process.

Root Cause

As part of a deployment, a new database index was introduced on a large production table used for policy evaluations. The index creation was executed through an application migration process.

Because the table contained a significant volume of records, the index creation operation placed heavy load on the database and temporarily restricted access to the table. This caused the Policy Management service to encounter repeated database connection failures and enter crash-restart loops.
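As an illustration of the failure mode described above, the following is a minimal sketch of a guard that a migration runner could apply before building an index during a deployment. It is not the Policy Management service's actual migration code: the names `IndexMigration` and `plan_index_migration`, the table and index names, and the row threshold are all hypothetical, and the row estimate is assumed to come from the database's own statistics.

```python
from dataclasses import dataclass

# Hypothetical threshold: indexes on tables larger than this are built
# out of band instead of inside the application migration.
LARGE_TABLE_ROW_THRESHOLD = 1_000_000


@dataclass(frozen=True)
class IndexMigration:
    """Describes an index that a deployment wants to create."""
    table: str
    index_name: str


def plan_index_migration(
    migration: IndexMigration,
    estimated_rows: int,
    threshold: int = LARGE_TABLE_ROW_THRESHOLD,
) -> str:
    """Return "inline" if the index can be built during deployment,
    or "out_of_band" if the table is too large to index safely inline."""
    if estimated_rows < 0:
        raise ValueError("estimated_rows must be non-negative")
    return "out_of_band" if estimated_rows > threshold else "inline"


if __name__ == "__main__":
    # Hypothetical table and index names, with a row count like the one in this incident.
    migration = IndexMigration(table="policy_evaluations", index_name="idx_example")
    print(plan_index_migration(migration, estimated_rows=70_000_000))
```

Under these assumptions, a table of the size involved in this incident would be routed to an out-of-band index build rather than indexed as part of the deployment.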

Mitigation

Immediate mitigation occurred when:

  • The index creation completed successfully.
  • Database load normalized.
  • Policy Management service instances recovered automatically and resumed normal operation.

Post-incident verification confirmed:

  • Required indexes were properly created.
  • Database performance metrics returned to normal levels.
  • Policy evaluation functionality was fully restored.

For subsequent production environments, required indexes were created manually and verified prior to deployment to prevent recurrence.

Action Items

To prevent similar incidents in the future, the following actions are being implemented:

  • Enhance the process for creating indexes on large or critical tables so that it runs outside of application deployment workflows.
  • Enhance pre-deployment validation to check that required indexes are in place.
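
A pre-deployment index check of the kind listed above could look roughly like the sketch below. It is an assumption-laden illustration, not the validation being implemented: the function name `missing_indexes` and the table and index names are hypothetical, and the `existing` mapping is assumed to have been read from the database catalog by some other step.

```python
from __future__ import annotations


def missing_indexes(
    required: dict[str, set[str]],
    existing: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return, per table, the required index names that do not exist yet.

    Both arguments map a table name to a set of index names.
    Tables with nothing missing are omitted from the result.
    """
    missing: dict[str, set[str]] = {}
    for table, names in required.items():
        absent = names - existing.get(table, set())
        if absent:
            missing[table] = absent
    return missing


if __name__ == "__main__":
    # Hypothetical table and index names, for illustration only.
    required = {"policy_evaluations": {"idx_a", "idx_b"}}
    existing = {"policy_evaluations": {"idx_a"}}
    gaps = missing_indexes(required, existing)
    if gaps:
        # A deployment gate could stop here instead of letting the
        # application migration build the index on a large table.
        raise SystemExit(f"Missing required indexes: {gaps}")
```

A gate like this fails fast before rollout, so a missing index is created and verified ahead of time rather than during the deployment itself.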
Posted Feb 11, 2026 - 10:08 PST

Resolved

This incident has been resolved.
Posted Jan 27, 2026 - 22:12 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 27, 2026 - 22:11 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise(CIE) - Self Hosted Runners, Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Windows Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds, Security Testing Orchestration (STO), Software Supply Chain Assurance (SSCA)).