[Prod1/Prod2/Prod3] SEI Service Degraded for Jenkins Users

Incident Report for Harness

Postmortem

Problem

Some customers sending custom CI/CD data to SEI reported receiving 500 Internal Server Error responses.

Root cause

The issue was caused by a recent change that implemented a new feature through a new data flow path. The change was behind a feature flag, but one part of the code lacked error handling, specifically the part exercised when the API received data for custom CI/CD events. By design, the API continued to stage data even during this failure. This enabled us to restore data for a specific period of the incident, so customers do not need to resend their data.
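
As an illustration only (this is not the actual SEI code), the "stage first, then process" behavior described above could be sketched as follows. Every name here (`Ingestor`, `new_path`, `legacy_path`, `backfill`) is hypothetical, and returning 202 on a processing failure is an assumed design choice, not something stated in this report.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestResult:
    status: int       # HTTP-style status returned to the caller
    staged: bool      # raw payload was saved for later replay
    processed: bool   # payload made it through a processing path


@dataclass
class Ingestor:
    # Hypothetical ingestion endpoint: stage the raw event, then process it.
    new_path_enabled: bool                 # stands in for the feature flag
    new_path: Callable[[dict], None]       # new data flow path
    legacy_path: Callable[[dict], None]    # previous stable path
    staged: list = field(default_factory=list)
    failed: list = field(default_factory=list)

    def ingest(self, event: dict) -> IngestResult:
        # Stage before processing so the payload survives any later failure.
        self.staged.append(event)
        try:
            if self.new_path_enabled:
                self.new_path(event)
            else:
                self.legacy_path(event)
        except Exception:
            # Assumed behavior: record the failure for backfill and accept
            # the request instead of surfacing a 500 to the sender.
            self.failed.append(event)
            return IngestResult(status=202, staged=True, processed=False)
        return IngestResult(status=200, staged=True, processed=True)

    def backfill(self) -> int:
        # Replay events that failed earlier through the legacy path.
        pending, self.failed = self.failed, []
        for event in pending:
            self.legacy_path(event)
        return len(pending)


processed_events = []


def example_new_path(event: dict) -> None:
    # Simulates the bug class: assumes a field that custom events lack.
    _ = event["pipeline"]["id"]


def example_legacy_path(event: dict) -> None:
    processed_events.append(event)


ingestor = Ingestor(True, example_new_path, example_legacy_path)
result = ingestor.ingest({"type": "custom_cicd", "job": "build"})

ingestor.new_path_enabled = False   # analogous to rolling back
restored = ingestor.backfill()
```

In this sketch the sender never has to resend anything: the failed event stays in the staging list and is replayed once the stable path is active again.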

The errors went unnoticed by our monitoring and alerting systems because their low volume did not meet the alerting threshold. We have since updated the thresholds.
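
To show why a low-volume failure can slip past an alert, here is a minimal sketch that assumes the original rule fired only on an absolute error count. The report does not describe the actual rule or the updated thresholds; the function names and all numeric values below are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    # Request and error counts for one endpoint in one evaluation window.
    requests: int
    errors: int


def count_only_alert(w: Window, min_errors: int = 50) -> bool:
    # Fires only when the absolute number of errors is high.
    return w.errors >= min_errors


def low_volume_aware_alert(
    w: Window,
    min_errors: int = 50,
    min_requests: int = 3,
    min_error_ratio: float = 0.5,
) -> bool:
    # Fires on a high error count, or on a high error ratio once there are
    # enough requests in the window to make the ratio meaningful.
    if w.errors >= min_errors:
        return True
    if w.requests < min_requests:
        return False
    return w.errors / w.requests >= min_error_ratio


low_volume_outage = Window(requests=6, errors=6)      # every request fails
healthy_busy_path = Window(requests=10_000, errors=20)
single_failure = Window(requests=1, errors=1)
```

With these assumed values, `low_volume_outage` is missed by the count-only rule but caught by the ratio-aware rule, while `healthy_busy_path` and `single_failure` stay quiet under both.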

Mitigation

Upon identifying the issue, the engineering team immediately rolled back to the previous stable version and triggered backfill jobs to restore data from the staging area. We have now restored data for the affected period for all customers impacted by this issue.

Next Steps

We have implemented stricter measures to detect even low volumes of errors, which will help us identify scenarios like this sooner. We have also taken the opportunity to review our error handling strategy throughout the platform and, wherever applicable, have added default handling so that such errors do not fail the actual request.

We apologize for the inconvenience caused to our customers and are deeply committed to making the platform more resilient to these failures going forward.

Posted Jun 26, 2025 - 21:37 PDT

Resolved

This incident has been resolved.
Posted Jun 24, 2025 - 11:00 PDT

Monitoring

After testing, a fix has been rolled out to our Prod 1/2/3 environments. We are currently monitoring these environments.
Posted Jun 24, 2025 - 10:50 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 24, 2025 - 10:07 PDT

Update

We are continuing to work on a fix for this issue.
Posted Jun 24, 2025 - 09:37 PDT

Identified

We've identified an issue with our SEI service when it's used with the Jenkins plugin, causing 500 errors. We are working on a fix now.
Posted Jun 24, 2025 - 09:09 PDT
This incident affected: Prod 1 (Software Engineering Insights (SEI)), Prod 2 (Software Engineering Insights (SEI)), and Prod 3 (Software Engineering Insights (SEI)).