Some customers sending custom CI/CD data to SEI reported receiving HTTP 500 (Internal Server Error) responses.
The issue was caused by a recent change that implemented a new feature through a new data flow path. The change was behind a feature flag, but the new path lacked error handling for one case: receiving data for custom CI/CD events. By design, the API continued to stage data even during this failure, which enabled us to restore data for the affected period of the incident. Customers do not need to resend their data.
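The stage-first behavior described above can be illustrated with a minimal sketch. This is not SEI's actual implementation: `handle_event`, `backfill`, the in-memory `staging` list, and the JSON payload shape are all hypothetical names and assumptions chosen for illustration.

```python
import json
from typing import Callable

def handle_event(raw: bytes, staging: list, process: Callable[[dict], None]) -> int:
    """Stage the raw payload first, then process it; return an HTTP-style status."""
    staging.append(raw)  # staged before processing, so a later failure cannot lose it
    try:
        process(json.loads(raw))
    except Exception:
        return 500  # the request fails, but the payload is already staged
    return 200

def backfill(staging: list, process: Callable[[dict], None]) -> int:
    """Replay staged payloads through a (fixed) processor; return the number replayed.

    Assumes processing is idempotent, since payloads that originally
    succeeded are replayed as well.
    """
    replayed = 0
    for raw in staging:
        process(json.loads(raw))
        replayed += 1
    return replayed
```

Because staging happens before processing, a failure in the new path surfaces as an error to the caller while the payload remains available for a later replay.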
The errors went unnoticed by our monitoring and alerting systems because their volume was low and did not meet the alerting threshold. We have since updated the thresholds.
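To illustrate how a low volume of errors can stay under a threshold, here is a small sketch comparing a rate-only rule with one that also checks an absolute count. The function names and the threshold values (1% rate, 5 errors) are hypothetical and do not reflect our actual alerting configuration.

```python
def rate_only_alert(errors: int, requests: int, max_rate: float = 0.01) -> bool:
    """Alert only when the error rate exceeds max_rate."""
    if requests == 0:
        return False
    return errors / requests > max_rate

def rate_or_count_alert(errors: int, requests: int,
                        max_rate: float = 0.01, max_errors: int = 5) -> bool:
    """Alert when either the error rate or the absolute error count is exceeded."""
    return rate_only_alert(errors, requests, max_rate) or errors > max_errors
```

With these illustrative values, 20 errors in 10,000 requests is a 0.2% error rate, so the rate-only rule stays silent while the combined rule fires.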
Upon identifying the issue, the engineering team immediately rolled back to the previous stable version and triggered backfill jobs to restore data from the staging area. We have now restored data for the affected period for all customers impacted by this issue.
We have implemented stricter measures to detect even low volumes of errors, which will help us identify scenarios like this sooner. We have also reviewed our error handling strategy throughout the platform and, wherever applicable, now fall back to default values instead of failing the request.
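As a rough sketch of what falling back to defaults can look like, the example below fills missing optional fields rather than raising an error. The field names, the default values, and the `normalize_event` helper are hypothetical and are not taken from the SEI codebase.

```python
# Hypothetical optional fields and their fallback values.
DEFAULTS = {"status": "unknown", "duration_ms": 0}

def normalize_event(event: dict) -> tuple:
    """Fill missing optional fields with defaults instead of raising.

    Returns the normalized event and the list of fields that were defaulted,
    so the fallback can still be logged or counted.
    """
    normalized = dict(event)  # copy so the caller's dict is not mutated
    defaulted = []
    for field, default in DEFAULTS.items():
        if normalized.get(field) is None:
            normalized[field] = default
            defaulted.append(field)
    return normalized, defaulted
```

Returning the list of defaulted fields keeps the fallback visible, so it can feed the same monitoring that watches for low-volume errors.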
We apologize for the inconvenience caused to our customers and are deeply committed to making the platform more resilient to these failures going forward.