Security Testing Orchestration (STO) and IaCM modules impacted
Incident Report for Harness
Postmortem

Overview: Security Testing Orchestration (STO) and IaCM modules impacted

What was the issue?

The STO and IaCM modules could not complete their executions, which caused pipeline executions to time out. The Redis keys had been rotated, but the two microservices responsible for these modules were still using the old keys.

Timeline:

Time                        Event
25th Apr 2024 7:03 AM PDT   Issue was noticed and investigation started.
25th Apr 2024 7:35 AM PDT   Issue was identified.
25th Apr 2024 7:43 AM PDT   Issue was resolved for STO. We continued monitoring.
25th Apr 2024 7:49 AM PDT   Issue was resolved for IaCM. We continued monitoring.
25th Apr 2024 8:00 AM PDT   All modules were declared operational.

Resolution:

The STO and IaCM modules were updated to use the new keys.

RCA & Action Items:

Two microservices were missed in the key update because their configuration formats differed between QA and Production. Our change management process did not account for this discrepancy. As part of the improvement process, we will standardize the configurations across environments and add checks for key rotation to the change management process.
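
As an illustration only, a key rotation check of the kind described above might look like the following minimal sketch. It assumes two hypothetical configuration layouts (one flat, one nested), since the actual formats are not described in this report. The service names, field names, and key identifiers are likewise invented for the example and do not reflect the real Harness configuration.

```python
from typing import Any, Dict, List, Optional


def extract_redis_key_id(config: Dict[str, Any]) -> Optional[str]:
    """Return the Redis key identifier from either assumed layout, or None.

    Both layouts are hypothetical:
      flat:   {"redis_key_id": "..."}
      nested: {"redis": {"auth": {"key_id": "..."}}}
    """
    if "redis_key_id" in config:
        return config["redis_key_id"]
    redis = config.get("redis")
    if isinstance(redis, dict):
        auth = redis.get("auth")
        if isinstance(auth, dict):
            return auth.get("key_id")
    return None


def find_stale_services(
    configs: Dict[str, Dict[str, Any]], current_key_id: str
) -> List[str]:
    """List services whose config lacks a key or references a non-current one."""
    return sorted(
        name
        for name, cfg in configs.items()
        if extract_redis_key_id(cfg) != current_key_id
    )


if __name__ == "__main__":
    # Hypothetical production configs, checked after rotating to "key-v2".
    production = {
        "sto-service": {"redis_key_id": "key-v1"},
        "iacm-service": {"redis": {"auth": {"key_id": "key-v1"}}},
        "pipeline-service": {"redis_key_id": "key-v2"},
    }
    stale = find_stale_services(production, "key-v2")
    print("Services still on an old key:", stale)
```

Normalizing every layout through a single extraction function means a service with an unfamiliar format is reported as stale rather than silently skipped, which is the failure mode described in the RCA.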

Posted Apr 26, 2024 - 16:45 PDT

Resolved
This incident has been resolved. The team will work on an RCA and share it as soon as possible.

We apologize for any inconvenience this may have caused.
Posted Apr 25, 2024 - 08:00 PDT
Monitoring
We are monitoring the systems now.
Posted Apr 25, 2024 - 07:53 PDT
Update
IaCM is operational again as well. We will continue to monitor the system.
Posted Apr 25, 2024 - 07:49 PDT
Update
We have resolved the issue for Feature Flags (FF), and the service is operational again.
Posted Apr 25, 2024 - 07:46 PDT
Identified
The issue has been identified, and we have resolved it for STO in Prod 1 and Prod 2.
Posted Apr 25, 2024 - 07:43 PDT
Update
We are continuing to investigate this issue.
Posted Apr 25, 2024 - 07:40 PDT
Investigating
We are currently investigating this issue.
Posted Apr 25, 2024 - 07:03 PDT
This incident affected: Prod 1 (Feature Flags (FF), Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM)) and Prod 2 (Feature Flags (FF), Security Testing Orchestration (STO), Infrastructure as Code Management (IaCM)).