CCM: AutoStopping rules not working

Incident Report for Harness

Postmortem

Incident Summary:
On January 29, 2024, a disruption in the Prod 2 environment affected the execution of AutoStopping rules, and users reported issues. Total downtime was 56 minutes, and the incident was resolved in 1 hour and 17 minutes.

Timeline of Events:

Timestamp (UTC)	Event
January 29, 2024, 06:13 AM	FireHydrant incident was opened.
January 29, 2024, 06:13 AM	Incident acknowledged, and internal investigation initiated on the incident Slack channel.
January 29, 2024, 06:24 AM	Root cause identified: A component critical for rule execution encountered errors.
January 29, 2024, 06:57 AM	Immediate resolution applied to address the identified component issue.
January 29, 2024, 07:20 AM	System stability restored; rule execution returned to near-optimal levels.
January 29, 2024, 07:34 AM	FireHydrant incident closed, and the incident marked as resolved.

Root Cause Analysis:

The incident originated in the AutoStopping feature in the Prod 2 environment, where a component critical to rule execution failed. As a result, rule operations were disrupted and messages failed to transition to the enqueued state.

The system relies on a data store that had difficulty persisting data, which led to the operational failures. The root cause was a capacity limitation in a specific data storage component, which left it unable to handle the increased volume of messages during the incident.

Immediate Resolution:

To address the incident promptly, the team increased the capacity of the affected component. This allowed for the expedited processing of rule operations and a swift resolution of the issue.

Preventive Measures:

To prevent similar incidents in the future, the team has implemented enhanced monitoring to receive timely notifications of potential capacity issues. Proactive measures are being taken to ensure the system can effectively handle increased loads.
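As an illustration only, a capacity check of the kind described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the monitoring the team actually deployed: the `CapacityReading` type, the `capacity_alert` function, and the 75% and 90% thresholds are all hypothetical and do not come from the incident.

```python
from dataclasses import dataclass


@dataclass
class CapacityReading:
    """One capacity sample from a data store (field names are hypothetical)."""
    used_bytes: int
    total_bytes: int


def capacity_alert(reading: CapacityReading,
                   warn_ratio: float = 0.75,
                   critical_ratio: float = 0.90) -> str:
    """Classify a capacity sample as 'ok', 'warning', or 'critical'.

    The default thresholds are illustrative assumptions, not values
    taken from the incident.
    """
    if reading.total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = reading.used_bytes / reading.total_bytes
    # Check the stricter threshold first so 'critical' takes precedence.
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

In a sketch like this, the warning level is what would give operators time to add capacity before the store stops persisting data; the critical level would page on-call directly.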

Conclusion:

The incident was successfully resolved through immediate actions to increase resource capacity. The team is committed to implementing proactive measures to enhance system monitoring and prevent similar occurrences, ensuring the stability and reliability of the system for all users.

Posted Feb 01, 2024 - 00:45 PST

Resolved

This incident has been resolved.

Posted Jan 28, 2024 - 23:23 PST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 28, 2024 - 23:21 PST

Identified

The issue has been identified and a fix is being implemented.

Posted Jan 28, 2024 - 22:36 PST

Investigating

We are currently investigating this issue.

Posted Jan 28, 2024 - 22:26 PST

This incident affected: Prod 2 (Cloud Cost Management (CCM)).