Dependency workload not coming up for Autostopping
Incident Report for Harness
Postmortem

Overview

On November 7, 2023, a delay in the Autostopping feature of the Harness CCM platform was observed, affecting certain customers: Autostopping rules under fixed schedules did not start at the expected time. The issue was traced to the system not correctly accounting for daylight saving time. The incident was resolved within 6 hours, and customers experienced no actual downtime.

Timeline

Time | Event
November 7, 2023, 6:00 PM IST | Customer reported that Autostopping rules were not initiating as expected.
November 7, 2023, 6:27 PM IST | Incident response initiated.
November 7, 2023, 7:00 PM IST | Confirmed customers could manually start resources.
November 7, 2023, 7:00 PM IST | Issue identified: fixed schedules were not accounting for daylight saving time.
November 7, 2023, 8:00 PM IST | Ensured there were no infrastructure issues.
November 7, 2023, 8:30 PM IST | Generated new cron entries that account for daylight saving time.
November 7, 2023, 11:45 PM IST | Completed regenerating cron entries for all schedules.

Root Cause Analysis

The delay in Autostopping rules starting under fixed schedules was due to the system not accounting for daylight saving time.

The issue was traced back through the following steps:

  1. Affected Autostopping rules were under fixed schedules with warm-up operations starting one hour before the expected time.
  2. The use of a time zone that observes daylight saving time (America/New_York) caused the schedule to start an hour earlier than expected.
  3. The initial generation of cron entries for schedules did not consider daylight saving time.
  4. The system erroneously executed the idle time job because the schedule was triggered early, despite being under a fixed schedule.
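
The one-hour shift described in the steps above can be illustrated with a small sketch. It assumes, hypothetically, that a schedule's local start time is converted once into a UTC cron entry using the UTC offset in effect on the day of generation; the actual Harness implementation is not described in this report, and the function names utc_cron_for and local_fire_time are invented for illustration.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

TZ = ZoneInfo("America/New_York")

def utc_cron_for(local_start: time, generated_on: date) -> str:
    """Freeze a local start time into a daily UTC cron entry,
    using whatever UTC offset applies on the generation date (hypothetical)."""
    local = datetime.combine(generated_on, local_start, tzinfo=TZ)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"

def local_fire_time(cron: str, day: date) -> time:
    """Local wall-clock time at which a daily UTC cron entry fires on a given day."""
    minute, hour = (int(field) for field in cron.split()[:2])
    fired = datetime(day.year, day.month, day.day, hour, minute, tzinfo=timezone.utc)
    return fired.astimezone(TZ).time()

# Entry generated while New York is on daylight time (UTC-4).
cron = utc_cron_for(time(9, 0), date(2023, 10, 30))

# Same entry, evaluated before and after the November 5, 2023 transition.
fires_before = local_fire_time(cron, date(2023, 10, 30))
fires_after = local_fire_time(cron, date(2023, 11, 7))
print(cron, fires_before, fires_after)
```

Under these assumptions, an entry generated for a 9:00 local start keeps firing at the same UTC hour after the clocks move back, which lands at 8:00 local time, one hour earlier than the schedule intends.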

Action Items

  1. The team is working on a long-term fix so that daylight saving time is automatically accounted for in fixed schedules.
  2. Include daylight saving time computation in our test plan.
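
One possible shape for such a fix and its test, offered only as a sketch and not as the team's actual approach, is to resolve the UTC start time separately for each occurrence from the schedule's local time and IANA time zone name, then check dates on both sides of a transition. The helper name utc_start_for is hypothetical.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

def utc_start_for(day: date, local_start: time, tz_name: str) -> datetime:
    """Resolve the UTC instant of a local start time on a specific day,
    so the offset for that day (standard or daylight time) is applied."""
    local = datetime.combine(day, local_start, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# 9:00 local start, checked on both sides of the November 5, 2023 transition.
before = utc_start_for(date(2023, 11, 3), time(9, 0), "America/New_York")
after = utc_start_for(date(2023, 11, 7), time(9, 0), "America/New_York")

# The UTC hour moves by one so that the local wall-clock time stays fixed.
assert before.hour == 13
assert after.hour == 14
```

A test plan along these lines would also cover the spring transition and time zones that do not observe daylight saving time.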
Posted Nov 08, 2023 - 08:57 PST

Resolved
This incident has been resolved.
Posted Nov 07, 2023 - 07:26 PST
Update
The customer was able to manually start the resources, so there was no downtime. The team is looking into what went wrong with the schedule.
Posted Nov 07, 2023 - 07:04 PST
Update
We have reviewed our operational metrics and can confirm that this issue is isolated to some rules for a single customer.
Warm-up operations across all other customers are functioning as expected and are not impacted.
Posted Nov 07, 2023 - 06:34 PST
Monitoring
The customer has used the manual start button to start the resources, and the services are up and running. No other customer was impacted.
Posted Nov 07, 2023 - 05:57 PST
Identified
The issue has been identified. The customer has used the manual start button to start the resources, and the services are up and running.
Posted Nov 07, 2023 - 05:38 PST
Update
We are continuing to investigate this issue.
Posted Nov 07, 2023 - 05:21 PST
Investigating
We are reaching out to the Autostopping team to look into this.
Posted Nov 07, 2023 - 05:11 PST
This incident affected: Prod 2 (Cloud Cost Management (CCM)).