CI Harness Cloud builds were non-operational for Prod3.
Incident Report for Harness
Postmortem

Summary:

Customers in Prod3 were unable to run Hosted CI builds.

What was the issue:

The delegate lite microservice that is responsible for hosted CI builds got scaled down due to a misconfiguration so the customers were unable to run CI builds.

Resolution

Time Event
Dec 10 at 11:04 AM IST Customer reported an issue with running CI pipeline.
Dec 10 at 11:22 AM IST Redeployment of Delegate lite microservice was done in Prod3 and the issue resolved.

RCA

The delegate lite deployment follows a blue green deployment model. However, due to a misconfiguration, both the old and the new deployment were stopped after the maintenance window, which was intended to shut down only the old deployment. The issue did not get detected internally due to a misconfigured alert.

Action Items

  • Fix and test the alerts for Hosted CI and related services to ensure they trigger as desired.
  • Fix the delegate lite deployment to scale down only if the required number of containers are active.
Posted Dec 10, 2024 - 12:47 PST

Resolved
This incident is resolved and services are operational.
Posted Dec 09, 2024 - 22:59 PST
Investigating
We would like to notify CI Cloud builds for (Mac Cloud Builds & Linux Cloud Builds) was non-operational from 5:46 AM UTC till 6:06 AM UTC. Currently the services are restored and working fine.
Posted Dec 09, 2024 - 22:58 PST
This incident affected: Prod 3 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds, Continuous Integration Enterprise(CIE) - Linux Cloud Builds).