CI builds for macOS are experiencing an outage

Incident Report for Harness

Postmortem

Summary

On June 8, 2025, Harness detected unauthorized cryptomining activity affecting a small subset (under 1%) of machines within our dedicated Harness Cloud infrastructure. The anomalous behavior was promptly identified through our internal alerting mechanisms and eradicated within two hours of detection.

Impact

  • Harness Cloud jobs experienced temporary unavailability.
  • Freemium Linux builds were rerouted to alternative infrastructure with minimal disruption.
  • No other production systems or customer data were affected during this event.

Our investigation revealed that, as part of routine maintenance, a port on a configuration management instance was made available to the internet, which resulted in unauthorized access to this system. Although the exposure was limited to a brief window, the actor responsible was able to leverage this access to install a popular open-source miner on a subset of our systems.

While the impact was isolated to these systems, we took immediate and comprehensive action to contain the situation by failing Harness Cloud over to a standby disaster recovery environment. Our security and infrastructure teams initiated a full incident response protocol, including remediating the exposed entry point, revalidating system configurations, and prioritizing restoration efforts.

As part of our commitment to operational integrity, we decided, out of an abundance of caution, to reimage our Harness Cloud fleet. This step ensured all systems were returned to a clean, known-good state and allowed us to eliminate any latent risk from the environment.

Our Commitment to Security and Reliability

While this incident was contained and resolved quickly, we used it as an opportunity to further strengthen our Harness Cloud platform. We have accelerated efforts to align all supporting infrastructure with the same high standards of availability, operational oversight, and trust that customers expect from our core services.

Specifically, we have moved to centralize management of the affected configuration tooling, which programmatically applies our standard security deployment gates, including validation of network configuration for system deployments.
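The kind of network-configuration gate described above, as an added illustration that is not Harness's own tooling: it rejects any ingress rule that lets an untrusted source reach a port other than an explicitly public one. The trusted source ranges, the allowed public port (443) and the configuration-management port 8140 in the sample are assumptions, not values from the report.

```python
"""Pre-deployment gate: block ingress rules that expose non-public ports
(e.g. a configuration-management port) to the internet. Added example."""
import ipaddress
from dataclasses import dataclass

# Assumed policy, not taken from the incident report.
TRUSTED_SOURCES = [ipaddress.ip_network(c) for c in
                   ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16", "fc00::/7")]
PUBLIC_OK_PORTS = {443}          # the only ports any source may reach

@dataclass(frozen=True)
class IngressRule:
    name: str
    protocol: str    # "tcp", "udp" or "all"
    port_from: int
    port_to: int
    source: str      # CIDR such as "10.0.0.0/8" or "0.0.0.0/0"

def source_is_trusted(cidr: str) -> bool:
    """True only if the whole source range lies inside an assumed trusted block.
    Avoids ip_network().is_private on purpose: on some Python versions it
    reports 0.0.0.0/0 as private, which would let the worst rule through."""
    net = ipaddress.ip_network(cidr, strict=False)
    return any(net.version == t.version and net.subnet_of(t) for t in TRUSTED_SOURCES)

def rule_findings(rule: IngressRule) -> list[str]:
    """Reasons this rule fails the gate; an empty list means it passes."""
    if source_is_trusted(rule.source):
        return []
    lo, hi = (0, 65535) if rule.protocol == "all" else (rule.port_from, rule.port_to)
    exposed = sorted(set(range(lo, hi + 1)) - PUBLIC_OK_PORTS)
    if not exposed:
        return []
    return [f"{rule.name}: {rule.source} can reach {rule.protocol} port(s) "
            f"{exposed[0]}-{exposed[-1]}; only {sorted(PUBLIC_OK_PORTS)} may be public"]

def gate(rules: list[IngressRule]) -> list[str]:
    """Run every rule through the gate; deploy only when no findings come back."""
    return [f for r in rules for f in rule_findings(r)]

# Constructed sample: the third rule models the kind of exposure the report describes.
SAMPLE_RULES = [
    IngressRule("web-https", "tcp", 443, 443, "0.0.0.0/0"),
    IngressRule("cfgmgmt-from-vpn", "tcp", 8140, 8140, "10.0.0.0/8"),  # 8140 is assumed
    IngressRule("cfgmgmt-temp-debug", "tcp", 8140, 8140, "0.0.0.0/0"),
    IngressRule("ssh-v6-anywhere", "tcp", 22, 22, "::/0"),
]

if __name__ == "__main__":
    for finding in gate(SAMPLE_RULES):
        print("BLOCKED:", finding)
    raise SystemExit(1 if gate(SAMPLE_RULES) else 0)
```

Wired into a deploy pipeline, a non-zero exit from this check stops the rollout before the rule ever reaches the instance.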

During the process of redeploying some systems, we discovered an infrastructure-imposed hard limit on the number of systems that can be rebooted or republished simultaneously, which extended the timeframe for full restoration. As a result, we are also exploring automated mechanisms to expedite redeployment of our fleet for disaster recovery purposes.

Posted Jun 24, 2025 - 11:55 PDT

Resolved

We have identified and resolved the incident. We will follow up with the detailed RCA.
Posted Jun 09, 2025 - 10:22 PDT

Update

We are continuing to monitor for any further issues.
Posted Jun 09, 2025 - 09:25 PDT

Update

We are continuing to monitor for any further issues.
Posted Jun 09, 2025 - 09:01 PDT

Update

We are continuing to monitor for any further issues.
Posted Jun 09, 2025 - 08:55 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 09, 2025 - 06:16 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Jun 09, 2025 - 00:54 PDT

Investigating

We are currently investigating this issue.
Posted Jun 08, 2025 - 20:07 PDT
This incident affected: Prod 4 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds), Prod 3 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds), Prod Eu1 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds), Prod 1 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds), and Prod 2 (Continuous Integration Enterprise(CIE) - Mac Cloud Builds).