CI stages are getting queued in Prod2

Incident Report for Harness

Postmortem

Summary:

Customers reported longer queue times for their Continuous Integration (CI) stages when using Harness Cloud infrastructure. Although queue limits were not reached, builds remained queued and waited for extended periods before progressing.

Timeline:

Time (UTC) | Event
March 4, 2025, 10:16 PM | A customer reported queued builds.
March 4, 2025, 11:06 PM | Increased queue limits for affected customers to unblock their builds.
March 5, 2025, 2:00 AM | Reverted the application release suspected of causing the issue.
March 5, 2025, 7:30 AM | Continued investigating, as we were still seeing some missed cleanups; also cleaned up stale metadata to prevent further queuing.
March 5, 2025, 4:50 PM | Observed a spike in resource consumption by our applications as peak load approached; mitigated by increasing resources, which stabilized the applications.
March 5, 2025, 7:40 PM | Narrowed the issue down to the Jackson library upgrade and started a rollback test in a lower environment.
March 5, 2025, 9:54 PM | Rolled back to the previous version of the application and continued to monitor. During this time we noticed increased resource consumption on our Mongo instance, which caused further instability and stuck CI stages.
March 5, 2025, 11:18 PM | Decided to roll the release forward and undo the revert, after which the system stabilized.
March 6, 2025, 3:26 PM | Completed the forward fix after stabilization and released it to production.

Resolution:

We immediately increased the queue sizes for impacted customers to enable their build stages to progress. Subsequently, we fixed the library issue and rolled out a new release. We are also improving our alerting and automation to proactively detect potential issues with resource cleanup at scale.

RCA:

A recent Jackson library upgrade slowed down the CI manager's cleanup thread, causing back pressure on the system during peak periods.
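The back-pressure effect can be sketched with a minimal, self-contained Java example (class and method names here are illustrative, not Harness code): a single cleanup worker draining a small bounded queue. When each cleanup pass slows down, incoming work overflows the queue and starts getting rejected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CleanupBackPressureDemo {
    // Stand-in for a cleanup worker: one thread draining a small bounded
    // queue. When each cleanup task is slow (here, 50 ms), fast submissions
    // overflow the queue and are rejected -- i.e., back pressure.
    static int submitCleanupTasks(int taskCount) {
        ThreadPoolExecutor cleanup = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(4));
        int rejected = 0;
        for (int i = 0; i < taskCount; i++) {
            try {
                cleanup.execute(() -> {
                    try {
                        Thread.sleep(50); // a cleanup pass made slow by the upgrade
                    } catch (InterruptedException ignored) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // producers now outpace the slowed consumer
            }
        }
        cleanup.shutdownNow();
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println("rejected cleanup tasks: " + submitCleanupTasks(20));
    }
}
```

With one worker and a queue of four, only a handful of the twenty submissions are accepted; the rest are rejected, which is the same shape of failure the slowed cleanup thread produced at peak load.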

With the Jackson library upgrade from 2.15.2 to 2.17.2, the ObjectMapper implementation changed to use a ReentrantLock internally. During persistence, Spring recursively reads instance objects and serializes their fields via reflection. However, Java restricts reflective access to ReentrantLock's internal fields, causing serialization exceptions.
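The reflective-access failure can be reproduced with plain JDK reflection, independent of Spring or Jackson (a minimal probe, not the actual persistence code). On JDK 16+ with default strong encapsulation, java.base does not open java.util.concurrent.locks to other modules, so opening ReentrantLock's private fields throws InaccessibleObjectException:

```java
import java.lang.reflect.Field;
import java.lang.reflect.InaccessibleObjectException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class LockReflectionProbe {
    // Try to open each declared field of ReentrantLock for reflective reads,
    // the way a reflection-based serializer would, and collect the fields the
    // JDK refuses to open. On JDK 16+ (no --add-opens), this is expected to
    // include ReentrantLock's private "sync" field.
    static List<String> inaccessibleFields() {
        List<String> blocked = new ArrayList<>();
        for (Field f : ReentrantLock.class.getDeclaredFields()) {
            try {
                f.setAccessible(true);
            } catch (InaccessibleObjectException e) {
                // java.base does not open java.util.concurrent.locks
                blocked.add(f.getName());
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        System.out.println("Fields blocked from reflection: " + inaccessibleFields());
    }
}
```

A serializer that walks object graphs this way hits the same exception as soon as it recurses into an object holding a ReentrantLock, which matches the failure mode described above.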

As a side effect of the Jackson library upgrade, load on one of our services increased significantly, causing pod restarts that left a few CI stage executions stuck.

The above led to pipelines entering a queued state and, due to resource constraints, some pipelines failing to execute.

Action Items:

  • Improve the monitoring and alerting for resource cleanup
  • Implement a cross-team process for validating the library upgrades
Posted Mar 11, 2025 - 12:44 PDT

Resolved

This incident has been resolved.
Posted Mar 04, 2025 - 19:23 PST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Mar 04, 2025 - 16:55 PST

Identified

The issue has been identified and a fix is being implemented.
Posted Mar 04, 2025 - 16:00 PST

Investigating

We are currently investigating this issue.
Posted Mar 04, 2025 - 15:50 PST
This incident affected: Prod 2 (Continuous Integration Enterprise(CIE) - Linux Cloud Builds).