Customers reported longer queue times for their Continuous Integration (CI) stages when using Harness Cloud infrastructure. Although queue limits had not been reached, builds remained queued and waited longer than usual to progress.
Time (UTC) | Event |
---|---|
March 4, 2025, 10:16 PM | Customer reported queued builds |
March 4, 2025, 11:06 PM | Increased the limits for customers to unblock them |
March 5, 2025, 2:00 AM | Reverted the application that was suspected to have caused the issue |
March 5, 2025, 7:30 AM | We continued to investigate because we were still seeing some missed cleanups. We also cleaned up stale metadata to prevent further queuing. |
March 5, 2025, 4:50 PM | As peak load approached, we saw a spike in the resources consumed by our apps. We mitigated it by increasing resources, which stabilized the app. |
March 5, 2025, 7:40 PM | The issue was narrowed down to the Jackson library upgrade, and we started a rollback test in a lower environment. |
March 5, 2025, 9:54 PM | We rolled back to the previous version of the application and continued to monitor. During this time we noticed increased resource consumption on our Mongo instance, which caused further stability issues and stuck CI stages. |
March 5, 2025, 11:18 PM | We decided to roll forward the release and undo the revert, after which the system stabilized. |
March 6, 2025, 3:26 PM | After stabilization, we worked on the forward fix and released it to production. |
We immediately increased the queue sizes for impacted customers so their build stages could progress. We subsequently fixed the library issue and rolled out a newer release. We are improving our alerting and automation to proactively detect potential issues with resource cleanup at scale.
A recent Jackson library upgrade slowed down the CI manager's cleanup thread, causing back pressure on the system during peak periods.
With the Jackson library upgrade from 2.15.2 to 2.17.2, the ObjectMapper implementation changed to use a ReentrantLock object. During persistence, Spring recursively reads instance objects and serializes them via reflection. However, Java restricts access to ReentrantLock fields via reflection, causing serialization exceptions.
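The reflection restriction described above can be reproduced in isolation. The following is a minimal sketch, not the Harness, Spring, or Jackson code path: the class `ReflectiveLockAccess`, the nested `PlainState` type, and the helper `canOpenFields` are hypothetical names introduced for illustration. It tries to open every instance field of an object the way a reflection-based persistence mapper might. The result for `ReentrantLock` depends on the runtime: recent JDKs with strong encapsulation enabled by default typically refuse access unless the package is opened explicitly, while older JDKs may permit it with a warning.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; not part of Harness, Spring, or Jackson.
class ReflectiveLockAccess {

    // Hypothetical application-owned state with no JDK-internal fields.
    static class PlainState {
        private String stageId = "stage-1";
    }

    // Returns true if every instance field declared on target's class can be
    // opened for reflective access, false if the runtime refuses any of them.
    static boolean canOpenFields(Object target) {
        for (Field field : target.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers())) {
                continue; // a mapper would only walk instance state
            }
            try {
                field.setAccessible(true);
            } catch (RuntimeException e) {
                // On strongly encapsulated JDKs this is an
                // InaccessibleObjectException: java.base does not open
                // java.util.concurrent.locks to application code by default.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("PlainState fields openable: "
                + canOpenFields(new PlainState()));
        System.out.println("ReentrantLock fields openable: "
                + canOpenFields(new ReentrantLock()));
    }
}
```

If the second check reports false on your runtime, any object graph that reaches a ReentrantLock during reflective traversal would hit the same restriction, which matches the serialization exceptions described above.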
As a side effect of the Jackson library upgrade, the load on one of our services increased significantly, causing pod restarts that led to stuck executions of a few CI stages.
The above led to pipelines entering a queued state and, due to resource constraints, some pipelines failing to execute.