A rollout of OpenTelemetry instrumentation changes introduced a memory leak in the OpenTelemetry eBPF collector running in production clusters. Under sustained production traffic, the leak drove up JVM heap utilization and garbage collection pressure, and eventually caused out-of-memory (OOM) conditions across several core platform services.
No customer data loss occurred.
The root cause was an upstream defect in the OpenTelemetry eBPF instrumentation library that leaks memory under production-scale workloads. Telemetry-related memory consumption grew continuously, leading to sustained JVM garbage collection pressure and eventual heap exhaustion.
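The failure pattern described above (heap usage that keeps rising even after garbage collection) can be sketched as a simple trend check. This is an illustrative sketch only, not the tooling used during the incident: the function names, the sample format, and the idea of feeding it post-GC heap readings from GC logs or metrics are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (timestamp in seconds, heap used after a GC, in bytes).
Sample = Tuple[float, float]


def growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of post-GC heap usage, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return cov / var_t


def seconds_to_exhaustion(
    samples: Sequence[Sample], max_heap_bytes: float
) -> Optional[float]:
    """Project when the heap would fill at the current growth rate.

    Returns None when post-GC usage is flat or shrinking, which is what a
    healthy service is expected to show over a long enough window.
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    latest_used = samples[-1][1]
    return max(0.0, (max_heap_bytes - latest_used) / rate)
```

A service whose post-GC heap floor is stable yields `None`, while a steadily rising floor yields a finite projection that could feed an alert threshold. A real check would need a window long enough to smooth out normal allocation noise.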
Immediate Actions
To prevent such issues from happening again, we are: