We experienced a service disruption in our production environment, specifically impacting Redis memory usage in our freemium offering.
The core issue was the Redis memory in prod2 (freemium) reaching near-full capacity, which caused failures in dependent services as Redis ran out of memory (OOM). Root cause analysis identified a significant increase in memory consumption by one of the Redis streams, `freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog`, which started consuming an unusually high amount of memory (~6 GB) following the latest release of the idp-service (version 1.6.0). Pipeline-service-related caches were also found to be consuming more memory than anticipated.
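For context, the sketch below shows how per-key memory usage can be surveyed to spot an oversized stream like the one above. It assumes redis-py and direct access to the events framework Redis; the hostname, key pattern, and threshold are illustrative, not the actual prod2 configuration.

```python
import redis

# Hypothetical connection details; the real prod2 events framework
# endpoint and credentials are deployment-specific.
r = redis.Redis(host="redis-events-framework", port=6379, decode_responses=True)

def top_memory_keys(pattern: str = "freemium:streams:*", limit: int = 10):
    """Return the keys matching `pattern` that consume the most memory."""
    usage = []
    for key in r.scan_iter(match=pattern, count=500):
        size = r.memory_usage(key)  # bytes, as reported by MEMORY USAGE
        if size:
            usage.append((size, key))
    return sorted(usage, reverse=True)[:limit]

if __name__ == "__main__":
    for size, key in top_memory_keys():
        print(f"{size / (1024 ** 3):6.2f} GB  {key}")
```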
| Time (IST) | Event |
|---|---|
| 1st March, 11:45 PM | STO uptime monitoring failed with Redis OOM |
| 1st March, 11:53 PM | FH triggered |
| 1st March, 11:54 PM | Pipeline failures reported due to Redis OOM |
| 2nd March, 2:02 AM | Redis events framework database memory increased by 25% |
| 2nd March, 2:03 AM | Issue resolved after the memory increase |
| 3rd March, 12:36 AM | Debezium service bounced with an updated config that disabled streaming of the IDP Mongo collections |
| 3rd March, 1:11 AM | Stream `freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog` trimmed in prod2 to reclaim memory |
The immediate resolution involved increasing the memory allocated to the Redis events framework database by 25% and disabling the stream flow that was consuming excessive memory. Together, these actions resolved the incident within two hours.
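As an illustration of the trim step from the timeline, the sketch below uses redis-py to trim the oversized stream and confirm the reclaimed memory. The stream name is taken from the incident; the endpoint and the target length of 10,000 entries are hypothetical, not the values actually used.

```python
import redis

STREAM = "freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog"

r = redis.Redis(host="redis-events-framework", port=6379)  # hypothetical endpoint

before = r.memory_usage(STREAM) or 0
entries = r.xlen(STREAM)
print(f"before trim: {entries} entries, {before / (1024 ** 3):.2f} GB")

# Approximate trim (XTRIM ... MAXLEN ~ 10000) keeps roughly the newest
# 10,000 entries; the '~' form lets Redis trim on macro-node boundaries,
# which is much cheaper than an exact trim.
removed = r.xtrim(STREAM, maxlen=10_000, approximate=True)

after = r.memory_usage(STREAM) or 0
print(f"removed {removed} entries, now {after / (1024 ** 3):.2f} GB")
```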
Following this incident, we are taking several steps to prevent recurrence; these focus on the `backstageCatalog` stream when published to Redis, as well as the `webhook_events_stream` and `git_push_event_stream`.
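One preventive option, sketched below under the assumption that producers use redis-py, is to bound stream growth at publish time so a single producer can no longer exhaust the database. The endpoint, cap, and payload are illustrative; this is not the team's confirmed implementation.

```python
import redis

r = redis.Redis(host="redis-events-framework", port=6379)  # hypothetical endpoint

def publish_event(stream: str, event: dict, cap: int = 100_000):
    """Append an event, capping the stream at roughly `cap` entries.

    XADD ... MAXLEN ~ cap trims old entries as new ones arrive, so the
    stream's memory footprint stays bounded even if consumers fall behind.
    Returns the ID of the newly added entry.
    """
    return r.xadd(stream, event, maxlen=cap, approximate=True)

# Example usage with an illustrative payload.
publish_event("freemium:streams:webhook_events_stream", {"payload": "{...}"})
```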