Degradation in active pipelines observed and actively being debugged
Incident Report for Harness
Postmortem

Overview

We experienced a service disruption in our production environment, caused by Redis memory exhaustion in our freemium (prod2) offering.

What was the issue?

The core issue was that Redis memory in prod2 (freemium) reached near-full capacity. This led to operational failures in dependent services, primarily due to Redis running out of memory (OOM). Root cause analysis identified a significant increase in memory consumption by one of the Redis streams (freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog), which began consuming an unusually high amount of memory (~6 GB) following the latest release of the idp-service (version 1.6.0). In addition, pipeline service-related caches were found to be consuming more memory than anticipated.
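For illustration, the kind of per-key inspection that surfaces a finding like this can be done with Redis's MEMORY USAGE and XLEN commands. The sketch below uses the redis-py client; the connection details and key pattern are placeholders, not taken from our tooling.

    import redis

    # Placeholder connection details for the affected prod2 (freemium) Redis instance.
    r = redis.Redis(host="redis.prod2.example.internal", port=6379)

    # Scan for Debezium-backed stream keys and report each one's memory footprint and length.
    for key in r.scan_iter(match="freemium:streams:DEBEZIUM_*", count=1000):
        bytes_used = r.memory_usage(key) or 0
        entries = r.xlen(key)
        print(f"{key.decode()}: {bytes_used / 1024 ** 2:.1f} MiB across {entries} entries")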

Timeline

Time (IST)           Event
1st March, 11:45 PM  STO uptime monitoring failed with Redis OOM
1st March, 11:53 PM  FH triggered
1st March, 11:54 PM  Pipeline failures reported due to Redis OOM
2nd March, 2:02 AM   Redis events framework database memory was increased by 25%
2nd March, 2:03 AM   Issue resolved after the memory increase
3rd March, 12:36 AM  Debezium service bounced with an updated config that disabled streaming of the IDP Mongo collections
3rd March, 1:11 AM   Stream "freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog" was trimmed in prod2 to reclaim memory

Resolution

The immediate resolution involved increasing the memory allocated to the Redis events framework database by 25%, which restored normal operation within roughly two hours. As a follow-up, the Debezium flow feeding the offending stream was disabled and the stream itself was trimmed in prod2 to reclaim the consumed memory.
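As noted in the timeline, the oversized stream was trimmed in prod2 to reclaim memory. A minimal sketch of such a trim with redis-py follows; the connection details and retention length are placeholders, since the exact values used during the incident are not recorded in this report.

    import redis

    # Placeholder connection details for the prod2 (freemium) Redis instance.
    r = redis.Redis(host="redis.prod2.example.internal", port=6379)

    STREAM = "freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog"

    # Approximate trimming (MAXLEN ~) lets Redis drop whole internal nodes, which is
    # cheaper than an exact trim. The 10,000-entry cap is an illustrative value only.
    removed = r.xtrim(STREAM, maxlen=10_000, approximate=True)
    print(f"Trimmed {removed} entries from {STREAM}")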

Action Items

Following this incident, we are taking several steps to prevent recurrence:

  • Implement rigorous validation of changes with respect to Redis memory usage in both QA and PROD environments with each release.
  • Investigate and rectify the increased message size issue in the backstageCatalog stream when published to Redis.
  • Establish alerts for individual streams to promptly notify the relevant teams; a sketch of such a per-stream check follows this list.
  • The Pipeline team will conduct a thorough review of streams related to their services, including the webhook_events_stream and git_push_event_stream.
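
A per-stream check of the kind the alerting item above describes could look like the sketch below; the threshold, the stream key names other than the backstageCatalog one, and the notification mechanism are placeholders, since the actual monitoring stack is not covered in this report.

    import redis

    # Placeholder connection details and threshold; a real check would publish to the
    # team's monitoring/alerting stack instead of printing.
    r = redis.Redis(host="redis.prod2.example.internal", port=6379)
    THRESHOLD_BYTES = 1 * 1024 ** 3  # flag any single stream above ~1 GiB

    WATCHED_STREAMS = [
        "freemium:streams:DEBEZIUM_idpMongo.idp-harness.backstageCatalog",
        "webhook_events_stream",    # name taken from the action items; actual key may differ
        "git_push_event_stream",    # name taken from the action items; actual key may differ
    ]

    for stream in WATCHED_STREAMS:
        usage = r.memory_usage(stream) or 0
        if usage > THRESHOLD_BYTES:
            print(f"ALERT: {stream} is using {usage / 1024 ** 2:.0f} MiB")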
Posted Mar 06, 2024 - 20:18 PST

Resolved
We can confirm normal operation.
We will continue to monitor and ensure stability.
Posted Mar 01, 2024 - 12:31 PST
Monitoring
Service issues have been addressed and normal operations have resumed. We are monitoring the service to ensure normal performance continues.
Posted Mar 01, 2024 - 12:19 PST
Update
We are continuing to work on a fix for this issue.
Posted Mar 01, 2024 - 12:15 PST
Identified
The resource constraint has been identified and we are working to mitigate the situation.
Posted Mar 01, 2024 - 11:14 PST
Investigating
We are debugging an incident that is potentially impacting pipelines due to a core DB component.

The issue started at 10:05 AM PT and the team is currently working to identify the root cause.
Posted Mar 01, 2024 - 10:40 PST
This incident affected: Prod 2 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Cloud Builds) and Prod 1 (Continuous Delivery - Next Generation (CDNG), Continuous Integration Enterprise (CIE) - Cloud Builds).