Degraded Performance Observed for Prod3.

Incident Report for Harness

Postmortem

Summary

Between 2025-09-27 19:31:22 PDT and 2025-09-27 20:14:00 PDT, the secondary database node in Prod3 was unavailable. The issue was mitigated by restarting the affected nodes.

Root Cause

A maintenance activity aimed at reducing fragmentation and stale state buildup inadvertently caused resource pressure on the system.

Impact

Customers were unable to run pipelines during the outage.

Remediation

Immediate: Restarted the nodes and rolled back recent changes to restore service.

Permanent: Ongoing improvements to prevent recurrence.

Action Items

  • Strengthen defensive checks: Implement granular memory monitoring metrics to identify fragmentation early.
  • Enhance monitoring & alerting: Add targeted health checks that detect similar issues at lower thresholds, before they impact customers or applications.
Posted Oct 08, 2025 - 11:52 PDT

Resolved

Pipelines are running now.
Posted Sep 27, 2025 - 23:00 PDT

Monitoring

Pipelines are now back up and running.
Posted Sep 27, 2025 - 20:32 PDT

Identified

We have identified the issue. We are working on a fix now.
Posted Sep 27, 2025 - 20:06 PDT

Investigating

We are currently investigating this issue.
Posted Sep 27, 2025 - 19:45 PDT
This incident affected: Prod 3 (Continuous Delivery (CD) - FirstGen - EOS, Continuous Delivery - Next Generation (CDNG), Cloud Cost Management (CCM), Continuous Error Tracking (CET), Continuous Integration Enterprise (CIE) - Self Hosted Runners, Continuous Integration Enterprise (CIE) - Mac Cloud Builds, Continuous Integration Enterprise (CIE) - Windows Cloud Builds, Continuous Integration Enterprise (CIE) - Linux Cloud Builds, Custom Dashboards, Feature Flags (FF), Security Testing Orchestration (STO), Service Reliability Management (SRM), Chaos Engineering, Internal Developer Portal (IDP), Infrastructure as Code Management (IaCM), Software Supply Chain Assurance (SSCA), Software Engineering Insights (SEI), Code Repository, Artifact Registry, Platform).