CCM Asset Governance slow performance
Incident Report for Harness
Postmortem

Incident Summary: Evaluation of Asset Governance Rules was delayed by a build-up in the job queue, which temporarily slowed rule execution.

Timeline:

  • 2024-01-04 06:18 PM UTC: Incident reported.
  • 2024-01-04 06:20 PM UTC: Incident acknowledged; investigation initiated.
  • 2024-01-04 06:20 PM UTC: Root cause identified.
  • 2024-01-04 06:39 PM UTC: Immediate resolution applied to expedite job processing.
  • 2024-01-04 06:48 PM UTC: Queue size normalized, incident resolved.

Root Cause Analysis: The delay was traced to a build-up in the job queue used by the CCM Asset Governance feature. The feature executes rules asynchronously: each rule execution is enqueued as a job, and workers dequeue jobs from the queue to perform the actual rule evaluations.
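
For readers unfamiliar with the pattern, the enqueue/dequeue model described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the actual CCM implementation; `evaluate_rule`, `worker`, and `run_rules` are hypothetical names introduced for this example.

```python
import queue
import threading


def evaluate_rule(rule_name):
    # Hypothetical stand-in for a real rule evaluation.
    return f"evaluated:{rule_name}"


def worker(jobs, results, lock):
    # Each worker dequeues jobs until it receives the None sentinel.
    while True:
        rule_name = jobs.get()
        if rule_name is None:
            jobs.task_done()
            break
        outcome = evaluate_rule(rule_name)
        with lock:
            results.append(outcome)
        jobs.task_done()


def run_rules(rule_names, worker_count):
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(worker_count)
    ]
    for t in threads:
        t.start()
    # Producers only enqueue; evaluation happens asynchronously in the workers.
    for name in rule_names:
        jobs.put(name)
    # One sentinel per worker so every thread shuts down cleanly.
    for _ in threads:
        jobs.put(None)
    jobs.join()
    for t in threads:
        t.join()
    return results
```

If jobs are enqueued faster than the workers can evaluate them, the queue grows and each job waits longer before it is picked up, which is the kind of slowness described in this report.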

Analysis: The queue build-up was most pronounced for specific types of evaluations, and customers noticed slowness in Asset Governance execution.

Immediate Resolution: To promptly address the issue, the team increased the replica count for the services involved, facilitating quicker job consumption from the queue.
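
As a rough illustration of why adding replicas helps, the time to clear a backlog can be estimated from the net consumption rate. The function below is a back-of-the-envelope sketch; `drain_time_seconds` and all of its parameters are hypothetical, and the numbers in the usage note are illustrative, not measurements from this incident.

```python
import math


def drain_time_seconds(backlog, arrival_rate, replicas, per_replica_rate):
    """Estimate seconds needed to clear `backlog` queued jobs.

    Assumes a constant arrival rate (jobs/s) and a constant per-replica
    processing rate (jobs/s). Returns math.inf when the replicas cannot
    consume jobs faster than they arrive, so the queue never drains.
    """
    net_rate = replicas * per_replica_rate - arrival_rate
    if net_rate <= 0:
        return math.inf
    return backlog / net_rate
```

With illustrative inputs of 600 queued jobs, 5 new jobs per second, and 2.5 jobs per second per replica, two replicas only keep pace with arrivals and the backlog never clears, while four replicas would clear it in about 120 seconds under these assumptions.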

Total Downtime: There was no downtime during the incident.

Follow-up Actions:

  1. Implementation of separate queues for ad-hoc queries and enforcements/recommendations.
  2. Enhanced telemetry and metrics monitoring, including alerts on queue lengths for various types.
  3. Ongoing investigation to improve asynchronous job execution for faster evaluations.
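
The first two follow-up actions can be pictured with a small sketch: one queue per job type, plus a check that flags any queue whose length exceeds a threshold. This is an assumption-laden illustration in Python, not the planned design; the `TypedQueues` class, the job type names, and the thresholds are all hypothetical.

```python
from collections import deque


class TypedQueues:
    """Keeps one FIFO queue per job type with a per-type alert threshold."""

    def __init__(self, alert_thresholds):
        # alert_thresholds maps job type -> maximum acceptable queue length.
        self.alert_thresholds = dict(alert_thresholds)
        self.queues = {job_type: deque() for job_type in self.alert_thresholds}

    def enqueue(self, job_type, job):
        self.queues[job_type].append(job)

    def dequeue(self, job_type):
        # Returns None when the queue for this job type is empty.
        q = self.queues[job_type]
        return q.popleft() if q else None

    def alerts(self):
        # Job types whose queue length currently exceeds the threshold.
        return sorted(
            job_type
            for job_type, q in self.queues.items()
            if len(q) > self.alert_thresholds[job_type]
        )
```

With separate queues, a burst of one job type (for example, ad-hoc queries) fills only its own queue and triggers only its own alert, rather than delaying enforcements or recommendations that would otherwise wait behind it.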
Posted Jan 09, 2024 - 14:36 PST

Resolved
This incident has been resolved.
Posted Jan 04, 2024 - 10:49 PST
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jan 04, 2024 - 10:40 PST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 04, 2024 - 10:30 PST
Investigating
We are currently investigating this issue.
Posted Jan 04, 2024 - 10:20 PST
This incident affected: Prod 2 (Cloud Cost Management (CCM)).