Degraded

Degradation in sync executions

Apr 12, 2026 at 8:00am UTC
Affected services
Nango Cloud Health

Resolved
Apr 14, 2026 at 8:00am UTC

Post-Incident Summary

Date: 12 April 2026

Impact: Degraded sync execution; delayed actions and webhook processing

Status: Resolved

Summary

A webhook flood originating from a single customer environment saturated one of our databases, resulting in broad degradation of asynchronous job processing. Sync execution dropped to near zero, and a large portion of actions and webhook-driven work was delayed or unable to run. A secondary bug in the scheduling system amplified the incident and blocked two consecutive recovery attempts before a fix was deployed.

Timeline (UTC)

  • Issue began: 07:00
  • Detected by monitoring: 07:00
  • Status page updated: 08:00
  • Mitigated: 15:30
  • Resolved: 15:55

Root Cause

A single customer environment generated a sustained webhook flood, well above the typical baseline. Each incoming webhook triggered a database query to check the current queue depth for that customer's group before deciding whether to admit a new task. Under flood conditions, this query saturated the CPU of one of our databases, preventing other work — including syncs and actions from all customers — from being scheduled or processed.

Once the per-group queue cap was reached, new work could no longer be enqueued, and the system remained effectively stalled.
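The admission pattern described above can be sketched as follows. The table layout, cap value, and function names are illustrative assumptions, not Nango's actual schema or code; the point is that the depth check runs once per incoming webhook:

```python
import sqlite3

GROUP_QUEUE_CAP = 10_000  # assumed per-group cap, for illustration

def admit_task(conn, group_id):
    """Return True if the group is below its queue cap.

    This per-enqueue COUNT(*) is the query pattern that, executed once
    per incoming webhook, saturated database CPU during the incident.
    """
    (depth,) = conn.execute(
        "SELECT COUNT(*) FROM tasks WHERE group_id = ? AND state = 'queued'",
        (group_id,),
    ).fetchone()
    return depth < GROUP_QUEUE_CAP

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (group_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO tasks VALUES (?, 'queued')",
    [("flooding-group",)] * 3,
)
print(admit_task(conn, "flooding-group"))  # True: 3 < cap
```

A count query like this is cheap in isolation but has no natural back-pressure: the heavier the flood, the more often it runs, exactly when the database can least afford it.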

Recovery was complicated by a separate bug in the recurring schedule path. When the scheduler encountered a group that had already hit the queue cap, an error in the code caused the exception to be swallowed silently. As a result, affected schedules were never marked as processed and were repeatedly retried on each scheduler tick, adding further load to an already saturated database. This caused two consecutive recovery attempts to fail.
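A minimal reconstruction of that failure mode, with assumed names (the exception type and a `processed` flag) since the actual scheduler code is not shown here:

```python
# Illustrative reconstruction of the scheduler bug; not actual code.

class QueueCapReached(Exception):
    """Raised when a group's task queue is already at its cap."""

def tick_buggy(schedules, enqueue):
    for schedule in schedules:
        try:
            enqueue(schedule)
        except QueueCapReached:
            # Bug: the exception is swallowed and the schedule is never
            # marked as processed, so every scheduler tick retries it
            # against the already saturated database.
            pass

def tick_fixed(schedules, enqueue):
    for schedule in schedules:
        try:
            enqueue(schedule)
        except QueueCapReached:
            pass  # capped group: skip its work for this pass
        finally:
            # Fix: always advance the schedule so it is not retried.
            schedule["processed"] = True

def enqueue_into_capped_group(schedule):
    raise QueueCapReached  # every group is at its cap in this demo

schedules = [{"id": 1, "processed": False}]
tick_buggy(schedules, enqueue_into_capped_group)
print(schedules[0]["processed"])  # False: retried on every tick
tick_fixed(schedules, enqueue_into_capped_group)
print(schedules[0]["processed"])  # True: retry loop broken
```

The buggy path leaves the schedule in its original state, so the retry load compounds tick after tick, which is why the stall persisted through two recovery attempts.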

Resolution

  • A fix was deployed to correct the scheduling bug, ensuring that capped groups are handled correctly and schedules are properly advanced after each pass.
  • Task execution times were shifted forward in bulk to drain pressure from the database, then restored in batches.
  • Once the backlog cleared, the system returned to a healthy state and full processing resumed by 15:55 UTC.
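The shift-then-restore maneuver from the second step can be simulated in a few lines. The one-hour shift, batch size, and integer timestamps are illustrative assumptions:

```python
# Simulation of the shift-then-restore recovery maneuver.

SHIFT_SECONDS = 3600  # assumed: push all due times one hour out

def shift_forward(tasks, now):
    """Bulk-move every pending task's due time into the future so the
    scheduler stops picking up work and the database can drain."""
    for task in tasks:
        task["due_at"] = now + SHIFT_SECONDS

def restore_in_batches(tasks, now, batch_size):
    """Make tasks due again one batch at a time, releasing load
    gradually instead of re-admitting the whole backlog at once."""
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i + batch_size]
        for task in batch:
            task["due_at"] = now
        yield batch

tasks = [{"id": n, "due_at": 0} for n in range(5)]
shift_forward(tasks, now=100)
batches = list(restore_in_batches(tasks, now=100, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

In a real recovery the operator would pause between batches and watch database load before releasing the next one; the sketch only shows the batching structure.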

Follow-Up Actions

System safeguards

  • Improve the current per-enqueue queue-depth admission-control mechanism to reduce database load under flood conditions.
  • Define a rate limiting and load shedding strategy for webhook ingestion to protect the platform when a single customer generates sustained enqueue pressure.
  • Fix the scheduling bug to correctly handle capped groups without silent failures (completed).
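One common shape for the planned rate limiting and load shedding is a per-customer token bucket that rejects webhooks once a customer's budget is exhausted. The rates, burst size, and in-memory bucket map below are assumptions for illustration, not a committed design:

```python
import time

class TokenBucket:
    """Per-customer token bucket (hypothetical safeguard sketch)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # sustained webhooks per second
        self.capacity = burst         # short-term burst allowance
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, up to the burst cap.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed: reject instead of enqueueing more work

buckets = {}

def accept_webhook(customer_id, rate_per_sec=100, burst=200):
    bucket = buckets.setdefault(customer_id,
                                TokenBucket(rate_per_sec, burst))
    return bucket.allow()
```

Shedding at ingestion keeps a single flooding customer from consuming database capacity at all, rather than detecting the saturation after the work is already enqueued.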

Updated
Apr 12, 2026 at 3:55pm UTC

We have seen full recovery over the last 10 minutes and will continue monitoring our metrics.

Updated
Apr 12, 2026 at 3:37pm UTC

We have found an infrastructure issue that may have been preventing recovery and have already fixed it. We are seeing partial recovery on syncs and actions and continue moving traffic over to bring the system to full recovery.

Updated
Apr 12, 2026 at 2:45pm UTC

We have not yet fully recovered: some underlying infrastructure issues caused by high load remain unaddressed. We are working towards a solution that fully reduces the load.

Updated
Apr 12, 2026 at 1:13pm UTC

We are seeing partial recovery for actions and syncs and are still working towards a full recovery.

Updated
Apr 12, 2026 at 12:01pm UTC

We are seeing some partial recovery for actions but syncs remain fully impacted. We are working on reducing the load in one of our infrastructure components.

Updated
Apr 12, 2026 at 10:50am UTC

Syncs and actions remain impacted. We've identified a load-related issue in our infrastructure and are actively working on mitigation.

Updated
Apr 12, 2026 at 9:49am UTC

We have deployed a first step to mitigate the issue but are not seeing recovery yet; we continue investigating how to speed up recovery.

Updated
Apr 12, 2026 at 9:05am UTC

We are still investigating the issues impacting syncs and actions, and working on early mitigation steps.

Created
Apr 12, 2026 at 8:00am UTC

Syncs are still delayed, while actions appear unaffected. The issue seems to be related to how the database is handling sync schedules. Once sync processing recovers, synced data will catch up automatically.