Degradation in sync executions
Resolved
Apr 14, 2026 at 8:00am UTC
Post-Incident Summary
Date: 12 April 2026
Impact: Degraded sync execution; delayed actions and webhook processing
Status: Resolved
Summary
A webhook flood originating from a single customer environment saturated one of our databases, resulting in broad degradation of asynchronous job processing. Sync execution dropped to near zero, and a large portion of actions and webhook-driven work was delayed or unable to run. A secondary bug in the scheduling system amplified the incident and blocked two consecutive recovery attempts before a fix was deployed.
Timeline (UTC)
- Issue began: 07:00
- Detected by monitoring: 07:00
- Status page updated: 07:00
- Mitigated: 15:30
- Resolved: 15:55
Root Cause
A single customer environment generated a sustained webhook flood, well above the typical baseline. Each incoming webhook triggered a database query to check the current queue depth for that customer's group before deciding whether to admit a new task. Under flood conditions, this query saturated the CPU of one of our databases, preventing other work — including syncs and actions from all customers — from being scheduled or processed.
Once the per-group queue cap was reached, new work could no longer be enqueued, and the system remained effectively stalled.
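The expensive pattern described above can be sketched as follows. This is an illustrative reconstruction, not our actual code: the `tasks` table, `group_id` column, and cap value are all assumptions.

```python
# Sketch of a per-enqueue queue-depth admission check.
# Schema and names are hypothetical stand-ins for the real system.
import sqlite3

QUEUE_CAP = 1000  # assumed per-group cap


def admit_webhook(conn: sqlite3.Connection, group_id: str) -> bool:
    """Decide whether to admit a new task for this group.

    Note that this runs a COUNT query for every incoming webhook —
    under a webhook flood, this per-event query is exactly the kind
    of load that can saturate a database's CPU.
    """
    (depth,) = conn.execute(
        "SELECT COUNT(*) FROM tasks WHERE group_id = ? AND state = 'queued'",
        (group_id,),
    ).fetchone()
    return depth < QUEUE_CAP
```

Because the check is issued once per event rather than cached or batched, its cost scales linearly with the flood itself.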
Recovery was complicated by a separate bug in the recurring schedule path. When the scheduler encountered a group that had already hit the queue cap, an error in the code caused the exception to be swallowed silently. As a result, affected schedules were never marked as processed and were repeatedly retried on each scheduler tick, adding further load to an already saturated database. This caused two consecutive recovery attempts to fail.
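The failure mode in the recurring-schedule path looked roughly like the following sketch. The schedule model and `enqueue_task` hook are hypothetical; only the shape of the bug reflects what happened.

```python
# Illustrative sketch of the scheduler bug; names are hypothetical.
class QueueCapReached(Exception):
    """Raised when a group's queue has hit its cap."""


def run_scheduler_tick(schedules, enqueue_task):
    """One scheduler pass over recurring schedules (buggy version)."""
    for schedule in schedules:
        try:
            enqueue_task(schedule)
        except QueueCapReached:
            # BUG: swallowing the exception here also skips the
            # bookkeeping below, so a capped schedule is never marked
            # processed and is retried on every subsequent tick —
            # adding load to an already saturated database.
            continue
        schedule["processed"] = True
```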
Resolution
- A fix was deployed to correct the scheduling bug, ensuring that capped groups are handled correctly and schedules are properly advanced after each pass.
- Task execution times were shifted forward in bulk to drain pressure from the database, then restored in batches.
- Once the backlog cleared, the system returned to a healthy state and full processing resumed by 15:55 UTC.
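The corrected handling can be sketched like this, using the same hypothetical schedule model: hitting the cap skips the occurrence but still advances the schedule, so it is not retried on every tick.

```python
# Illustrative sketch of the corrected capped-group handling.
# The schedule model and field names are assumptions, not real code.
from datetime import datetime, timedelta, timezone


class QueueCapReached(Exception):
    """Raised when a group's queue has hit its cap."""


def run_scheduler_tick(schedules, enqueue_task, now=None):
    """One scheduler pass over recurring schedules (fixed version)."""
    now = now or datetime.now(timezone.utc)
    for schedule in schedules:
        try:
            enqueue_task(schedule)
        except QueueCapReached:
            # The group is full: drop this occurrence, but fall
            # through to the bookkeeping so the schedule advances.
            pass
        schedule["next_run"] = now + timedelta(seconds=schedule["interval_s"])
```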
Follow-Up Actions
System safeguards
- Improve the current per-enqueue queue-depth admission-control mechanism to reduce database load under flood conditions.
- Define a rate limiting and load shedding strategy for webhook ingestion to protect the platform when a single customer generates sustained enqueue pressure.
- Fix the scheduling bug to correctly handle capped groups without silent failures (completed).
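One common shape for the rate-limiting safeguard above is a per-customer token bucket at the ingestion edge. This is a minimal sketch of the general technique, not the platform's actual design; the rate and burst values are placeholders.

```python
# Minimal per-customer token-bucket sketch for webhook ingestion.
# Thresholds and naming are assumptions for illustration only.
import time


class TokenBucket:
    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s      # tokens refilled per second
        self.capacity = burst       # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise shed the event."""
        now = time.monotonic()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.updated) * self.rate,
        )
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed: reject or defer the webhook
```

Keeping the bucket state in memory (or a fast cache) means the admission decision no longer requires a database query per event, which addresses the load pattern seen in this incident.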
Affected services
Updated
Apr 12, 2026 at 3:55pm UTC
We are seeing full recovery over the last 10 minutes and will continue monitoring our metrics.
Updated
Apr 12, 2026 at 3:37pm UTC
We have found an infrastructure issue that may have been preventing recovery and have already fixed it. We are seeing partial recovery on syncs and actions and continue moving traffic over to bring the system to full recovery.
Updated
Apr 12, 2026 at 2:45pm UTC
We have not yet fully recovered, as some underlying infrastructure issues caused by high load remain unaddressed. We are working on fully reducing the load.
Updated
Apr 12, 2026 at 1:13pm UTC
We are seeing partial recovery for actions and syncs and are still working towards a full recovery.
Updated
Apr 12, 2026 at 12:01pm UTC
We are seeing some partial recovery for actions but syncs remain fully impacted. We are working on reducing the load in one of our infrastructure components.
Updated
Apr 12, 2026 at 10:50am UTC
Syncs and actions remain impacted. We've identified a load-related issue in our infrastructure and are actively working on mitigation.
Updated
Apr 12, 2026 at 9:49am UTC
We have deployed a first step to mitigate the issue but are not seeing recovery yet; we continue investigating how to speed up recovery.
Updated
Apr 12, 2026 at 9:05am UTC
We are still investigating the issues impacting syncs and actions, and working on early mitigation steps.
Created
Apr 12, 2026 at 8:00am UTC
Syncs are still delayed, while actions appear unaffected. The issue seems to be related to how the database is handling sync schedules. Once sync processing recovers, synced data will catch up automatically.