Webhooks Are Not Reliable: Designing for Duplicate and Missing Events
Webhook integrations look clean in early builds: the provider sends an event, you receive it, you process it, and you are done.
In production, behavior gets chaotic. Some events never arrive, some arrive twice, others arrive out of order, and many show up late.
You do not control webhook delivery guarantees.
The Problem: Webhooks Are Best-Effort Delivery
Webhooks are just network requests. They can fail, be retried, arrive out of order, or arrive late. Any system that assumes one clean event per state change will eventually corrupt state.
Where It Breaks in Real Systems
1. Duplicate delivery
The provider sends payment.succeeded. You process it, but a slow acknowledgment triggers a provider retry, and the same event arrives again.
Without protection, you credit twice, grant rewards twice, or trigger duplicate workflows.
2. Missing events
Temporary network failures, timeouts, or downtime on your side can outlast the provider's retries. Once the provider gives up, the event is lost.
3. Out-of-order events
You can receive subscription.cancelled before subscription.created. If your logic trusts event order, the final state will be wrong.
4. Delayed delivery
An event happens at 10:00, but the webhook arrives at 10:07. By then your system has already made decisions using incomplete data.
The Core Mistake
Treating webhooks as the source of truth is the core mistake. They are unreliable notifications, nothing more.
The Fix: Design for Unreliable Events
1. Make handlers idempotent
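# Bad: every duplicate delivery grants the reward again
grant_reward(user_id)
# Good: the event ID turns a repeated delivery into a no-op
grant_reward_if_not_already_given(user_id, event_id)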
Enforce this with a unique constraint such as UNIQUE(event_id), so a duplicate insert fails instead of repeating the side effect.
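As a minimal sketch of that constraint in action, the following uses Python's built-in sqlite3 module. The rewards table and its columns are assumptions made for illustration; only grant_reward_if_not_already_given comes from the example above.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a uniqueness guard on event_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rewards ("
        " event_id TEXT NOT NULL UNIQUE,"  # at most one row per provider event
        " user_id TEXT NOT NULL)"
    )
    return conn


def grant_reward_if_not_already_given(conn, user_id: str, event_id: str) -> bool:
    """Grant the reward once per event_id. Returns True only on first delivery."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO rewards (event_id, user_id) VALUES (?, ?)",
                (event_id, user_id),
            )
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the constraint makes it a no-op
    return True


conn = make_db()
first = grant_reward_if_not_already_given(conn, "user_1", "evt_123")
second = grant_reward_if_not_already_given(conn, "user_1", "evt_123")
print(first, second)  # True False
```

The database, not application code, decides whether the event was already seen, so two concurrent deliveries cannot both succeed.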
2. Store every webhook event
Do not execute business logic directly inside the request handler.
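-- Column types are illustrative; adjust them for your database.
CREATE TABLE webhook_events (
    event_id   TEXT PRIMARY KEY,               -- provider's event ID; rejects duplicates
    type       TEXT NOT NULL,                  -- e.g. payment.succeeded
    payload    TEXT NOT NULL,                  -- raw body as received
    processed  BOOLEAN NOT NULL DEFAULT FALSE  -- set once handling succeeds
);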
Persist first, then process asynchronously. This gives replay, auditability, and crash recovery.
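One way to sketch the persist-then-process split, assuming SQLite as the store and events shaped like {"id": ..., "type": ...}. The function names store_event and process_pending and the column types are illustrative choices, not a prescribed design.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webhook_events ("
    " event_id TEXT PRIMARY KEY,"
    " type TEXT NOT NULL,"
    " payload TEXT NOT NULL,"
    " processed INTEGER NOT NULL DEFAULT 0)"
)


def store_event(event: dict) -> bool:
    """Persist the raw event. Returns False if this event_id was already stored."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_events (event_id, type, payload)"
            " VALUES (?, ?, ?)",
            (event["id"], event["type"], json.dumps(event)),
        )
    return cur.rowcount == 1


def process_pending(handler) -> int:
    """Run handler on every unprocessed event; mark each done only on success."""
    rows = conn.execute(
        "SELECT event_id, payload FROM webhook_events WHERE processed = 0"
    ).fetchall()
    done = 0
    for event_id, payload in rows:
        try:
            handler(json.loads(payload))
        except Exception:
            continue  # leave processed = 0 so the next run retries it
        with conn:
            conn.execute(
                "UPDATE webhook_events SET processed = 1 WHERE event_id = ?",
                (event_id,),
            )
        done += 1
    return done
```

Because an event is marked processed only after the handler returns, a crash between the two steps means the event is handled again on the next run. The handler therefore still needs to be idempotent.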
3. Do not trust event order
Use the provider's event timestamps, not arrival order, to decide whether an event is stale, or fetch the authoritative provider state before applying critical updates.
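A minimal sketch of the timestamp guard, assuming each event carries a provider-side occurrence time in epoch seconds. The Subscription class and the occurred_at field are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    status: str = "unknown"
    last_event_at: int = 0  # provider timestamp of the newest applied event


def apply_event(sub: Subscription, status: str, occurred_at: int) -> bool:
    """Apply the event only if it is newer than what was already applied."""
    if occurred_at <= sub.last_event_at:
        return False  # stale or duplicate event: ignore it
    sub.status = status
    sub.last_event_at = occurred_at
    return True


sub = Subscription()
apply_event(sub, "cancelled", occurred_at=200)  # newer event arrives first
apply_event(sub, "active", occurred_at=100)     # older event arrives late
print(sub.status)  # cancelled
```

Timestamps can tie or have coarse resolution, which is why critical updates are safer when they re-fetch the provider's current state instead of relying on this comparison alone.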
4. Build reconciliation
Missing events are guaranteed over time. Periodically fetch provider truth, compare with local state, and repair drift.
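The compare-and-repair step might look like the sketch below. It assumes both sides can be reduced to a mapping from resource ID to status; in a real job, provider_state would come from the provider's list or retrieve API, which is not shown here.

```python
def reconcile(local: dict[str, str], provider: dict[str, str]) -> dict[str, str]:
    """Return the repairs needed to make local state match provider state.

    Keys are resource IDs, values are statuses.
    """
    repairs = {}
    for resource_id, truth in provider.items():
        if local.get(resource_id) != truth:
            repairs[resource_id] = truth  # missing locally or drifted
    return repairs


local_state = {"sub_1": "active", "sub_2": "active"}
provider_state = {"sub_1": "active", "sub_2": "cancelled", "sub_3": "active"}

repairs = reconcile(local_state, provider_state)
local_state.update(repairs)
print(repairs)  # {'sub_2': 'cancelled', 'sub_3': 'active'}
```

Here sub_2 stands for a cancellation webhook that never arrived and sub_3 for a creation webhook that was lost. Logging the repairs is a cheap way to measure how often delivery actually fails.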
5. Acknowledge fast, process later
receive_webhook()
store_event()
return 200
Slow in-request processing causes timeouts and retries, increasing duplicate delivery.
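The three steps above can be sketched as a framework-agnostic function that takes the raw request body and returns an HTTP status code. The in-memory queue is only a stand-in for a durable store, and the rule that events must carry an "id" field is an assumption about the payload shape.

```python
import json
import queue

pending = queue.Queue()  # stand-in for a durable store such as a database table


def receive_webhook(raw_body: bytes) -> int:
    """Do the minimum inside the request: parse, persist, acknowledge."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400
    if not isinstance(event, dict) or "id" not in event:
        return 400  # no event ID means it cannot be deduplicated later
    pending.put(event)  # store only; business logic runs in a separate worker
    return 200


status = receive_webhook(b'{"id": "evt_1", "type": "payment.succeeded"}')
print(status, pending.qsize())  # 200 1
```

Nothing in the request path depends on how long the business logic takes, so the acknowledgment time stays flat even when processing is slow.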
6. Use idempotency on outbound provider calls
Refunds, payments, and state-change APIs should include stable idempotency keys to prevent repeated side effects.
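One way to get a stable key is to derive it from the triggering event and the action, so a retried handler sends the same key. In this sketch NAMESPACE is an arbitrary fixed UUID and FakeProvider is a hypothetical stand-in that mimics a provider deduplicating on the key; real providers document their own header or parameter for this.

```python
import uuid

# Hypothetical application namespace; any fixed UUID works.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def idempotency_key(event_id: str, action: str) -> str:
    """Derive the same key every time the same event triggers the same action."""
    return str(uuid.uuid5(NAMESPACE, f"{action}:{event_id}"))


class FakeProvider:
    """Stand-in for a provider API that deduplicates on idempotency key."""

    def __init__(self) -> None:
        self.refunds: dict[str, str] = {}

    def refund(self, payment_id: str, key: str) -> str:
        # A repeated key returns the original result instead of refunding again.
        return self.refunds.setdefault(key, f"refund_for_{payment_id}")


provider = FakeProvider()
key = idempotency_key("evt_123", "refund")
provider.refund("pay_1", key)
provider.refund("pay_1", key)  # retry after a crash or duplicate webhook
print(len(provider.refunds))  # 1
```

A random key generated per attempt would defeat the purpose, because the retry would look like a new request.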
7. Handle unknown states gracefully
Duplicate events, deleted resources, and odd transitions should be safe no-ops or trigger reconciliation. They should never crash the handler.
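A small sketch of this idea uses a transition table in which anything unexpected keeps the current state and is flagged for reconciliation instead of raising. The statuses and the TRANSITIONS table are hypothetical and would need to match your own domain.

```python
import logging

log = logging.getLogger("webhooks")

# Hypothetical transition table: (current status, event type) -> new status.
TRANSITIONS = {
    ("none", "subscription.created"): "active",
    ("active", "subscription.cancelled"): "cancelled",
}


def next_status(current: str, event_type: str) -> tuple[str, bool]:
    """Return (status, needs_reconcile). Unknown transitions never raise."""
    new = TRANSITIONS.get((current, event_type))
    if new is not None:
        return new, False
    # Duplicate or odd transition: keep the current state and flag it.
    log.warning("unexpected %s while %s", event_type, current)
    return current, True


print(next_status("none", "subscription.created"))    # ('active', False)
print(next_status("none", "subscription.cancelled"))  # ('none', True)
```

The second call models a cancellation arriving before the creation event: local state stays untouched, and the flag tells a reconciliation job to check the provider.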
Pattern That Works in Production
Webhook -> Store -> Process -> Reconcile.
This model gives durability, observability, and correction mechanisms when delivery is imperfect.
What Not To Do
- Do not treat webhooks as ground truth.
- Do not run heavy business logic in the request handler.
- Do not assume ordered delivery.
- Do not skip periodic reconciliation.
The Mental Shift
Stop thinking: Webhooks tell me what happened.
Start thinking: Webhooks tell me something might have happened. Then verify.
Robust systems treat webhook events as hints, verify real state, and recover from delivery failure.