Webhooks Are Not Reliable: Designing for Duplicate and Missing Events

Webhook integrations look clean in early builds: the provider sends an event, you receive it, you process it, and you are done.

In production, behavior gets chaotic. Some events never arrive, some arrive twice, others arrive out of order, and many show up late.

You do not control webhook delivery guarantees.

The Problem: Webhooks Are Best-Effort Delivery

Webhooks are just network requests. They fail, retry, reorder, and delay. Any system that assumes one clean event per state change will eventually corrupt state.

Where It Breaks in Real Systems

1. Duplicate delivery

The provider sends payment.succeeded. You process it, but your acknowledgment is slow, so the provider assumes the delivery failed and retries. The same event arrives again.

Without protection, you credit twice, grant rewards twice, or trigger duplicate workflows.

2. Missing events

Network failures, timeouts, or downtime on your side can outlast the provider's retry attempts. Once the retries are exhausted, the event is never delivered.

3. Out-of-order events

You can receive subscription.cancelled before subscription.created. If your logic trusts arrival order, the final state is wrong: the cancelled subscription ends up looking newly created.

4. Delayed delivery

The event happens at 10:00, but the webhook arrives at 10:07. By then your system has already made decisions using incomplete data.

The Core Mistake

The failure is treating webhooks as the source of truth. They are notifications, and unreliable ones.

The Fix: Design for Unreliable Events

1. Make handlers idempotent

-- Bad: every delivery of the event grants another reward
grant_reward(user_id)

-- Good: a repeated event_id is a no-op
grant_reward_if_not_already_given(user_id, event_id)

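Enforce this in the database with a unique constraint such as UNIQUE(event_id). A check-then-insert in application code can race when two deliveries arrive at the same time; the constraint cannot.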
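A minimal sketch of the unique-constraint approach, using Python's built-in sqlite3 module. The granted_rewards table and the function body are assumptions made for illustration; only the function name comes from the pseudocode above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE granted_rewards (
        event_id TEXT NOT NULL UNIQUE,  -- one row per provider event
        user_id  TEXT NOT NULL
    )
    """
)


def grant_reward_if_not_already_given(conn, user_id, event_id):
    """Grant the reward once per event_id. Returns True if newly granted."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO granted_rewards (event_id, user_id) VALUES (?, ?)",
                (event_id, user_id),
            )
            # Hypothetical: apply the actual reward here, in the same transaction.
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a duplicate delivery.
        return False
    return True


first = grant_reward_if_not_already_given(conn, "user_1", "evt_123")
second = grant_reward_if_not_already_given(conn, "user_1", "evt_123")
```

Here first is True and second is False: the second delivery of evt_123 hits the constraint and changes nothing.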

2. Store every webhook event

Do not execute business logic directly inside the request handler.

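webhook_events (
  event_id,   -- provider's event ID, unique
  type,       -- for example payment.succeeded
  payload,    -- raw request body, kept for replay
  processed   -- set once business logic has run
)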

Persist first, then process asynchronously. This gives replay, auditability, and crash recovery.
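One way to sketch persist-then-process, again with sqlite3. The column types, the store_event and process_pending helpers, and the shape of the event dictionary are assumptions, not a provider's actual format.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE webhook_events (
        event_id  TEXT PRIMARY KEY,
        type      TEXT NOT NULL,
        payload   TEXT NOT NULL,
        processed INTEGER NOT NULL DEFAULT 0
    )
    """
)


def store_event(conn, event):
    """Persist the raw event. A redelivered event_id is ignored."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO webhook_events (event_id, type, payload) "
            "VALUES (?, ?, ?)",
            (event["id"], event["type"], json.dumps(event)),
        )


def process_pending(conn, handle):
    """Run handle() on each unprocessed event, then mark it done."""
    rows = conn.execute(
        "SELECT event_id, payload FROM webhook_events WHERE processed = 0"
    ).fetchall()
    for event_id, payload in rows:
        handle(json.loads(payload))
        with conn:
            conn.execute(
                "UPDATE webhook_events SET processed = 1 WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)


handled = []
store_event(conn, {"id": "evt_1", "type": "payment.succeeded"})
store_event(conn, {"id": "evt_1", "type": "payment.succeeded"})  # duplicate
first_pass = process_pending(conn, handled.append)
second_pass = process_pending(conn, handled.append)
```

If handle() raises, the row stays unprocessed and is picked up on the next pass, which is why the handler itself still needs to be idempotent.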

3. Do not trust event order

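Do not apply events in arrival order. Either compare each event's provider timestamp with the newest one already applied to that resource and skip anything older, or fetch the authoritative state from the provider before applying a critical update.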
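A small sketch of the timestamp approach. It assumes each event carries a provider-side creation timestamp and the resulting status; the field names created and status are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    status: str = "unknown"
    last_event_at: int = 0  # provider timestamp of the newest event applied


def apply_event(sub, event):
    """Apply an event only if it is newer than what was already applied."""
    if event["created"] <= sub.last_event_at:
        return False  # stale or duplicate: ignore
    sub.status = event["status"]
    sub.last_event_at = event["created"]
    return True


sub = Subscription()
# The cancellation (t=200) is delivered before the creation (t=100).
applied_cancel = apply_event(sub, {"created": 200, "status": "cancelled"})
applied_create = apply_event(sub, {"created": 100, "status": "active"})
```

The late creation event is dropped and the subscription stays cancelled. Timestamps can tie or be coarse, so for critical updates fetching the provider's current state is the safer path.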

4. Build reconciliation

Over a long enough period, some events will go missing. Periodically fetch the provider's current state, compare it with local state, and repair any drift.
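The comparison step can be sketched as a plain diff. In this illustration two dictionaries stand in for the provider's list API and the local database; the resource IDs and statuses are made up.

```python
def reconcile(provider_state, local_state):
    """Return the repairs needed to make local_state match provider_state.

    Both arguments map a resource ID to its status.
    """
    repairs = {}
    for resource_id, status in provider_state.items():
        if local_state.get(resource_id) != status:
            repairs[resource_id] = status
    return repairs


provider_state = {"sub_1": "active", "sub_2": "cancelled", "sub_3": "active"}
local_state = {"sub_1": "active", "sub_2": "active"}  # two events were missed

repairs = reconcile(provider_state, local_state)
local_state.update(repairs)
```

Logging the contents of repairs on every run is a cheap way to measure how often delivery actually fails.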

5. Acknowledge fast, process later

receive_webhook()
store_event()
return 200

Slow in-request processing causes timeouts and retries, increasing duplicate delivery.
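A rough sketch of the split between the request path and the processing path, using only the standard library. The in-memory queue is an assumption that keeps the example short; it loses events on a crash, so a real store_event() would write to durable storage instead.

```python
import queue
import threading

events = queue.Queue()
processed = []


def receive_webhook(body):
    """Request handler: enqueue and acknowledge at once, no business logic."""
    events.put(body)
    return 200


def worker():
    """Background worker: does the slow processing outside the request path."""
    while True:
        body = events.get()
        try:
            if body is None:  # sentinel used to stop the worker
                return
            processed.append(body)  # stand-in for slow business logic
        finally:
            events.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = receive_webhook({"id": "evt_1", "type": "payment.succeeded"})
events.join()     # wait for the worker to drain the queue (demo only)
events.put(None)  # stop the worker
thread.join()
```

The handler's response time no longer depends on how long the business logic takes, which is what keeps the provider from timing out and retrying.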

6. Use idempotency on outbound provider calls

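Calls to refund, payment, and other state-changing APIs should carry a stable idempotency key where the provider supports one. Derive the key from the triggering event rather than generating a fresh one per attempt, so a retried call does not repeat the side effect.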
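One possible way to derive a stable key is a name-based UUID over the action and the source event ID. The namespace value, the idempotency_key helper, and the FakeProvider class are all hypothetical; a real provider documents its own key format and deduplication rules.

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as it never changes.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def idempotency_key(action, event_id):
    """Derive the same key every time for the same action and source event."""
    return str(uuid.uuid5(NAMESPACE, f"{action}:{event_id}"))


class FakeProvider:
    """Stand-in for a provider API that deduplicates on the idempotency key."""

    def __init__(self):
        self.refunds = {}

    def refund(self, payment_id, key):
        # A repeated key returns the original result instead of refunding again.
        if key not in self.refunds:
            self.refunds[key] = {"payment_id": payment_id, "status": "refunded"}
        return self.refunds[key]


provider = FakeProvider()
key = idempotency_key("refund", "evt_123")
provider.refund("pay_1", key)
# Retry after a crash: the key is recomputed, not stored, and still matches.
provider.refund("pay_1", idempotency_key("refund", "evt_123"))
```

Because the key is a pure function of the event, a worker that crashes and restarts sends the same key without having to remember it.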

7. Handle unknown states gracefully

Duplicate events, events for deleted resources, and unexpected transitions should be safe no-ops or trigger reconciliation. They should never crash the handler.
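A sketch of a handler that classifies odd input instead of raising. The transition table and the three return values are assumptions chosen for illustration.

```python
# Hypothetical transition table: current status -> statuses it may move to.
ALLOWED = {
    "pending": {"active", "cancelled"},
    "active": {"cancelled"},
    "cancelled": set(),
}


def apply_transition(resources, resource_id, new_status):
    """Return 'applied', 'noop', or 'reconcile'. Never raises on odd input."""
    current = resources.get(resource_id)
    if current is None:
        return "reconcile"  # unknown or deleted resource: verify with provider
    if current == new_status:
        return "noop"  # duplicate event
    if new_status not in ALLOWED.get(current, set()):
        return "reconcile"  # odd transition: do not guess, verify
    resources[resource_id] = new_status
    return "applied"


resources = {"sub_1": "active"}
results = [
    apply_transition(resources, "sub_1", "cancelled"),  # valid
    apply_transition(resources, "sub_1", "cancelled"),  # duplicate
    apply_transition(resources, "sub_1", "active"),     # odd transition
    apply_transition(resources, "sub_9", "active"),     # unknown resource
]
```

Anything marked reconcile is handed to the reconciliation job rather than forced through, so one strange event cannot block the events behind it.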

Pattern That Works in Production

Webhook -> Store -> Process -> Reconcile.

This model gives durability, observability, and correction mechanisms when delivery is imperfect.

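What Not To Do

Do not run business logic inside the request handler, do not assume each event arrives once and in order, and do not read a missing webhook as proof that nothing happened.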

The Mental Shift

Stop thinking: Webhooks tell me what happened.

Start thinking: Webhooks tell me something might have happened. Then verify.

Robust systems treat webhook events as hints, verify real state, and recover from delivery failure.