Exactly-Once Processing Is a Lie (Here's What Actually Works)
Everyone wants exactly-once processing: run one job once, charge once, send one email, apply one state change. It sounds clean, predictable, and safe.
In real distributed systems, that guarantee does not exist. It would require perfect network conditions, perfect coordination, zero retries, and zero crashes, and you can assume none of them.
The Problem: Exactly Once Is Not a Real Guarantee
To guarantee true exactly-once behavior end to end, every layer would need flawless reliability: queue transport, worker runtime, storage, external APIs, and acknowledgement flow.
What you actually have in production:
- At-least-once delivery from queues
- At-most-once (maybe-once) behavior when outages drop messages or retries
- Unknown execution state after crashes
So in practice, operations can run multiple times or not run at all. Correctness must come from your design, not a transport promise.
Where the Illusion Comes From
1. It worked in testing
Local environments rarely include latency, network partitions, process restarts, and retry storms. That hides duplicate and lost execution paths.
2. The queue says exactly once
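Queue vendors usually mean something narrower: duplicate deliveries are suppressed inside the broker, often only within a deduplication window or a single transactional scope. That is not the same as exactly-once side effects across your whole system.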
3. We use transactions
A transaction covers only the operations inside one database. It does not make external API calls, queue delivery, or other side effects exactly once.
The Real Failure Modes
Duplicate execution
A timeout triggers a retry while the original worker still runs. Both complete and apply side effects.
Partial execution
A job updates the database, then crashes before completion state is recorded. The retry repeats the mutation.
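The partial-execution path is easy to reproduce. The sketch below is a simulation under stated assumptions, not a real queue: the `accounts` and `jobs` tables and the `run_job` handler are hypothetical, and the crash is modeled by returning before the completion flag is written.

```python
import sqlite3

def setup() -> sqlite3.Connection:
    # In-memory database with one account and one pending job (both hypothetical).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, done INTEGER NOT NULL DEFAULT 0)")
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 0)")
    conn.execute("INSERT INTO jobs (id) VALUES (42)")
    conn.commit()
    return conn

def run_job(conn: sqlite3.Connection, job_id: int, crash_before_done: bool) -> None:
    # Step 1: the mutation is committed on its own.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1")
    if crash_before_done:
        return  # simulated crash: completion is never recorded
    # Step 2: completion state is recorded in a separate commit.
    with conn:
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))

def balance(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]

conn = setup()
run_job(conn, 42, crash_before_done=True)   # first attempt dies between the two steps
run_job(conn, 42, crash_before_done=False)  # the retry repeats the mutation
print(balance(conn))  # prints 200, although the job was meant to add 100 once
```

Nothing in the handler ties the balance change to the job, so the retry cannot tell that step 1 already happened.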
Lost execution
A job is never picked up or fails silently without recovery. The expected side effect never happens.
The Core Truth
You cannot guarantee "this runs exactly once." You can guarantee "this produces a correct result no matter how many times it runs."
What Actually Works
1. Idempotency everywhere
Every operation must be safe to repeat.
-- Bad: every retry charges the user again
charge_user(user_id, 100)
-- Good: a repeated call with the same payment_id is a no-op
charge_user_if_not_already_charged(user_id, payment_id)
Back it with a uniqueness boundary such as UNIQUE(payment_id).
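As a minimal sketch of that boundary, the following uses Python's built-in `sqlite3` module (assuming SQLite 3.24 or newer for `ON CONFLICT`). The `charges` table, its columns, and the extra `amount_cents` parameter are assumptions made for illustration; only the function name and `UNIQUE(payment_id)` come from the text above.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE charges (
            payment_id   TEXT NOT NULL UNIQUE,  -- the uniqueness boundary
            user_id      INTEGER NOT NULL,
            amount_cents INTEGER NOT NULL
        )
        """
    )
    conn.commit()
    return conn

def charge_user_if_not_already_charged(conn, user_id, payment_id, amount_cents) -> bool:
    """Record the charge once. Returns True only for the call that inserted the row."""
    with conn:
        cur = conn.execute(
            "INSERT INTO charges (payment_id, user_id, amount_cents) "
            "VALUES (?, ?, ?) ON CONFLICT (payment_id) DO NOTHING",
            (payment_id, user_id, amount_cents),
        )
        # rowcount is 0 when the unique constraint suppressed the insert.
        return cur.rowcount == 1

def total_charged(conn, user_id) -> int:
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM charges WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]

conn = make_db()
first = charge_user_if_not_already_charged(conn, 1, "pay-123", 100)
retry = charge_user_if_not_already_charged(conn, 1, "pay-123", 100)
print(first, retry, total_charged(conn, 1))  # prints: True False 100
```

In a real system the call to the payment provider would be made only when the function returns True, ideally passing the same `payment_id` along as the provider's idempotency key.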
2. Write-first ownership
Claim ownership atomically before execution:
INSERT INTO processed_jobs (job_id)
VALUES (X)
ON CONFLICT DO NOTHING;
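Proceed only when the insert actually adds a row. With ON CONFLICT DO NOTHING a duplicate does not raise an error, so check that the affected row count is 1.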
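A rough illustration of the claim step in Python with the standard `sqlite3` module (SQLite 3.24 or newer assumed). The helper names `try_claim` and `handle_delivery` are hypothetical, and the `executed` list stands in for real work.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_jobs (job_id TEXT PRIMARY KEY)")
    conn.commit()
    return conn

def try_claim(conn: sqlite3.Connection, job_id: str) -> bool:
    """Atomically claim job_id. True means this caller owns the job."""
    with conn:
        cur = conn.execute(
            "INSERT INTO processed_jobs (job_id) VALUES (?) ON CONFLICT DO NOTHING",
            (job_id,),
        )
        # A duplicate claim is not an error; it simply affects zero rows.
        return cur.rowcount == 1

def handle_delivery(conn: sqlite3.Connection, job_id: str, executed: list) -> None:
    if not try_claim(conn, job_id):
        return  # someone else already claimed this job
    executed.append(job_id)  # stand-in for the real side effect

conn = make_db()
executed = []
for _ in range(3):  # the queue delivers the same job three times
    handle_delivery(conn, "job-1", executed)
print(executed)  # prints ['job-1']
```

A claim on its own trades duplicates for possible loss: a worker that crashes right after claiming leaves the job marked but undone, so the claim needs to be paired with some record of completion.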
3. State machines, not boolean flags
Use guarded transitions instead of one completed flag:
UPDATE jobs
SET status = 'processing'
WHERE id = X AND status = 'pending';
If zero rows are updated, another worker already owns the transition or the job is no longer pending.
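The guarded transition can be wrapped in a small helper. This is a sketch using Python's `sqlite3` module; the `transition` function and the `'done'` status are assumptions added for illustration, while the `jobs` table and the pending-to-processing guard follow the SQL above.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.execute("INSERT INTO jobs (id) VALUES (7)")
    conn.commit()
    return conn

def transition(conn: sqlite3.Connection, job_id: int, from_status: str, to_status: str) -> bool:
    """Move the job to to_status only if it is currently in from_status."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = ?",
            (to_status, job_id, from_status),
        )
        # Zero rows means the guard failed: the job was not in from_status.
        return cur.rowcount == 1

conn = make_db()
print(transition(conn, 7, "pending", "processing"))  # True: this worker owns the job
print(transition(conn, 7, "pending", "processing"))  # False: a second worker loses the race
print(transition(conn, 7, "processing", "done"))     # True: completion is also guarded
```

Because every step names the state it expects, a late or duplicate worker cannot move a job backwards or complete it twice.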
4. External idempotency keys
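For payments, email, inventory, and rewards, include a stable idempotency key on outbound calls. Derive it from the business operation rather than generating it per attempt, so every retry sends the same key. If the provider does not support idempotency keys, deduplicate on your side.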
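To make "stable" concrete, here is a small sketch. `FakeEmailProvider` is a hypothetical in-memory stand-in for a provider that honors idempotency keys, not a real client library, and deriving the key with `uuid.uuid5` is just one possible choice.

```python
import uuid

def idempotency_key(operation: str, business_id: str) -> str:
    """Derive the same key for every retry of the same business operation."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{operation}:{business_id}"))

class FakeEmailProvider:
    """Hypothetical provider that replays the stored response for a known key."""

    def __init__(self) -> None:
        self._responses = {}
        self.emails_sent = 0

    def send_email(self, to: str, body: str, key: str) -> dict:
        if key in self._responses:
            return self._responses[key]  # duplicate request: no second email
        self.emails_sent += 1
        response = {"message_id": self.emails_sent, "to": to}
        self._responses[key] = response
        return response

provider = FakeEmailProvider()
key = idempotency_key("order-confirmation", "order-981")

first = provider.send_email("a@example.com", "Thanks for your order", key)
retry = provider.send_email("a@example.com", "Thanks for your order", key)
print(provider.emails_sent, first == retry)  # prints: 1 True
```

A key generated fresh on each attempt (a random UUID, for example) would defeat the purpose, because the provider would see every retry as a new request. When the provider has no such support, the same key can be recorded in a local table with a unique constraint before the call, with the caveat that a crash between recording and calling leaves the outcome unknown.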
5. Accept at-least-once as reality
Assume duplicates, reordering, and retries are normal. Build correctness on top of that behavior.
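One way to stay correct under duplicates and reordering is to version each update and apply it only if it is newer than what is stored. The sketch below assumes each event carries a monotonically increasing `version` and an absolute `quantity`; the `inventory` table and `apply_event` helper are hypothetical, and this is only one of several possible techniques.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " sku TEXT PRIMARY KEY, version INTEGER NOT NULL, quantity INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory (sku, version, quantity) VALUES ('widget', 0, 0)")
    conn.commit()
    return conn

def apply_event(conn: sqlite3.Connection, sku: str, version: int, quantity: int) -> bool:
    """Apply the event only if it is newer than the stored version."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET version = ?, quantity = ? "
            "WHERE sku = ? AND version < ?",
            (version, quantity, sku, version),
        )
        # Duplicates and stale events fail the guard and change nothing.
        return cur.rowcount == 1

def current(conn: sqlite3.Connection, sku: str) -> tuple:
    return conn.execute(
        "SELECT version, quantity FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()

conn = make_db()
# (version, quantity) pairs arriving out of order and with duplicates.
deliveries = [(2, 7), (1, 10), (2, 7), (3, 5), (1, 10)]
applied = [apply_event(conn, "widget", v, q) for v, q in deliveries]
print(applied)                  # prints [True, False, False, True, False]
print(current(conn, "widget"))  # prints (3, 5)
```

The final state depends only on the highest version seen, not on how many times or in what order the events were delivered.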
Production Pattern
Use an idempotent execution lock plus completion tracking:
INSERT INTO job_lock (job_id)
VALUES (X)
ON CONFLICT DO NOTHING;
If the insert affects zero rows, the job has already been claimed or handled.
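The SQL above shows only the lock. A minimal sketch of lock plus completion tracking might look like the following, using Python's `sqlite3` module (SQLite 3.24 or newer assumed). The `status` column and the `run_once` and `unfinished_jobs` helpers are assumptions added for illustration, not part of the original schema.

```python
import sqlite3

def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_lock (job_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn

def run_once(conn: sqlite3.Connection, job_id: int, work) -> str:
    # Step 1: claim the job. A duplicate claim affects zero rows.
    with conn:
        cur = conn.execute(
            "INSERT INTO job_lock (job_id, status) VALUES (?, 'claimed') "
            "ON CONFLICT DO NOTHING",
            (job_id,),
        )
        claimed = cur.rowcount == 1
    if not claimed:
        return "skipped"
    # Step 2: do the work. If this raises, the row stays in 'claimed'.
    work(job_id)
    # Step 3: record completion.
    with conn:
        conn.execute("UPDATE job_lock SET status = 'done' WHERE job_id = ?", (job_id,))
    return "done"

def unfinished_jobs(conn: sqlite3.Connection) -> list:
    """Jobs that were claimed but never completed, e.g. after a crash."""
    rows = conn.execute(
        "SELECT job_id FROM job_lock WHERE status = 'claimed' ORDER BY job_id"
    ).fetchall()
    return [row[0] for row in rows]

def crashing_work(job_id: int) -> None:
    raise RuntimeError("simulated crash")

conn = make_db()
calls = []
print(run_once(conn, 1, calls.append))  # prints done
print(run_once(conn, 1, calls.append))  # prints skipped
try:
    run_once(conn, 2, crashing_work)
except RuntimeError:
    pass
print(unfinished_jobs(conn))  # prints [2]
```

Completion tracking is what turns a stuck claim into something visible: a periodic sweep can find jobs left in `claimed` and decide whether to retry or alert. Retrying them is only safe if the work itself is idempotent.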
What Not To Do
- Do not trust queue marketing terms as business correctness guarantees.
- Do not assume retries are safe without idempotent boundaries.
- Do not accept "it rarely duplicates" as a reliability strategy.
- Do not design for ideal conditions that production never provides.
The Mental Shift
Stop asking: How do I make this run exactly once?
Start asking: How do I keep outcomes correct even if this runs many times?
Systems survive when they tolerate retries, duplicate delivery, and partial failure without corrupting state.