April 16, 2026 - 9 min read - Distributed Systems

Exactly-Once Processing Is a Lie (Here's What Actually Works)

Everyone wants exactly-once processing: run one job once, charge once, send one email, apply one state change. It sounds clean, predictable, and safe.

In real distributed systems, that guarantee does not exist as an end-to-end property. Networks drop and delay messages, processes crash mid-operation, and senders retry whenever they cannot tell whether a request succeeded.

The Problem: Exactly-Once Is Not a Real Guarantee

To guarantee true exactly-once behavior end to end, every layer would need flawless reliability: queue transport, worker runtime, storage, external APIs, and acknowledgement flow.

What you actually have in production:

At-least-once delivery from queues
Maybe-once behavior under outages and dropped retries
Unknown execution state after crashes

So in practice, operations can run multiple times or not run at all. Correctness must come from your design, not a transport promise.

Where the Illusion Comes From

1. It worked in testing

Local environments rarely include latency, network partitions, process restarts, and retry storms. That hides duplicate and lost execution paths.

2. The queue says exactly once

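When a queue advertises exactly-once, it usually means deduplicated delivery or exactly-once processing inside the broker's own boundary, often limited to a deduplication window or to operations that never leave the broker. That is not the same as exactly-once side effects across your database and external services.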

3. We use transactions

A transaction makes the writes inside one database atomic. It does not extend to external API calls, queue acknowledgements, or any other side effect outside that database, so none of those become exactly once.

The Real Failure Modes

Duplicate execution

A timeout triggers a retry while the original worker still runs. Both complete and apply side effects.

Partial execution

A job updates the database, then crashes before completion state is recorded. The retry repeats the mutation.
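When the mutation and the completion marker live in the same database, one way to close this gap is to commit them in a single transaction. The sketch below uses Python's standard-library sqlite3 module as a stand-in for your database; the accounts and completed_jobs tables, the apply_credit function, and the crash_midway flag are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("CREATE TABLE completed_jobs (job_id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO accounts (id, balance) VALUES ('acct-1', 0)")
conn.commit()


def apply_credit(job_id: str, account_id: str, amount: int,
                 crash_midway: bool = False) -> None:
    """Apply the mutation and record completion in one transaction."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        done = conn.execute(
            "SELECT 1 FROM completed_jobs WHERE job_id = ?", (job_id,)
        ).fetchone()
        if done:
            return  # a previous attempt already finished this job
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account_id),
        )
        if crash_midway:
            # Simulated crash between the mutation and the completion marker.
            raise RuntimeError("simulated crash")
        conn.execute("INSERT INTO completed_jobs (job_id) VALUES (?)", (job_id,))


def balance(account_id: str) -> int:
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]
```

Because the failed attempt rolls back, the retry starts from a clean state instead of repeating the mutation. This only helps when every effect of the job is inside that one database; external calls still need the techniques described later in this post.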

Lost execution

A job is never picked up or fails silently without recovery. The expected side effect never happens.

The Core Truth

You cannot guarantee "this runs exactly once." You can guarantee "this produces a correct result no matter how many times it runs."

What Actually Works

1. Idempotency everywhere

Every operation must be safe to repeat.

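-- Bad: every retry charges the user again
charge_user(user_id, 100)

-- Good: payment_id identifies this specific charge, so a retry is a no-op
charge_user_if_not_already_charged(user_id, payment_id)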

Back it with a uniqueness constraint such as UNIQUE(payment_id), so the database rejects the second attempt even when two workers race.
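Here is a minimal sketch of that idea in Python, with the standard-library sqlite3 module standing in for your database. The charges table and the charge_user_once function are hypothetical; the point is only that the uniqueness constraint, not application logic, decides whether a charge is new.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE charges (
        payment_id TEXT PRIMARY KEY,  -- the uniqueness boundary
        user_id    TEXT NOT NULL,
        amount     INTEGER NOT NULL
    )
    """
)


def charge_user_once(user_id: str, payment_id: str, amount: int) -> bool:
    """Record a charge; return True only for the first call per payment_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO charges (payment_id, user_id, amount) VALUES (?, ?, ?)",
                (payment_id, user_id, amount),
            )
    except sqlite3.IntegrityError:
        return False  # duplicate payment_id: this charge was already recorded
    return True
```

Calling charge_user_once("u1", "pay-1", 100) twice records one row and returns False the second time. Note that this sketch only records the charge locally; the outbound call to a payment provider needs its own idempotency key.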

2. Write-first ownership

Claim ownership atomically before execution:

INSERT INTO processed_jobs (job_id)
VALUES (X)
ON CONFLICT DO NOTHING;

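Proceed only when the insert actually adds a row. With ON CONFLICT DO NOTHING a duplicate does not raise an error; the statement reports zero rows affected, so check the affected-row count.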
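The claim step might look like this in Python. It is a sketch that assumes SQLite 3.24 or newer (for ON CONFLICT support) via the standard-library sqlite3 module; try_claim and run_job are hypothetical helper names, and the side_effects list stands in for the real work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_jobs (job_id TEXT PRIMARY KEY)")


def try_claim(job_id: str) -> bool:
    """Return True if this call inserted the claim row, False if it already existed."""
    with conn:
        cur = conn.execute(
            "INSERT INTO processed_jobs (job_id) VALUES (?) "
            "ON CONFLICT (job_id) DO NOTHING",
            (job_id,),
        )
    # A skipped insert is not an error; it simply affects zero rows.
    return cur.rowcount == 1


def run_job(job_id: str, side_effects: list) -> None:
    if not try_claim(job_id):
        return  # another delivery already claimed this job
    side_effects.append(job_id)  # stand-in for the real work
```

Running run_job twice with the same job_id appends to side_effects once. The claim alone does not tell you whether the work finished, which is why completion needs to be tracked as well.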

3. State machines, not boolean flags

Use guarded transitions instead of one completed flag:

UPDATE jobs
SET status = 'processing'
WHERE id = X AND status = 'pending';

If zero rows are updated, another worker already owns the transition.
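A guarded transition can be wrapped in a small helper. This is a sketch using the standard-library sqlite3 module; the 'done' and 'failed' states, the ALLOWED table, and the transition function are assumptions added for illustration, since the post only names 'pending' and 'processing'.

```python
import sqlite3

# Hypothetical transition table; anything not listed here is rejected up front.
ALLOWED = {
    ("pending", "processing"),
    ("processing", "done"),
    ("processing", "failed"),
    ("failed", "pending"),
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs (id, status) VALUES ('job-1', 'pending')")
conn.commit()


def transition(job_id: str, expected: str, new: str) -> bool:
    """Move job_id from expected to new; return False if we lost the race."""
    if (expected, new) not in ALLOWED:
        raise ValueError(f"illegal transition {expected} -> {new}")
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = ?",
            (new, job_id, expected),
        )
    # Zero rows means the job was not in the expected state any more.
    return cur.rowcount == 1
```

The WHERE clause is what makes this safe under concurrency: the database applies the check and the write as one atomic statement, so only one caller can win each transition.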

4. External idempotency keys

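For payments, email, inventory, and rewards, include a stable idempotency key on outbound calls: the same key on every retry of the same logical operation. If the provider does not support idempotency keys, deduplicate on your side by recording the key in your own store around the call.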
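The sketch below shows both halves in Python: deriving a key that stays the same across retries, and deduplicating locally in front of a provider that has no idempotency support. FakeEmailProvider, DedupingSender, and idempotency_key are all hypothetical names, and the in-memory set stands in for what would need to be a durable, unique-keyed table in production.

```python
import hashlib


def idempotency_key(operation: str, entity_id: str) -> str:
    """Derive the same key for every retry of the same logical operation."""
    return hashlib.sha256(f"{operation}:{entity_id}".encode()).hexdigest()


class FakeEmailProvider:
    """Stand-in for an external provider with no idempotency support."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


class DedupingSender:
    """Wraps a provider and skips calls whose key has been seen before."""

    def __init__(self, provider: FakeEmailProvider) -> None:
        self.provider = provider
        self.seen_keys = set()  # assumption: a durable store in real systems

    def send_once(self, key: str, to: str, body: str) -> bool:
        if key in self.seen_keys:
            return False  # retry of an operation we already attempted
        self.seen_keys.add(key)  # record first, then call the provider
        self.provider.send(to, body)
        return True
```

Recording the key before the call means a crash between the two steps can drop the send; recording it after the call means a crash can duplicate it. Without provider-side support you have to pick which failure is less harmful for that operation.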

5. Accept at-least-once as reality

Assume duplicates, reordering, and retries are normal. Build correctness on top of that behavior.
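One common way to tolerate both duplicates and reordering is to version every update and ignore anything that is not newer than what you already hold. The sketch below assumes each event carries a per-entity version number that increases with every change; the Record class and apply_event function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: str = ""
    version: int = 0


def apply_event(record: Record, value: str, version: int) -> bool:
    """Apply the event only if it is newer than the state we already have."""
    if version <= record.version:
        return False  # duplicate or stale event: safe to drop
    record.value = value
    record.version = version
    return True


# Delivery order is scrambled and one event arrives twice.
record = Record()
deliveries = [("shipped", 2), ("paid", 1), ("shipped", 2)]
results = [apply_event(record, value, version) for value, version in deliveries]
```

After all three deliveries the record holds the version 2 state, the same result as a clean in-order delivery. This last-writer-wins approach fits state that each event fully replaces; events that must each be applied (such as increments) need per-event deduplication instead.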

Production Pattern

Use an idempotent execution lock plus completion tracking:

INSERT INTO job_lock (job_id)
VALUES (X)
ON CONFLICT DO NOTHING;

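If the insert affects zero rows, the job has already been claimed or handled. Record completion separately from the claim, so a job whose worker crashed after claiming can be told apart from a finished one and retried.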
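A sketch of the lock combined with completion tracking, in Python with the standard-library sqlite3 module (SQLite 3.24 or newer assumed). The claimed_at and done columns, the LEASE_SECONDS timeout, and the claim and mark_done helpers are assumptions added for illustration; the post itself only specifies the job_lock insert.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE job_lock (
        job_id     TEXT PRIMARY KEY,
        claimed_at REAL NOT NULL,
        done       INTEGER NOT NULL DEFAULT 0
    )
    """
)

LEASE_SECONDS = 30.0  # assumed upper bound on how long one attempt may take


def claim(job_id: str, now: float = None) -> bool:
    """Claim a new job, or take over an unfinished one whose lease expired."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "INSERT INTO job_lock (job_id, claimed_at) VALUES (?, ?) "
            "ON CONFLICT (job_id) DO NOTHING",
            (job_id, now),
        )
        if cur.rowcount == 1:
            return True  # first claim for this job
        # Row exists: take over only if it is unfinished and the lease ran out.
        cur = conn.execute(
            "UPDATE job_lock SET claimed_at = ? "
            "WHERE job_id = ? AND done = 0 AND claimed_at <= ?",
            (now, job_id, now - LEASE_SECONDS),
        )
        return cur.rowcount == 1


def mark_done(job_id: str) -> None:
    """Record completion so later deliveries are ignored permanently."""
    with conn:
        conn.execute("UPDATE job_lock SET done = 1 WHERE job_id = ?", (job_id,))
```

The lease keeps a crashed worker from blocking the job forever, but a slow worker that is still running can be overtaken when its lease expires. The work inside the lock therefore still has to be idempotent; the lock reduces duplicates, it does not eliminate them.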

What Not To Do

Do not trust queue marketing terms as business correctness guarantees.
Do not assume retries are safe without idempotent boundaries.
Do not accept "it rarely duplicates" as a reliability strategy.
Do not engineer around ideal conditions that production never has.

The Mental Shift

Stop asking: How do I make this run exactly once?

Start asking: How do I keep outcomes correct even if this runs many times?

Systems survive when they tolerate retries, duplicate delivery, and partial failure without corrupting state.