Your Queue Isn't Reliable: Why Jobs Disappear (and How to Make Them Survive)

You added a queue. Good move. Background jobs, async processing, retries, and scalability all look clean until production traffic reveals the ugly edges.

Then the weird failures show up: jobs that never run, jobs that run twice, jobs that vanish completely, retries that do not actually retry, and jobs marked complete that never actually finished.

Your system can look stable while the queue silently drops work.

The Problem: Queues Don't Guarantee Execution

Enqueueing a job does not guarantee successful completion. Queues mostly guarantee delivery attempts. Between publish and completion, workers crash, acknowledgements fail, visibility windows expire, and duplicates appear.

Where Jobs Actually Die

1. Worker crash mid-execution

job_id = 42
status = processing

If the worker dies mid-job, behavior depends on queue semantics and your implementation. Jobs may retry, get stuck, or disappear into a ghost state that is hard to detect.

2. Acknowledgement races

Work can complete while the ack fails because of a network blip. The queue retries the same message, and side effects execute twice.

3. Visibility timeout expiry

In SQS-style systems, a job becomes visible again when timeout expires. Long-running work can be picked up by another worker and run concurrently with the first.
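One common mitigation is a heartbeat that keeps extending the visibility timeout while the worker is still alive. A minimal sketch, assuming your queue client exposes some extend-visibility call (like SQS's ChangeMessageVisibility); the `extend_visibility` callback here is a hypothetical stand-in:

```python
import threading
import time

def run_with_heartbeat(work, extend_visibility, interval=0.05):
    """Run work() while periodically calling extend_visibility(),
    so the queue does not redeliver the message mid-run."""
    stop = threading.Event()

    def heartbeat():
        # Fires every `interval` seconds until the work finishes.
        while not stop.wait(interval):
            extend_visibility()

    t = threading.Thread(target=heartbeat, daemon=True)
    t.start()
    try:
        return work()
    finally:
        stop.set()
        t.join()

# Demo: an in-memory counter stands in for the real queue call.
extensions = []

def slow_work():
    time.sleep(0.2)  # long-running job body
    return "finished"

result = run_with_heartbeat(slow_work, lambda: extensions.append(1))
```

The heartbeat only narrows the window; it does not eliminate duplicates, so the idempotency techniques below still apply.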

4. Retry logic that replays side effects

retry_count += 1
if retry_count < 3 then retry()

This misses partial execution and previously applied external side effects. You are not retrying a pure function. You are replaying real-world actions.

The Core Mistake

Teams treat the queue as the source of truth. It is not. The queue is transport. Job outcome state must live in your own durable data model.

The Fix: Design Jobs That Survive Failure

1. Make every job idempotent

Jobs must be safe to run once, twice, or ten times.

-- Bad
give_player_money(player_id, 1000)

-- Good
grant_reward_if_not_given(player_id, reward_id)

Back this with a uniqueness constraint such as UNIQUE(player_id, reward_id).
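A minimal sketch of that pattern, using SQLite in place of a production database (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rewards (
        player_id INTEGER NOT NULL,
        reward_id TEXT NOT NULL,
        amount    INTEGER NOT NULL,
        UNIQUE (player_id, reward_id)  -- replays hit this and no-op
    )
""")

def grant_reward_if_not_given(player_id, reward_id, amount):
    # INSERT OR IGNORE makes the grant idempotent: the unique
    # constraint swallows duplicates instead of double-paying.
    cur = conn.execute(
        "INSERT OR IGNORE INTO rewards (player_id, reward_id, amount) "
        "VALUES (?, ?, ?)",
        (player_id, reward_id, amount),
    )
    return cur.rowcount == 1  # True only on the first grant

first = grant_reward_if_not_given(42, "daily-login", 1000)
replay = grant_reward_if_not_given(42, "daily-login", 1000)
```

The database, not the application code, is what makes the second call safe.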

2. Persist job state outside the queue

Create and maintain your own job ledger:

jobs (
  job_id,
  status,
  updated_at
)

Track processing, completed, and failed explicitly. This makes stuck-job detection, safe replays, and audits possible.
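A ledger along these lines can be sketched with SQLite standing in for your database; the schema mirrors the one above, with statuses as an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        job_id     INTEGER PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'pending',
        updated_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

def set_status(job_id, status):
    # Every transition bumps updated_at, which is exactly what
    # stuck-job detection keys on later.
    conn.execute(
        "UPDATE jobs SET status = ?, updated_at = datetime('now') "
        "WHERE job_id = ?",
        (status, job_id),
    )

conn.execute("INSERT INTO jobs (job_id) VALUES (42)")
set_status(42, "processing")
set_status(42, "completed")
status = conn.execute(
    "SELECT status FROM jobs WHERE job_id = 42"
).fetchone()[0]
```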

3. Separate execution from completion

if not already_completed(job_id) then
  process_job()
  mark_complete(job_id)
end

Completion is durable state, not a queue assumption.
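The pseudocode above can be sketched concretely against a job ledger; SQLite here is a stand-in, and `run_job` is a hypothetical wrapper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO jobs VALUES (42, 'pending')")

def run_job(job_id, process_job):
    # Check durable completion state first: a redelivered message
    # finds the row already completed and becomes a safe no-op.
    row = conn.execute(
        "SELECT status FROM jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row and row[0] == "completed":
        return "skipped"
    process_job()
    conn.execute(
        "UPDATE jobs SET status = 'completed' WHERE job_id = ?", (job_id,)
    )
    return "ran"

effects = []
first = run_job(42, lambda: effects.append("side effect"))
second = run_job(42, lambda: effects.append("side effect"))  # duplicate delivery
```

The side effect runs once even though the job was delivered twice.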

4. Use idempotency keys for external effects

Payments, email, inventory, and rewards need unique idempotency keys on outbound calls so duplicates no-op safely in downstream systems.
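One way to build such keys is to derive them deterministically from the job's identity rather than generating a random value per attempt, so every retry sends the same key. A sketch, with the namespace URL as an illustrative assumption:

```python
import uuid

def idempotency_key(job_id, action):
    # uuid5 is deterministic: the same (job_id, action) always yields
    # the same key, so retries deduplicate downstream instead of
    # repeating the side effect.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"app://jobs/{job_id}/{action}"))

k1 = idempotency_key(42, "charge")
k2 = idempotency_key(42, "charge")   # a retry produces the identical key
k3 = idempotency_key(43, "charge")   # a different job gets a different key
```

Many payment APIs accept such a value as an `Idempotency-Key` request header; check your provider's documentation for the exact mechanism.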

5. Detect and recover stuck jobs

Continuously query long-running processing rows, then reset, requeue, or fail them based on policy.

SELECT job_id
FROM jobs
WHERE status = 'processing'
  AND updated_at < now() - interval '5 minutes';
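A minimal sweeper built on that idea, with SQLite standing in for a production database and the reset-to-pending policy as an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        job_id     INTEGER PRIMARY KEY,
        status     TEXT NOT NULL,
        updated_at TEXT NOT NULL
    )
""")
# One fresh job and one stuck in 'processing' far too long.
conn.execute("INSERT INTO jobs VALUES (1, 'processing', datetime('now'))")
conn.execute(
    "INSERT INTO jobs VALUES (2, 'processing', datetime('now', '-10 minutes'))"
)

def sweep_stuck_jobs(threshold="-5 minutes"):
    # Reset anything stuck past the threshold; a real sweeper might
    # requeue or fail them instead, depending on policy.
    cur = conn.execute(
        "UPDATE jobs SET status = 'pending', updated_at = datetime('now') "
        "WHERE status = 'processing' AND updated_at < datetime('now', ?)",
        (threshold,),
    )
    return cur.rowcount

reset = sweep_stuck_jobs()
```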

6. Design for duplicate workers

Assume two workers will process the same job. Use constraints, conditional updates, and minimal locking so one succeeds and others safely no-op.
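A conditional update makes this concrete: the WHERE clause is the lock. A sketch with SQLite as a stand-in:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO jobs VALUES (42, 'pending')")

def try_claim(job_id):
    # Only one worker can flip pending -> processing; the losing
    # worker sees rowcount 0 and safely no-ops.
    cur = conn.execute(
        "UPDATE jobs SET status = 'processing' "
        "WHERE job_id = ? AND status = 'pending'",
        (job_id,),
    )
    return cur.rowcount == 1

worker_a = try_claim(42)
worker_b = try_claim(42)  # second worker, same job
```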

7. Treat at-least-once as maybe-many-times

At-least-once delivery without idempotency protection leads to double charges, double rewards, and duplicated side effects.

Pattern That Works in Production

Use an insert-first execution lock:

INSERT INTO job_execution (job_id)
VALUES (42)
ON CONFLICT DO NOTHING;

If the insert succeeds, this worker owns execution. If it conflicts, another worker already claimed the job.
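The same lock can be sketched end to end with SQLite (which supports `ON CONFLICT DO NOTHING` since 3.24) standing in for a production database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_execution (job_id INTEGER PRIMARY KEY)")

def acquire(job_id):
    # Insert-first lock: the primary key makes the row a one-shot
    # claim. The first insert wins; conflicts become no-ops.
    cur = conn.execute(
        "INSERT INTO job_execution (job_id) VALUES (?) "
        "ON CONFLICT DO NOTHING",
        (job_id,),
    )
    return cur.rowcount == 1

owner = acquire(42)
loser = acquire(42)  # duplicate delivery or second worker
```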

What Not To Do

Do not treat the queue as your source of truth for job outcomes. Do not retry work that replays external side effects without idempotency keys. Do not assume exactly-once delivery; design for at-least-once. Do not leave processing rows unmonitored with no sweep to recover them.

The Mental Shift

Stop asking: Did my job run?

Start asking: Can this job run any number of times without breaking anything?

Queues are best-effort delivery systems. Your architecture survives when duplicates are safe, state is tracked independently, and failure recovery is built in by default.