Your Queue Isn't Reliable: Why Jobs Disappear (and How to Make Them Survive)
You added a queue. Good move. Background jobs, async processing, retries, and scalability all look clean until production traffic reveals the ugly edges.
Then weird failures show up: jobs never run, some run twice, some vanish completely, retries do not retry correctly, and jobs marked complete never actually finished.
Your system can look stable while the queue silently drops work.
The Problem: Queues Don't Guarantee Execution
Enqueueing a job does not guarantee successful completion. Queues mostly guarantee delivery attempts. Between publish and completion, workers crash, acknowledgements fail, visibility windows expire, and duplicates appear.
Where Jobs Actually Die
1. Worker crash mid-execution
job_id = 42
status = processing
If the worker dies mid-job, behavior depends on queue semantics and your implementation. Jobs may retry, get stuck, or disappear into a ghost state that is hard to detect.
2. Acknowledgement races
Work can complete while the ack fails because of a network blip. The queue retries the same message, and side effects execute twice.
3. Visibility timeout expiry
In SQS-style systems, a message becomes visible again when its visibility timeout expires. Long-running work can be picked up by a second worker and run concurrently with the first.
4. Retry logic that replays side effects
retry_count += 1
if retry_count < 3 then retry()
This misses partial execution and previously applied external side effects. You are not retrying a pure function. You are replaying real-world actions.
The Core Mistake
Teams treat the queue as the source of truth. It is not. The queue is transport. Job outcome state must live in your own durable data model.
The Fix: Design Jobs That Survive Failure
1. Make every job idempotent
Jobs must be safe to run once, twice, or ten times.
-- Bad
give_player_money(player_id, 1000)
-- Good
grant_reward_if_not_given(player_id, reward_id)
Back this with a uniqueness constraint such as UNIQUE(player_id, reward_id).
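A minimal sketch of this pattern using SQLite; the table and column names are illustrative, not from any specific system:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rewards (
        player_id INTEGER NOT NULL,
        reward_id TEXT NOT NULL,
        amount    INTEGER NOT NULL,
        UNIQUE (player_id, reward_id)
    )
""")

def grant_reward_if_not_given(player_id, reward_id, amount):
    # INSERT OR IGNORE no-ops when a (player_id, reward_id) row already
    # exists, so replaying the job cannot grant the reward twice.
    cur = conn.execute(
        "INSERT OR IGNORE INTO rewards (player_id, reward_id, amount)"
        " VALUES (?, ?, ?)",
        (player_id, reward_id, amount),
    )
    return cur.rowcount == 1  # True only for the first grant

first = grant_reward_if_not_given(7, "daily-login", 1000)
second = grant_reward_if_not_given(7, "daily-login", 1000)  # duplicate delivery
```

The constraint, not the application code, is what enforces idempotency: even two workers racing on the same job can only produce one row.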
2. Persist job state outside the queue
Create and maintain your own job ledger:
jobs (
job_id,
status,
updated_at
)
Track processing, completed, and failed explicitly. This makes stuck-job detection, safe replays, and audits possible.
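One way to sketch such a ledger, again with SQLite and illustrative names (a production version would add attempt counts, error details, and indexes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        job_id     TEXT PRIMARY KEY,
        status     TEXT NOT NULL
                   CHECK (status IN ('processing', 'completed', 'failed')),
        updated_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
""")

def record_status(job_id, status):
    # Upsert keeps one authoritative row per job, independent of the queue.
    conn.execute(
        """INSERT INTO jobs (job_id, status) VALUES (?, ?)
           ON CONFLICT (job_id) DO UPDATE
           SET status = excluded.status, updated_at = datetime('now')""",
        (job_id, status),
    )

record_status("42", "processing")
record_status("42", "completed")
```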
3. Separate execution from completion
if not already_completed(job_id) then
process_job()
mark_complete(job_id)
end
Completion is durable state, not a queue assumption.
4. Use idempotency keys for external effects
Payments, email, inventory, and rewards need unique idempotency keys on outbound calls so duplicates no-op safely in downstream systems.
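A toy sketch of the idea: the key is derived deterministically from the job, so every retry sends the same key, and the downstream system (stubbed here; real gateways typically accept the key as a request header) deduplicates on it:

```python
import hashlib

def idempotency_key(job_id, action):
    # Deterministic key: the same job retried always produces the same key.
    return hashlib.sha256(f"{job_id}:{action}".encode()).hexdigest()

class PaymentGateway:
    """Stand-in for a downstream system that honors idempotency keys."""
    def __init__(self):
        self.processed = {}

    def charge(self, key, amount):
        if key in self.processed:
            # Duplicate call: no-op, return the original result.
            return self.processed[key]
        result = {"charged": amount}
        self.processed[key] = result
        return result

gateway = PaymentGateway()
key = idempotency_key("42", "charge")
first = gateway.charge(key, 1000)
second = gateway.charge(key, 1000)  # retry replays the call, not the charge
```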
5. Detect and recover stuck jobs
Continuously query long-running processing rows, then reset, requeue, or fail them based on policy.
status = 'processing'
updated_at < now() - interval '5 minutes'
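That query translates directly into a sweeper. A sketch against the SQLite-flavored ledger above (the five-minute threshold is policy, not magic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        job_id     TEXT PRIMARY KEY,
        status     TEXT NOT NULL,
        updated_at TEXT NOT NULL
    )
""")
# One job abandoned ten minutes ago, one still fresh.
conn.execute("INSERT INTO jobs VALUES ('a', 'processing', datetime('now', '-10 minutes'))")
conn.execute("INSERT INTO jobs VALUES ('b', 'processing', datetime('now'))")

def find_stuck_jobs(threshold_minutes=5):
    # Rows still 'processing' past the threshold are presumed dead workers.
    return [row[0] for row in conn.execute(
        "SELECT job_id FROM jobs WHERE status = 'processing'"
        " AND updated_at < datetime('now', ?)",
        (f"-{threshold_minutes} minutes",),
    )]

stuck = find_stuck_jobs()
```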
6. Design for duplicate workers
Assume two workers will process the same job. Use constraints, conditional updates, and minimal locking so one succeeds and others safely no-op.
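The conditional-update claim can be sketched like this (illustrative schema): the WHERE clause only matches while the row is still pending, so of two racing workers exactly one update takes effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs VALUES ('42', 'pending')")

def try_claim(job_id):
    # Conditional update: only one worker's UPDATE matches the 'pending' row.
    cur = conn.execute(
        "UPDATE jobs SET status = 'processing'"
        " WHERE job_id = ? AND status = 'pending'",
        (job_id,),
    )
    return cur.rowcount == 1

worker_a = try_claim("42")  # wins the claim
worker_b = try_claim("42")  # row no longer 'pending': safe no-op
```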
7. Treat at-least-once as maybe-many-times
At-least-once delivery without idempotency protection leads to double charges, double rewards, and duplicated side effects.
Pattern That Works in Production
Use an insert-first execution lock:
INSERT INTO job_execution (job_id)
VALUES (42)
ON CONFLICT DO NOTHING;
If the insert adds a row, this worker owns execution. If it affects zero rows because the conflict fired, another worker already claimed the job.
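Checking ownership means checking the affected row count, not catching an error. A runnable sketch using SQLite, where INSERT OR IGNORE plays the role of ON CONFLICT DO NOTHING:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_execution (job_id INTEGER PRIMARY KEY)")

def acquire_execution_lock(job_id):
    # Exactly one worker's insert modifies a row; every other attempt
    # is ignored and reports zero affected rows.
    cur = conn.execute(
        "INSERT OR IGNORE INTO job_execution (job_id) VALUES (?)",
        (job_id,),
    )
    return cur.rowcount == 1

owner = acquire_execution_lock(42)      # this worker owns execution
duplicate = acquire_execution_lock(42)  # redelivery: safe no-op
```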
What Not To Do
- Do not trust queue "success" as proof of business completion.
- Do not assume jobs run exactly once.
- Do not implement retries without idempotency boundaries.
- Do not store critical state only inside queue metadata.
The Mental Shift
Stop asking: Did my job run?
Start asking: Can this job run any number of times without breaking anything?
Queues are best-effort delivery systems. Your architecture survives when duplicates are safe, state is tracked independently, and failure recovery is built in by default.