Why Your Cron Jobs Drift and Eventually Break Everything
You set up a cron job and it looks perfect: runs every 5 minutes, processes pending work, cleans old data, and syncs external systems.
Then production gets weird. Jobs run twice, runs get skipped, and daily tasks no longer align with actual days. Nothing crashes, but your system slowly drifts away from reality.
Cron failures are usually silent, gradual, and expensive.
The Problem: Time Is Not Reliable
Cron assumes scheduled time is reliable. Real systems break that assumption with restarts, clock drift, crashes, rolling deploys, and multiple instances executing the same schedule.
Cron is a best-effort trigger, not a correctness guarantee.
Where It Breaks in Real Systems
1. Missed executions
If the server is down or restarting at the moment a job should fire, standard cron does not run it later. That run is lost unless something else catches it up.
2. Duplicate runs in horizontal scale
Multiple app instances all run the same `*/5 * * * *` schedule, so side effects execute multiple times.
3. Overlapping executions
A 6-minute job scheduled every 5 minutes overlaps itself, so two copies contend for the same rows and can apply the same changes twice.
4. Clock drift
Node clocks that disagree by even a few seconds fire the same schedule at different moments, and uncorrected drift widens the gap over time.
5. Time-based logic rot
Logic tied to `now() >= scheduled_time` breaks under timezone changes, DST shifts, and clock correction.
The Core Mistake
Treating time as reliable truth instead of an unreliable hint is what causes cron drift outages.
The Fix: Build Systems That Survive Imperfect Timing
1. Drive jobs from state, not schedule ticks
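-- Bad: assumes every tick fires exactly once and covers only its own time window
every_5_minutes:
    process_orders_created_in_last(5 minutes)
-- Good: selects work by recorded state, whenever the run happens
process_orders_where(status = 'pending')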
If a run is missed, the next run catches up. If a run is duplicated, idempotent processing keeps the outcome the same.
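As a rough illustration of the state-driven approach, here is a minimal Python sketch using the standard sqlite3 module. The orders table, its status values, and the function names are assumptions made for the example, not part of any particular system. The guarded UPDATE is what makes a duplicate run harmless.

```python
import sqlite3


def setup_orders(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per order with an explicit status column.
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO orders (id, status) VALUES (?, 'pending')",
        [(1,), (2,), (3,)],
    )
    conn.commit()


def process_pending_orders(conn: sqlite3.Connection) -> int:
    """Process every order still marked 'pending'; return how many this run handled."""
    handled = 0
    rows = conn.execute("SELECT id FROM orders WHERE status = 'pending'").fetchall()
    for (order_id,) in rows:
        # The status guard makes the transition idempotent: if another run
        # already handled this order, the UPDATE matches zero rows.
        cur = conn.execute(
            "UPDATE orders SET status = 'done' WHERE id = ? AND status = 'pending'",
            (order_id,),
        )
        if cur.rowcount == 1:
            handled += 1
    conn.commit()
    return handled
```

Calling process_pending_orders twice in a row is safe: the second call finds no pending rows and does nothing, which is the behavior you want from a late or duplicated trigger.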
2. Use distributed locks for singleton execution
INSERT INTO cron_lock (job_name)
VALUES ('cleanup')
ON CONFLICT DO NOTHING;
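With ON CONFLICT DO NOTHING the insert does not raise an error on conflict; it affects zero rows. If zero rows were inserted, another instance already owns the run. This requires a unique constraint on job_name, and the owner must remove the row (or let it expire) when the run finishes.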
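A sketch of how the lock could be acquired and released from application code, written in Python against SQLite for portability. The owner and expires_at columns, the lease length, and the function names are assumptions added for the example; the original table only has job_name. The caller checks the affected row count, because ON CONFLICT DO NOTHING reports a conflict as zero rows rather than as an error.

```python
import sqlite3
import time
from typing import Optional


def setup_lock_table(conn: sqlite3.Connection) -> None:
    # The PRIMARY KEY on job_name is what makes the conflict detectable.
    conn.execute(
        "CREATE TABLE cron_lock ("
        "job_name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(
    conn: sqlite3.Connection,
    job_name: str,
    owner: str,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if this owner now holds the lock for job_name."""
    now = time.time() if now is None else now
    # Drop an expired lease first so a crashed owner cannot block the job forever.
    conn.execute(
        "DELETE FROM cron_lock WHERE job_name = ? AND expires_at <= ?",
        (job_name, now),
    )
    cur = conn.execute(
        "INSERT INTO cron_lock (job_name, owner, expires_at) VALUES (?, ?, ?) "
        "ON CONFLICT (job_name) DO NOTHING",
        (job_name, owner, now + ttl_seconds),
    )
    conn.commit()
    # Zero affected rows means another instance already holds a live lease.
    return cur.rowcount == 1


def release_lock(conn: sqlite3.Connection, job_name: str, owner: str) -> None:
    # Only the current owner may release, so a slow node cannot free someone else's lock.
    conn.execute(
        "DELETE FROM cron_lock WHERE job_name = ? AND owner = ?", (job_name, owner)
    )
    conn.commit()
```

The lease here is compared against the caller's clock, which is itself subject to skew between nodes. In a shared database it is usually safer to compare against the database server's time, and to pick a lease comfortably longer than the job's worst-case runtime.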
3. Track last execution explicitly
Persist last_run_at and process deltas with WHERE updated_at > last_run_at rather
than trusting exact schedule timing. Advance last_run_at only after the batch
commits, so a failed run is retried instead of skipped.
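One way to sketch this watermark pattern in Python with sqlite3 is shown below. The items and job_state tables, the integer timestamps, and the 'sync' job name are all illustrative assumptions. Note that the watermark advances to the newest timestamp actually seen, not to the current clock, so a run that starts late still picks up everything.

```python
import sqlite3
from typing import List


def setup_sync_tables(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: changed rows plus one persisted watermark per job.
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, updated_at INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE job_state (job_name TEXT PRIMARY KEY, last_run_at INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO job_state (job_name, last_run_at) VALUES ('sync', 0)")
    conn.commit()


def sync_changes(conn: sqlite3.Connection) -> List[int]:
    """Return ids changed since the stored watermark, then advance the watermark."""
    (watermark,) = conn.execute(
        "SELECT last_run_at FROM job_state WHERE job_name = 'sync'"
    ).fetchone()
    rows = conn.execute(
        "SELECT id, updated_at FROM items WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        # Move the watermark to the newest change processed in this batch.
        newest = rows[-1][1]
        conn.execute(
            "UPDATE job_state SET last_run_at = ? WHERE job_name = 'sync'", (newest,)
        )
    # The watermark update is committed together with the batch.
    conn.commit()
    return [item_id for item_id, _ in rows]
```

A strict > comparison can miss a row written later with the same timestamp as the watermark. Depending on timestamp resolution, a real job may need >= combined with idempotent processing, or a monotonically increasing sequence column instead of a clock value.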
4. Make jobs re-entrant
Jobs must be safe to start, crash, restart, and continue without corrupting state.
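To make re-entrancy concrete, the following Python sketch commits progress one row at a time, so a crash in the middle leaves a consistent state that the next run simply continues from. The export_rows table and the fail_after parameter (used only to simulate a crash) are hypothetical.

```python
import sqlite3
from typing import Optional


def setup_export(conn: sqlite3.Connection, total_rows: int) -> None:
    # Hypothetical work table: each row carries its own completion flag.
    conn.execute(
        "CREATE TABLE export_rows (id INTEGER PRIMARY KEY, exported INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO export_rows (id) VALUES (?)",
        [(i,) for i in range(1, total_rows + 1)],
    )
    conn.commit()


def run_export(conn: sqlite3.Connection, fail_after: Optional[int] = None) -> int:
    """Export rows not yet exported; return how many this run completed.

    fail_after simulates a crash after that many rows (for demonstration only).
    """
    done = 0
    pending = conn.execute(
        "SELECT id FROM export_rows WHERE exported = 0 ORDER BY id"
    ).fetchall()
    for (row_id,) in pending:
        if fail_after is not None and done == fail_after:
            raise RuntimeError("simulated crash")
        # Commit per row so completed work survives a crash and is never redone.
        conn.execute("UPDATE export_rows SET exported = 1 WHERE id = ?", (row_id,))
        conn.commit()
        done += 1
    return done
```

If the first run crashes after two of five rows, a second call picks up the remaining three. Per-row commits trade throughput for safety; committing in small batches is a common middle ground.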
5. Avoid brittle once-per-day checks
-- Avoid
if today != last_run_day then run()
-- Prefer
process_where(processed = false)
State transitions are safer than wall-clock comparisons.
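The difference shows up when a day is skipped entirely. In the Python sketch below, each day has its own processed marker, so a run after an outage handles every missed day rather than only the current one. The in-memory set stands in for a durable table, and the function names are made up for the example.

```python
from datetime import date, timedelta
from typing import List, Set


def unprocessed_days(first_day: date, today: date, processed: Set[date]) -> List[date]:
    """Return every completed day in [first_day, today) that has no processed marker."""
    days = []
    current = first_day
    while current < today:
        if current not in processed:
            days.append(current)
        current += timedelta(days=1)
    return days


def run_daily_job(first_day: date, today: date, processed: Set[date]) -> List[date]:
    """Handle all outstanding days and record a marker for each one."""
    handled = unprocessed_days(first_day, today, processed)
    for day in handled:
        # Stand-in for the real work plus a durable "processed" record.
        processed.add(day)
    return handled
```

With a check like today != last_run_day, a job that was down for three days would run once and silently skip the other two. Here the run is driven by which days lack a marker, so running it late, or twice, gives the same final state.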
6. Detect and recover missed work
Always scan and reconcile pending/unprocessed state. Never assume skipped cron means no work exists.
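A reconciliation pass might look like the Python sketch below. It assumes each job records a status and a started_at timestamp, and that a job stuck in 'in_progress' longer than some timeout was abandoned by a crashed worker; the Job type, status names, and timeout are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: int
    status: str  # 'pending', 'in_progress', or 'done'
    started_at: Optional[float] = None


def reconcile(jobs: List[Job], now: float, timeout_seconds: float) -> List[int]:
    """Reset stale in-progress jobs to pending and return ids that need processing."""
    to_run = []
    for job in jobs:
        stale = (
            job.status == "in_progress"
            and job.started_at is not None
            and now - job.started_at > timeout_seconds
        )
        if stale:
            # The worker that claimed this job presumably died; make it eligible again.
            job.status = "pending"
            job.started_at = None
        if job.status == "pending":
            to_run.append(job.id)
    return to_run
```

Resetting a stale job is only safe if processing is idempotent, since the original worker may have finished part of the work, or may still be running slowly.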
7. Separate scheduling from execution
Cron should trigger a check, not perform all work directly. Keep correctness in execution logic.
Pattern That Works in Production
while true do
    jobs = fetch_unprocessed_jobs()
    for job in jobs do
        process_job(job)
    end
    sleep(60)
end
In this model, correctness comes from state tracking and idempotent processing, not perfect timing.
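The loop above could be written in Python roughly as follows. The fetch_unprocessed_jobs and process_job callables are supplied by the caller; the max_cycles and sleep parameters are additions for this sketch so the loop can be stopped and tested, and are not part of the original pattern.

```python
import time
from typing import Callable, List, Optional


def run_worker(
    fetch_unprocessed_jobs: Callable[[], List[str]],
    process_job: Callable[[str], None],
    poll_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll for unprocessed jobs and process them until max_cycles is reached.

    With max_cycles=None the loop runs forever, like the pseudocode version.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for job in fetch_unprocessed_jobs():
            try:
                process_job(job)
            except Exception as exc:
                # Leave the job unprocessed so the next cycle retries it.
                print(f"job {job!r} failed, will retry: {exc}")
        cycles += 1
        sleep(poll_seconds)
```

A failed job is not marked as processed, so the next cycle sees it again. That only works if process_job is idempotent and marks a job as processed only after its work is complete.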
What Not To Do
- Do not depend on exact cron timing.
- Do not assume jobs always run.
- Do not assume jobs run only once.
- Do not tie business correctness directly to wall-clock checks.
The Mental Shift
Stop asking: Did it run at the exact right time?
Start asking: If it runs late, early, or twice, is the result still correct?
Reliable systems are state-driven, idempotent, and tolerant of timing chaos.