Why Your Cron Jobs Drift and Eventually Break Everything

You set up a cron job and it looks perfect: runs every 5 minutes, processes pending work, cleans old data, and syncs external systems.

Then production gets weird. Jobs run twice, runs get skipped, and daily tasks no longer align with actual days. Nothing crashes, but your system slowly drifts away from reality.

Cron failures are usually silent, gradual, and expensive.

The Problem: Time Is Not Reliable

Cron assumes scheduled time is reliable. Real systems break that assumption with restarts, clock drift, crashes, rolling deploys, and multiple instances executing the same schedule.

Cron is a best-effort trigger, not a correctness guarantee.

Where It Breaks in Real Systems

1. Missed executions

If the server is down or restarting at the moment a job should fire, standard cron does not replay that run later. The run is simply lost, with no automatic recovery.

2. Duplicate runs in horizontal scale

Multiple app instances all run the same `*/5 * * * *` schedule, so side effects execute multiple times.

3. Overlapping executions

A job that takes 6 minutes but is scheduled every 5 starts again before the previous run has finished, causing contention and duplicate writes.

4. Clock drift

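Node clocks that differ even slightly make the same schedule fire at different real moments on each machine. Unless the clocks are kept in sync, the gap grows over time and trigger timing becomes inconsistent.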

5. Time-based logic rot

Logic tied to `now() >= scheduled_time` breaks under timezone changes, DST shifts, and clock correction.

The Core Mistake

Treating time as reliable truth instead of an unreliable hint is what causes cron drift outages.

The Fix: Build Systems That Survive Imperfect Timing

1. Drive jobs from state, not schedule ticks

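-- Bad: correctness depends on every tick firing exactly once
run_every_5_minutes()
process_pending_orders()

-- Good: each run selects work by state, whenever it happens to run
process_orders_where(status = 'pending')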

If one run is missed, the next run catches up. If a run is duplicated, idempotency protects the outcome.
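Here is a minimal sketch of that idea in Python, using an in-memory SQLite database as a stand-in for your real datastore. The `orders` table, its `status` values, and the helper names are assumptions made up for illustration, not part of any particular system.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with three pending orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO orders (status) VALUES (?)", [("pending",)] * 3
    )
    conn.commit()
    return conn


def process_pending_orders(conn: sqlite3.Connection) -> int:
    """Process every order still marked 'pending'; return how many changed."""
    rows = conn.execute(
        "SELECT id FROM orders WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    handled = 0
    for (order_id,) in rows:
        # The status guard makes the update idempotent: a late or duplicate
        # run finds nothing left to change for this order.
        cur = conn.execute(
            "UPDATE orders SET status = 'processed' "
            "WHERE id = ? AND status = 'pending'",
            (order_id,),
        )
        handled += cur.rowcount
    conn.commit()
    return handled


if __name__ == "__main__":
    conn = make_demo_db()
    print(process_pending_orders(conn))  # first run works through the backlog
    print(process_pending_orders(conn))  # a duplicate run has nothing to do
```

The function never asks what time it is. It only asks which rows still need work, so a skipped tick just means a larger batch next time.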

2. Use distributed locks for singleton execution

INSERT INTO cron_lock (job_name)
VALUES ('cleanup')
ON CONFLICT DO NOTHING;

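With ON CONFLICT DO NOTHING the insert does not raise an error. If it affects zero rows, another instance already owns the run. This assumes a unique constraint on job_name, and the lock row must be deleted or expired when the run ends, otherwise the job never runs again.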
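A rough sketch of the acquire and release cycle might look like the following. It uses SQLite only so the example runs on its own; in practice the table would live in a database shared by all instances. The `locked_at` column, the 10-minute expiry, and the function names are assumptions added for illustration.

```python
import sqlite3
import time

LOCK_TTL_SECONDS = 600  # assumed upper bound on how long one run may take


def create_lock_table(conn: sqlite3.Connection) -> None:
    # The primary key on job_name is what makes the conflicting insert a no-op.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cron_lock ("
        " job_name TEXT PRIMARY KEY,"
        " locked_at REAL NOT NULL)"
    )


def try_acquire(conn: sqlite3.Connection, job_name: str, now=None) -> bool:
    """Return True if this caller now owns the lock for job_name."""
    now = time.time() if now is None else now
    # Clear a lock left behind by a holder that crashed before releasing it.
    conn.execute(
        "DELETE FROM cron_lock WHERE job_name = ? AND locked_at < ?",
        (job_name, now - LOCK_TTL_SECONDS),
    )
    cur = conn.execute(
        "INSERT INTO cron_lock (job_name, locked_at) VALUES (?, ?) "
        "ON CONFLICT (job_name) DO NOTHING",
        (job_name, now),
    )
    conn.commit()
    # One row inserted means we won; zero rows means someone else holds it.
    return cur.rowcount == 1


def release(conn: sqlite3.Connection, job_name: str) -> None:
    """Give the lock back so the next scheduled run can take it."""
    conn.execute("DELETE FROM cron_lock WHERE job_name = ?", (job_name,))
    conn.commit()
```

Call `release` in a `finally` block around the job body. The expiry is only a safety net for crashes, and it should be comfortably longer than the slowest run you expect.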

3. Track last execution explicitly

Persist `last_run_at` and process deltas with `WHERE updated_at > last_run_at` rather than trusting exact schedule timing.
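One way to sketch this, again with SQLite as a stand-in: the `items` and `job_state` tables and the `handle` callback are hypothetical names chosen for the example. Note one design choice that goes slightly beyond the text above: the watermark advances to the newest `updated_at` actually seen, not to the current wall-clock time, so a slow or late run cannot skip rows.

```python
import sqlite3


def create_sync_tables(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, updated_at INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE job_state ("
        " job_name TEXT PRIMARY KEY,"
        " last_run_at INTEGER NOT NULL)"
    )


def process_changes(conn: sqlite3.Connection, job_name: str, handle) -> int:
    """Handle every item changed since the stored watermark; return the count."""
    row = conn.execute(
        "SELECT last_run_at FROM job_state WHERE job_name = ?", (job_name,)
    ).fetchone()
    last_run_at = row[0] if row else 0  # first run: process everything

    rows = conn.execute(
        "SELECT id, updated_at FROM items "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run_at,),
    ).fetchall()
    for item_id, _updated_at in rows:
        handle(item_id)

    if rows:
        # Move the watermark only after the batch succeeded, and only as far
        # as the newest row we actually processed.
        conn.execute(
            "INSERT INTO job_state (job_name, last_run_at) VALUES (?, ?) "
            "ON CONFLICT (job_name) DO UPDATE "
            "SET last_run_at = excluded.last_run_at",
            (job_name, rows[-1][1]),
        )
        conn.commit()
    return len(rows)
```

If `handle` raises, the watermark stays where it was and the next run retries the same batch, which is why `handle` itself should be idempotent.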

4. Make jobs re-entrant

Jobs must be safe to start, crash, restart, and continue without corrupting state.
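As a small illustration, a job can record its progress in a checkpoint and resume from it after a crash. The JSON checkpoint file, its `next_index` field, and the function names below are assumptions for this sketch. A real job might keep the checkpoint in the same database as its data.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint yet)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    # Write to a temp file and rename it into place, so a crash mid-write
    # never leaves a half-written checkpoint behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_job(items: list, path: str, handle) -> int:
    """Process items from the checkpoint onward; return how many were handled."""
    start = load_checkpoint(path)
    for i in range(start, len(items)):
        handle(items[i])
        # Only record progress after the item finished successfully.
        save_checkpoint(path, i + 1)
    return len(items) - start
```

An item that crashes halfway is retried on the next start, so this gives at-least-once processing. The per-item work still has to tolerate being repeated.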

5. Avoid brittle once-per-day checks

-- Avoid
if today != last_run_day then run()

-- Prefer
process_where(processed = false)

State transitions are safer than wall-clock comparisons.

6. Detect and recover missed work

Always scan and reconcile pending/unprocessed state. Never assume skipped cron means no work exists.
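A reconciliation pass can be as simple as the sketch below. It assumes a hypothetical `tasks` table where workers set `status = 'in_progress'` and a `claimed_at` timestamp when they pick up a task, and it treats anything claimed more than 15 minutes ago as abandoned. Both the schema and the threshold are illustrative assumptions.

```python
import sqlite3

STUCK_AFTER_SECONDS = 900  # assumed: no healthy task stays claimed this long


def create_task_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE tasks ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " claimed_at REAL)"
    )


def requeue_stuck(conn: sqlite3.Connection, now: float) -> int:
    """Put tasks abandoned by a crashed worker back into the pending pool."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'pending', claimed_at = NULL "
        "WHERE status = 'in_progress' AND claimed_at < ?",
        (now - STUCK_AFTER_SECONDS,),
    )
    conn.commit()
    return cur.rowcount


def count_backlog(conn: sqlite3.Connection) -> int:
    """Number of tasks waiting for work, regardless of which run missed them."""
    return conn.execute(
        "SELECT COUNT(*) FROM tasks WHERE status = 'pending'"
    ).fetchone()[0]
```

Running `requeue_stuck` at the start of every cycle, and alerting when `count_backlog` keeps growing, turns a silent skipped run into something you can see.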

7. Separate scheduling from execution

Cron should trigger a check, not perform all work directly. Keep correctness in execution logic.

Pattern That Works in Production

while true do
  jobs = fetch_unprocessed_jobs()

  for job in jobs do
    process_job(job)
  end

  sleep(60)
end

In this model, correctness comes from state tracking and idempotent processing, not perfect timing.
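For readers who want something runnable, here is one possible Python rendering of that loop. The `fetch_unprocessed_jobs` and `process_job` callables come from the pseudocode above and are supplied by the caller. The `max_cycles` and `sleep` parameters are additions for this sketch so the loop can be stopped and tested, and the per-job `try`/`except` is an assumption about how you might keep one bad job from blocking the rest.

```python
import time
from typing import Any, Callable, Iterable, Optional


def run_worker(
    fetch_unprocessed_jobs: Callable[[], Iterable[Any]],
    process_job: Callable[[Any], None],
    poll_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll for unprocessed jobs forever (or max_cycles times); return failures."""
    failures = 0
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for job in fetch_unprocessed_jobs():
            try:
                process_job(job)
            except Exception:
                # Leave the job unprocessed; the next cycle fetches it again.
                failures += 1
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(poll_seconds)
    return failures
```

A failed job is not marked as done, so it shows up again in the next fetch. That retry behavior comes from the state, not from the schedule.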

The Mental Shift

Stop asking: Did it run at the exact right time?

Start asking: If it runs late, early, or twice, is the result still correct?

Reliable systems are state-driven, idempotent, and tolerant of timing chaos.