Retries Are Not Safe: Why "Just Try Again" Eventually Breaks Your System
Something fails, so you add retries. It looks harmless: loop a few attempts and move on.
In production, unmanaged retries can trigger retry storms, queue blowups, and cascading outages. Your system starts attacking itself when it is already weak.
Retries can increase reliability, or multiply failure. Design decides which one.
The Problem: Retries Amplify Pressure
Retries do not just replay success paths. They replay overloaded paths, partial failures, and ambiguous outcomes, exactly when downstream systems need less traffic.
Where It Breaks in Real Systems
1. Retry storms
A failing dependency causes every caller to retry several times, and every layer above it does the same. Traffic multiplies geometrically through the stack, exactly when the dependency can handle the least.
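A back-of-the-envelope sketch of the amplification (the function name is illustrative): with r retry attempts per call at each of d dependency layers, one user request can fan out into (1 + r)^d calls at the bottom layer when everything is failing.

```python
def worst_case_calls(retries_per_call: int, depth: int) -> int:
    # Each layer turns 1 incoming call into (1 + retries) outgoing calls.
    return (1 + retries_per_call) ** depth

# 3 retries across 3 layers: one request becomes 64 calls at the bottom.
print(worst_case_calls(3, 3))  # → 64
```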
2. Retrying non-retryable failures
Validation errors and business-rule failures do not recover with retries. Repeating them wastes compute and pollutes logs.
3. Duplicate side effects
Retries can execute the same operation more than once: duplicate charges, duplicate emails, duplicate mutations.
4. Timeout ambiguity
A request times out, the client retries, but the original request still completes on the server. Both paths succeed, and the side effects duplicate.
5. Queue explosion loops
Failed jobs retry aggressively, backlog grows, latency rises, more jobs time out, and retries increase again.
The Core Mistake
Treating retries as a harmless fallback instead of a high-risk traffic multiplier.
The Fix: Controlled and Bounded Retries
1. Retry only retryable failures
Retry transient conditions like timeouts, 429s, and temporary unavailability.
Do not retry validation errors, missing resources, or business logic rejections.
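A minimal classifier sketch over HTTP status codes (the function name and the exact set are illustrative choices, not a prescribed list):

```python
# Transient failures worth retrying: timeouts, rate limits,
# and temporary server-side unavailability.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}

def is_retryable(status_code: int) -> bool:
    if status_code in RETRYABLE_STATUS:
        return True
    # Permanent failures: validation errors (400), missing resources (404),
    # business-rule rejections (422) will fail the same way every time.
    return False
```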
2. Enforce idempotency
-- Bad
charge_user(user_id, 100)
-- Good
charge_user_if_not_processed(payment_id)
Back the check with a database constraint such as UNIQUE(payment_id), so duplicates fail at the data layer rather than in application logic.
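A sketch of the idempotent version, using an in-memory set to stand in for a table with UNIQUE(payment_id) (all names here are illustrative):

```python
processed: set[str] = set()  # stand-in for a UNIQUE(payment_id) table

def charge_user_if_not_processed(payment_id: str, amount: int) -> bool:
    if payment_id in processed:
        # A retry of an already-applied charge: do nothing.
        return False
    processed.add(payment_id)  # in SQL: INSERT ... ON CONFLICT DO NOTHING
    # ... perform the actual charge of `amount` here ...
    return True
```

The first attempt applies the charge; any retry with the same payment_id becomes a no-op instead of a duplicate charge.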
3. Use exponential backoff
1s -> 2s -> 4s -> 8s
Immediate retries amplify load. Backoff gives dependencies recovery time.
4. Add jitter
delay = base * random(0.5, 1.5)
Jitter prevents synchronized retry spikes.
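Backoff and jitter combine naturally into one delay function. A minimal sketch, assuming a delay cap so later attempts do not grow without bound:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff (1s, 2s, 4s, ...) scaled by +/-50% jitter."""
    delay = min(cap, base * (2 ** attempt))  # attempt 0 -> 1s, 1 -> 2s, ...
    return delay * random.uniform(0.5, 1.5)
```

Because each client draws its own random factor, clients that failed at the same instant spread their retries out instead of hammering the dependency in synchronized waves.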
5. Cap retries hard
Set max attempts and time windows. After limits, mark failed and route to a dead-letter queue.
6. Track retry state explicitly
Persist retry_count, last_attempt_at, and status for visibility and recovery.
7. Use dead-letter queues
Failed items should be inspectable and recoverable, never silently dropped.
8. Protect downstream with circuit breakers
When a dependency is failing, reduce or pause retry pressure to avoid making the outage worse.
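A minimal circuit-breaker sketch, assuming a simple policy: open after N consecutive failures, refuse calls for a cooldown window, then allow a trial request through (thresholds and method names are illustrative):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None  # None means circuit is closed

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: allow a trial request
        return False     # open: shed load, stop retry pressure

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

While the circuit is open, callers fail fast instead of retrying into a dependency that is already down, which is what gives it room to recover.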
Pattern That Works in Production
Attempt -> classify failure -> schedule retry with backoff+jitter -> cap -> DLQ.
This gives control, observability, and graceful degradation under failure.
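The pipeline above can be sketched as a single worker function. All names here are illustrative: `attempt_once` returns None on success or a failure-kind string, and `dead_letter` receives exhausted or non-retryable jobs.

```python
import random

RETRYABLE = {"timeout", "rate_limited", "unavailable"}
MAX_ATTEMPTS = 5

def process(job, attempt_once, dead_letter):
    for attempt in range(MAX_ATTEMPTS):
        failure = attempt_once(job)
        if failure is None:
            return "succeeded"
        if failure not in RETRYABLE:
            # Permanent failure: retrying would repeat the same error.
            dead_letter(job, reason=failure)
            return "dead_lettered"
        delay = (2 ** attempt) * random.uniform(0.5, 1.5)
        # A real worker would sleep or reschedule the job for `delay` seconds.
    dead_letter(job, reason="retries_exhausted")
    return "dead_lettered"
```

Every path terminates: success returns immediately, permanent failures go straight to the dead-letter queue, and transient failures get bounded, jittered retries before they do too.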
What Not To Do
- Do not retry everything.
- Do not retry immediately.
- Do not retry side-effecting calls without idempotency.
- Do not retry forever.
The Mental Shift
Stop thinking: If it fails, just try again.
Start thinking: Why did it fail, and is retry safe right now?
Stable systems retry intentionally, bound aggressively, and fail visibly.