A team's "automatic retry on failure" pattern caused a customer-facing incident. The LLM's first call had succeeded, but the team's downstream connection timed out. The retry produced a duplicate output, and the user received the same email twice, compounding the original failure.
Naive retries make failures worse. Retries engineered with idempotency, backoff, and budgets make systems more reliable. The difference is engineering discipline.
Exponential backoff with sense
The pattern that survives:
- First retry: short delay (~100ms).
- Subsequent retries: exponential backoff with jitter.
- Maximum retries: defined.
- Maximum total time: defined.
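That schedule can be sketched as a small delay function. This is a minimal sketch, not a specific library's API; `backoff_delay`, `base`, and `cap` are illustrative names, and it uses "full jitter" (a uniformly random delay up to the exponential bound):

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay in seconds before retry `attempt` (1-based).

    First retry: up to ~100ms. Later retries: exponential growth,
    capped at `cap`, with full jitter to spread out retrying clients.
    """
    bound = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, bound)
```

The jitter matters as much as the exponent: without it, every client that failed at the same moment retries at the same moment, recreating the spike.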
Without backoff, retries hammer a provider that is already struggling, making its outage worse for everyone. With sensible backoff, retries stay friendly to the provider.
Idempotency
For LLM calls with side effects (any tool call that changes state), idempotency keys:
- Each logical operation has a unique key.
- Provider deduplicates on the key.
- Retries return the original result.
LLM calls themselves are usually idempotent. Tool calls are usually not. The retry policy must distinguish.
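The deduplication side of this can be sketched in a few lines. The in-memory dict and the `execute_once` name are hypothetical stand-ins; a real system would use a shared, persistent store with expiry:

```python
# Hypothetical in-memory dedup store; a real one would be shared and persistent.
_results: dict[str, object] = {}

def execute_once(key: str, operation):
    """Run `operation` at most once per idempotency key.

    A retry with the same key returns the original result instead of
    performing the side effect again.
    """
    if key in _results:
        return _results[key]   # retry of a completed call: no duplicate side effect
    result = operation()       # first attempt: actually perform the operation
    _results[key] = result
    return result
```

The key is attached to the logical operation ("send welcome email to user 42"), not to the attempt, so every retry of the same operation carries the same key.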
Retry budgets
Every operation has a retry budget:
- Per-operation: max 3 retries, max 10 seconds total.
- Per-task: max 5 total retry events.
- Per-user: rate-limited.
Without budgets, an unstable provider can cascade into runaway retry storms. With them, the cost is bounded.
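A per-task budget can be as simple as a counter with a hard ceiling. This is an illustrative sketch (the `RetryBudget` class is hypothetical, not from the team's codebase):

```python
class RetryBudget:
    """Caps total retry events for one task; past the cap, fail fast."""

    def __init__(self, max_retries: int = 5):
        self.max_retries = max_retries
        self.used = 0

    def allow(self) -> bool:
        """Consume one retry from the budget; False means stop retrying."""
        if self.used >= self.max_retries:
            return False       # budget exhausted: surface the failure instead
        self.used += 1
        return True
```

The per-operation and per-user limits compose the same way: each is just another counter checked before any retry fires.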
Reviewer signal
When retries spike, the team sees:
- Which operation is retrying.
- What error type is causing retry.
- Whether retries are eventually succeeding.
If retries spike and fail, the upstream issue surfaces. If they spike and succeed, the upstream is unstable but recoverable. Both are useful signals.
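The minimum instrumentation for that signal is two counters: retries tagged by operation and error type, and eventual outcomes tagged by operation. A sketch, assuming a metrics system this simple (real systems would emit to Prometheus, StatsD, or similar):

```python
from collections import Counter

# Hypothetical metric stores: (operation, error_type) -> retry count,
# and (operation, eventually_succeeded) -> outcome count.
retry_counts: Counter = Counter()
retry_outcomes: Counter = Counter()

def record_retry(operation: str, error_type: str) -> None:
    """Tag every retry event with what is retrying and why."""
    retry_counts[(operation, error_type)] += 1

def record_outcome(operation: str, succeeded: bool) -> None:
    """Record whether the retried operation eventually succeeded."""
    retry_outcomes[(operation, succeeded)] += 1
```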
A real retry library
A team's retry pattern, abstracted:
@retry(
    max_attempts=3,
    backoff_seconds=lambda n: min(2 ** n, 10) + random.uniform(0, 1),
    retry_on=(ProviderError, TimeoutError),
    idempotency_key=lambda req: req.idempotency_key,
)
def llm_call(req):
    ...
Used consistently across the codebase. Easy to read, easy to verify. Sensible defaults.
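One way a decorator with that shape could be implemented, as a sketch rather than the team's actual library (the parameter names match the usage above; the idempotency cache here is in-memory only):

```python
import functools
import random
import time

def retry(max_attempts=3, backoff_seconds=None, retry_on=(Exception,),
          idempotency_key=None):
    """Hypothetical retry decorator: bounded attempts, jittered backoff,
    and per-key deduplication for side-effecting calls."""
    if backoff_seconds is None:
        backoff_seconds = lambda n: min(2 ** n, 10) + random.uniform(0, 1)
    cache: dict = {}  # idempotency key -> first successful result

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(req):
            key = idempotency_key(req) if idempotency_key else None
            if key is not None and key in cache:
                return cache[key]          # retry of a completed call: no duplicate
            for attempt in range(1, max_attempts + 1):
                try:
                    result = fn(req)
                    if key is not None:
                        cache[key] = result
                    return result
                except retry_on:
                    if attempt == max_attempts:
                        raise              # attempts exhausted: surface the error
                    time.sleep(backoff_seconds(attempt))
        return wrapper
    return decorator
```

Keeping the policy in one decorator means the budget, the backoff curve, and the dedup behavior are reviewed once, not re-derived at every call site.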
What we won't ship
Open-ended retries without budgets.
Retries on operations with side effects without idempotency keys.
Retries that don't change anything. Same prompt, same model, same temperature → same result. If the failure was the LLM's output, change something or surface the failure instead of retrying.
Hidden retries that aren't visible in observability.
Close
Retry strategies are engineering tools. Backoff prevents cascades. Idempotency prevents duplicates. Budgets prevent runaway cost. Without these, retries make the problem worse. With them, retries are part of a reliable system.
Related reading
- Tool failure modes — companion engineering.
- Cost guardrails — budget discipline.
- Probabilistic with deterministic contracts — surrounding pattern.
We build AI-enabled software and help businesses put AI to work. If you're tightening retry discipline, we'd love to hear about it. Get in touch.