Jaypore Labs
Engineering

Retry strategies that don't compound errors

Naive retries make failures worse. Engineered retries with idempotency, backoff, and budgets are reliability tools.

Yash Shah · April 30, 2026 · 3 min read

A team's "automatic retry on failure" pattern caused a customer-facing incident. The LLM's first call had succeeded, but the team's downstream connection had timed out. The retry produced a duplicate output: the user saw the same email twice, compounding the original failure.

Naive retries make failures worse. Engineered retries, with idempotency, backoff, and budgets, are reliability tools. The difference is engineering discipline.

Exponential backoff with sense

The pattern that survives:

  • First retry: short delay (~100ms).
  • Subsequent retries: exponential backoff with jitter.
  • Maximum retries: defined.
  • Maximum total time: defined.
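
The pattern above can be sketched in a few lines. This is an illustrative helper, not a real library API; the base delay, cap, and full-jitter choice are the assumptions named in the list.

```python
import random

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Delay before retry `attempt` (1-based): exponential growth from a
    ~100ms base, capped, with full jitter so retrying clients spread out."""
    exp = min(cap, base * (2 ** (attempt - 1)))
    return random.uniform(0, exp)
```

Full jitter (a uniform draw between zero and the exponential ceiling) is what keeps a fleet of clients from retrying in lockstep.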

Without backoff, retries hammer a provider that is already struggling, making the outage worse for everyone. With sensible backoff, retries are good citizens.

Idempotency

For LLM calls with side effects (any tool call that changes state), use idempotency keys:

  • Each logical operation has a unique key.
  • Provider deduplicates on the key.
  • Retries return the original result.

LLM calls themselves are usually idempotent. Tool calls are usually not. The retry policy must distinguish.
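
A minimal sketch of the dedup side, assuming an in-memory cache and a hypothetical `run_tool` wrapper (a real system would persist keys and handle expiry):

```python
# Hypothetical sketch: dedupe side-effecting tool calls on an idempotency key,
# so a retry returns the original result instead of re-executing the effect.
_results: dict[str, object] = {}

def run_tool(key: str, tool, *args):
    if key in _results:          # retry of an operation that already ran
        return _results[key]
    result = tool(*args)         # first execution: perform the side effect
    _results[key] = result
    return result
```

The key is attached to the logical operation ("send this email once"), not to the request attempt, which is what makes the retry safe.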

Retry budgets

Every operation has a retry budget:

  • Per-operation: max 3 retries, max 10 seconds total.
  • Per-task: max 5 total retry events.
  • Per-user: rate-limited.

Without budgets, an unstable provider can cascade into runaway retry storms. With them, the cost is bounded.
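
One way to make that bound concrete is a small budget object consulted before every retry. This is a sketch under the per-operation limits listed above; the class name and API are illustrative:

```python
import time

class RetryBudget:
    """Caps total retry events and wall-clock time for one operation,
    so an unstable provider cannot cascade into a retry storm."""

    def __init__(self, max_retries: int = 3, max_seconds: float = 10.0):
        self.max_retries = max_retries
        self.deadline = time.monotonic() + max_seconds
        self.used = 0

    def allow(self) -> bool:
        # Refuse once either the count or the time budget is spent.
        if self.used >= self.max_retries or time.monotonic() >= self.deadline:
            return False
        self.used += 1
        return True
```

Per-task and per-user budgets compose the same way: a shared counter consulted alongside the per-operation one.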

Reviewer signal

When retries spike, the team sees:

  • Which operation is retrying.
  • What error type is causing the retries.
  • Whether retries are eventually succeeding.

If retries spike and fail, the upstream issue surfaces. If they spike and succeed, the upstream is unstable but recoverable. Both are useful signals.
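One structured event per retry is enough to answer all three questions. A sketch, with illustrative field names:

```python
import json
import logging

log = logging.getLogger("retries")

def retry_event(operation: str, error_type: str, attempt: int, succeeded: bool) -> dict:
    """Emit one structured log line per retry so dashboards can group by
    operation, error type, and eventual outcome."""
    event = {
        "operation": operation,
        "error_type": error_type,
        "attempt": attempt,
        "succeeded": succeeded,
    }
    log.info(json.dumps(event))
    return event
```

Grouping these events by `operation` answers "which operation is retrying"; grouping by `succeeded` separates recoverable instability from a hard upstream failure.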

A real retry library

A team's retry pattern, abstracted:

import random

@retry(
    max_attempts=3,
    backoff_seconds=lambda n: min(2 ** n, 10) + random.uniform(0, 1),
    retry_on=(ProviderError, TimeoutError),
    idempotency_key=lambda req: req.idempotency_key,
)
def llm_call(req):
    ...

Used consistently across the codebase. Easy to read, easy to verify. Sensible defaults.
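
The decorator above is abstracted; a minimal implementation matching those parameters might look like this. It is a sketch, assuming an in-memory idempotency cache and synchronous calls:

```python
import functools
import time

def retry(max_attempts, backoff_seconds, retry_on, idempotency_key=None):
    """Retry on the listed exceptions, sleeping backoff_seconds(attempt)
    between tries; results for a seen idempotency key are returned from
    cache instead of re-invoking the call."""
    cache = {}

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(req):
            key = idempotency_key(req) if idempotency_key else None
            if key is not None and key in cache:
                return cache[key]            # retry of a completed operation
            for attempt in range(1, max_attempts + 1):
                try:
                    result = fn(req)
                except retry_on:
                    if attempt == max_attempts:
                        raise                # budget exhausted: surface it
                    time.sleep(backoff_seconds(attempt))
                else:
                    if key is not None:
                        cache[key] = result
                    return result
        return wrapper
    return decorator
```

A real version would add the total-time cap and the shared budgets described above; the shape of the interface is the point.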

What we won't ship

  • Open-ended retries without budgets.
  • Retries on operations with side effects without idempotency keys.
  • Retries that don't change anything. Same prompt, same model, same temperature → same result. If the failure was the LLM's output, change something or surface the failure.
  • Hidden retries that aren't visible in observability.

Close

Retry strategies are engineering tools. Backoff prevents cascades. Idempotency prevents duplicates. Budgets prevent runaway cost. Without these, retries make the problem worse. With them, retries are part of a reliable system.

We build AI-enabled software and help businesses put AI to work. If you're tightening retry discipline, we'd love to hear about it. Get in touch.

Tagged
LLM · Retry · Engineering · Predictable Output · Reliability