A team's customer-support agent carried a 4,000-token static prefix: system prompt, tool definitions, and few-shot examples. Every call paid for the same 4,000 tokens of input. With prompt caching, the cost of that prefix dropped to 10% of the original, and latency dropped meaningfully too.
Caching deterministic prefixes is the cheapest meaningful optimisation in LLM ops. Most teams could be using it and aren't.
Where caching wins
The cache wins when:
- The same prefix is used repeatedly.
- The prefix is large (>1K tokens).
- The provider supports prompt caching (Anthropic, OpenAI, others).
- The cache TTL fits the use case.
System prompts, tool definitions, document context that's reused across queries — all good candidates.
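A minimal sketch of what "caching the prefix" looks like in a request. Field names follow Anthropic's Messages API (`cache_control` on a system block marks the prefix up to that point as cacheable); the model name is a placeholder, and other providers (e.g. OpenAI) cache prefixes automatically without an explicit marker, so check your provider's docs.

```python
def build_request(system_prompt: str, tools: list[dict], user_msg: str) -> dict:
    """Assemble a request whose static prefix (tools + system prompt)
    is marked cacheable, while the per-request user turn stays uncached."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1024,
        "tools": tools,
        # Static, identical across calls -> cacheable. The cache_control
        # marker covers everything up to and including this block.
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Per-request content comes after the cached prefix.
        "messages": [{"role": "user", "content": user_msg}],
    }

req = build_request("You are a support agent...", [], "Where is my order?")
```

The key design point: everything above the `cache_control` marker must be identical on every call, and everything that varies per request must come after it.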
TTL design
Cache TTLs are typically 5 minutes to a few hours. The trade-off:
- Longer TTL: better hit rate, especially when traffic is sporadic; some providers charge a higher per-write premium for longer TTLs.
- Shorter TTL: cheaper cache writes, but the entry expires between calls if traffic is infrequent, so the hit rate drops.
For most applications, 1 hour is reasonable.
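The TTL trade-off above can be made concrete with a small simulation: a call hits the cache if it arrives within the TTL of the previous call (assuming, as with most providers, that each hit refreshes the entry). This is a sketch, not a model of any specific provider's eviction policy.

```python
def hit_rate(call_times_s: list[float], ttl_s: float) -> float:
    """Fraction of calls after the first that land within ttl_s
    of the previous call (each hit or write refreshes the entry)."""
    if len(call_times_s) < 2:
        return 0.0
    hits = sum(
        1 for prev, cur in zip(call_times_s, call_times_s[1:])
        if cur - prev <= ttl_s
    )
    return hits / (len(call_times_s) - 1)

# Calls arriving every 10 minutes: a 5-minute TTL misses every time,
# a 1-hour TTL hits every time.
calls = [i * 600.0 for i in range(7)]  # timestamps in seconds
hit_rate(calls, 300.0)   # -> 0.0
hit_rate(calls, 3600.0)  # -> 1.0
```

Running this against your actual call timestamps per feature is a quick way to sanity-check whether a 5-minute or 1-hour TTL fits your traffic.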
Invalidation
When the prefix changes (prompt update, tool update), the cache invalidates automatically (different prefix → cache miss → new cache entry). The team doesn't have to manage this explicitly.
But: ensure the prefix is bit-stable. Even small variations (extra whitespace, reordered tool definitions, a timestamp embedded in the system prompt) produce a different prefix and defeat the cache.
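One way to guard bit-stability is to canonicalise the prefix before sending and log a fingerprint of it; drift then shows up as a new hash before it shows up as a billing surprise. A sketch, assuming tools are dicts with a `name` key:

```python
import hashlib
import json

def prefix_fingerprint(system_prompt: str, tools: list[dict]) -> str:
    """Canonicalise the static prefix and hash it, so byte-level drift
    (whitespace, tool ordering, dict key ordering) is detectable."""
    canonical = json.dumps(
        {
            "system": system_prompt.strip(),
            # Sort tools by name so ordering can't vary call to call.
            "tools": sorted(tools, key=lambda t: t.get("name", "")),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

# Same content, different ordering and trailing whitespace -> same fingerprint.
a = prefix_fingerprint("You are a support agent.", [{"name": "b"}, {"name": "a"}])
b = prefix_fingerprint("You are a support agent.\n", [{"name": "a"}, {"name": "b"}])
```

Important caveat: the canonical form must be what you actually send; fingerprinting one form while sending another defeats the purpose.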
Reviewer ritual
Treat cache hit rate as a first-class operational metric:
- Tracked per feature.
- Reviewed weekly.
- Investigated when it drops (might indicate prefix variations or volume changes).
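Aggregating per-feature hit rates from usage logs can be as simple as the sketch below. The record field names (`feature`, `cache_read_tokens`, `input_tokens`) are assumptions; map them from whatever your provider's usage object actually returns.

```python
from collections import defaultdict

def hit_rates_by_feature(calls: list[dict]) -> dict[str, float]:
    """Per-feature cache hit rate: cached input tokens as a fraction
    of all input tokens, aggregated over the given call records."""
    read: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for c in calls:
        read[c["feature"]] += c["cache_read_tokens"]
        total[c["feature"]] += c["input_tokens"]
    return {f: read[f] / total[f] for f in total if total[f]}

logs = [
    {"feature": "support", "cache_read_tokens": 4000, "input_tokens": 4200},
    {"feature": "support", "cache_read_tokens": 0, "input_tokens": 4200},
]
rates = hit_rates_by_feature(logs)
# support: 4000 / 8400 ≈ 0.48 -- worth investigating if it was ~0.95 last week.
```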
A real saving
A team's high-volume agent:
- Pre-cache: 4,000-token prompt × 100K calls/day = $X/day in input tokens.
- Post-cache: 90% cache hit, cached tokens at 10% cost = $0.19X/day.
- Net saving: ~80% on input tokens for that feature.
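The arithmetic behind those numbers, as a sketch. The per-token price is a placeholder, and cache-write premiums (some providers bill writes above the base rate) are ignored for simplicity.

```python
def daily_input_cost(tokens_per_call: int, calls_per_day: int,
                     price_per_token: float, hit_rate: float = 0.0,
                     cached_discount: float = 0.10) -> float:
    """Daily input-token spend, with cached tokens billed at
    cached_discount of the base price (cache-write premium ignored)."""
    full = tokens_per_call * calls_per_day * price_per_token
    return full * ((1 - hit_rate) + hit_rate * cached_discount)

X = daily_input_cost(4_000, 100_000, 3e-6)          # pre-cache baseline
post = daily_input_cost(4_000, 100_000, 3e-6, 0.9)  # 90% cache hit rate
post / X  # ≈ 0.19: ~80% off input tokens for this feature
```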
Annualised, the saving was meaningful enough to fund a quarter of engineering work.
What we won't ship
Cache enabled without verifying the prefix is bit-stable.
Cache hit rates that aren't monitored.
Caching across users when the prefix legitimately differs per user (privacy or correctness issues).
Caching at a layer above the provider (e.g. whole-response caches) with a TTL so long that prompt updates don't roll out.
Close
Caching deterministic prefixes is high-leverage, low-effort. Most teams can adopt it in an afternoon and start saving. The cache hit rate is a metric to monitor. The savings compound at scale.
Related reading
- Cost guardrails — surrounding cost discipline.
- Context engineering — what's in the prefix.
We build AI-enabled software and help businesses put AI to work. If you're optimising LLM costs, we'd love to hear about it. Get in touch.