A team's mock LLM in tests returned hard-coded responses. Tests were fast and stable. They also hadn't caught a single real bug in 8 months. The mock had drifted from reality without anyone noticing.
Mocks are fast; real calls are accurate. The mix matters.
The faking rules
Use mocks for:
- Integration testing where the LLM's behaviour isn't what's being tested.
- Tests that run on every PR (where speed matters).
- Tests that exercise error handling (mock the error).
- Tests that exercise specific output shapes.
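A minimal sketch of the mock side, in Python, assuming a hypothetical client interface with a `complete(prompt)` method (`LLMResponse`, `FakeLLMClient`, and `summarise` are illustrative names, not from any particular library):

```python
from dataclasses import dataclass

# Hypothetical response shape -- adapt to whatever your real client returns.
@dataclass
class LLMResponse:
    text: str
    finish_reason: str

class FakeLLMClient:
    """Stands in for a real client exposing complete(prompt) -> LLMResponse."""
    def __init__(self, response: LLMResponse | None = None, error: Exception | None = None):
        self._response, self._error = response, error

    def complete(self, prompt: str) -> LLMResponse:
        if self._error:
            raise self._error   # exercise the error-handling path
        return self._response   # exercise a specific output shape

def summarise(client, text: str) -> str:
    """The code under test: everything *around* the LLM call."""
    try:
        return client.complete(f"Summarise: {text}").text
    except TimeoutError:
        return "(summary unavailable)"

def test_happy_path_shape():
    client = FakeLLMClient(response=LLMResponse(text="Short summary.", finish_reason="stop"))
    assert summarise(client, "long document") == "Short summary."

def test_timeout_is_handled():
    client = FakeLLMClient(error=TimeoutError("upstream timed out"))
    assert summarise(client, "long document") == "(summary unavailable)"
```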
Use real calls for:
- Behavioural eval.
- Drift detection.
- Production-like testing before release.
- Periodic verification that mocks still represent reality.
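The last item is the one teams skip. One way to make it concrete: a nightly test that compares the mock's canned response against a live call, so drift surfaces as a failure instead of going unnoticed. A sketch, assuming the hypothetical client from above plus a `real_llm` pytest marker and a `real_client` fixture of your own:

```python
import json
import pytest

CANNED_RESPONSE = '{"summary": "ok", "confidence": 0.9}'  # what the mock returns

@pytest.mark.real_llm  # excluded from PR runs, included nightly
def test_mock_still_matches_reality(real_client):
    """Fail loudly when the live output shape drifts from the canned one."""
    live = real_client.complete("Summarise: the quick brown fox...")
    live_keys = set(json.loads(live.text))
    mock_keys = set(json.loads(CANNED_RESPONSE))
    assert live_keys == mock_keys, f"mock drift: {mock_keys ^ live_keys}"
```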
Real-call cadence
Real calls run:
- Nightly.
- Pre-release.
- On significant prompt changes.
The cadence is "less than every PR, more than never."
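One way to enforce that cadence in pytest is to skip real-call tests unless the CI job opts in. A sketch; the `real_llm` marker and `RUN_REAL_LLM` variable are our own names, not built-ins:

```python
# conftest.py -- skip real-call tests unless the environment opts in.
# The nightly and pre-release CI jobs set RUN_REAL_LLM=1; PR jobs don't.
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_REAL_LLM") == "1":
        return  # real-call tests run as normal
    skip = pytest.mark.skip(reason="real LLM calls run nightly/pre-release only")
    for item in items:
        if "real_llm" in item.keywords:
            item.add_marker(skip)
```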
Reviewer ritual
When a real-call test fails but the mocked test passes:
- Treat it as mock drift, not flakiness.
- Investigate what the real model now returns.
- Update the mock to match.
- Add a regression test, as in the sketch below.
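What the last two steps can look like in practice. Suppose the nightly run revealed the model now prefixes its JSON with chatty prose: update the mock to return the new shape, fix the parser, and pin both with a test (all names here are illustrative):

```python
import json

def parse_llm_json(text: str) -> dict:
    """Code under test: tolerate the chatty preamble the real model now emits."""
    start = text.find("{")  # skip any leading prose before the JSON object
    if start == -1:
        raise ValueError("no JSON object in model output")
    return json.loads(text[start:])

# Regression test: the drift the nightly run caught, frozen as a mock response.
def test_preamble_before_json_is_tolerated():
    drifted = 'Sure! Here is the JSON:\n{"summary": "ok"}'
    assert parse_llm_json(drifted) == {"summary": "ok"}

def test_plain_json_still_parses():
    assert parse_llm_json('{"summary": "ok"}') == {"summary": "ok"}
```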
A real strategy
A team's setup:
- 200 unit/integration tests with mocks. Run on every PR. Under 1 minute total.
- 50 behavioural tests with real calls. Run nightly. ~10 minutes.
- 30 eval cases on a quarterly schedule with diverse prompts.
The team catches:
- Refactoring regressions: mocked tests catch them on PR.
- Prompt regressions: behavioural tests catch them within a day.
- Drift: nightly runs catch trends.
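One way to wire up such a split is with pytest markers, one per tier (the marker names and commands below are our own conventions, not a standard):

```python
# conftest.py -- register one marker per tier.
def pytest_configure(config):
    config.addinivalue_line("markers", "real_llm: makes live LLM calls (nightly)")
    config.addinivalue_line("markers", "eval: behavioural eval case (quarterly)")

# Each tier is then a selection, e.g. in CI:
#   PR job:        pytest -m "not real_llm and not eval"   # ~200 tests, <1 min
#   nightly job:   RUN_REAL_LLM=1 pytest -m "real_llm"     # ~50 tests, ~10 min
#   quarterly job: RUN_REAL_LLM=1 pytest -m "eval"         # ~30 cases
```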
Cost shape
Mocked tests are essentially free. Real-call tests cost money, but cadence bounds the cost.
Annual cost of nightly behavioural tests: a small fraction of the team's overall LLM bill. The reliability gain is worth it.
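A back-of-envelope illustration of "bounded by cadence"; the per-test cost here is an assumed figure, not a measurement:

```python
# Rough annual cost of the nightly real-call suite.
TESTS_PER_NIGHT = 50
COST_PER_TEST = 0.01    # assumed $/test (a few thousand tokens each)
NIGHTS_PER_YEAR = 365

annual = TESTS_PER_NIGHT * COST_PER_TEST * NIGHTS_PER_YEAR
print(f"~${annual:,.0f}/year")  # ~$183/year at these assumptions
```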
What we won't ship
- Mocks that haven't been verified against real behaviour.
- Real-call tests on every PR (too slow, too costly).
- No real-call tests at all (drift goes undetected).
- Mock responses that don't resemble real model output.
Close
Mocks and real calls each have their place. Mocks for speed, real calls for accuracy. The right mix is a strategy, not a default. The team that gets it right ships fast and catches regressions; the team that gets it wrong ships slow or misses regressions.
Related reading
- Determinism harnesses — record-replay pattern.
- Integration tests for AI features — companion topic.
- The new test pyramid — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're tightening test mocks, we'd love to hear about it. Get in touch.