A team's integration tests called the real LLM. The tests were slow and flaky — same input, slightly different output. The team adopted a record-replay pattern: real LLM in development, recorded responses in test runs.
This is the determinism harness. It makes a non-deterministic system testable.
The mock-with-record pattern
The pattern:
- Tests run with a mock LLM client.
- The mock returns recorded responses for each input.
- Recordings are versioned in the repo.
- New tests record on first run; subsequent runs replay.
For each input, the mock looks up the recording. If found, returns it. If not, calls the real LLM, records, and returns.
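That lookup logic can be sketched in a few lines. Everything here is illustrative: `complete(prompt) -> str` is an assumed client interface, and hash-keyed JSON files are one possible recording layout, not any particular library's format.

```python
import hashlib
import json
from pathlib import Path


class RecordReplayClient:
    """Record-replay wrapper around a real LLM client (sketch).

    Assumes `real_client` exposes `complete(prompt) -> str`; the name
    and signature are hypothetical, not a real library API.
    """

    def __init__(self, real_client, recording_dir: Path):
        self.real_client = real_client
        self.recording_dir = Path(recording_dir)
        self.recording_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, prompt: str) -> Path:
        # Key recordings by a hash of the input so lookups are exact.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
        return self.recording_dir / f"{digest}.json"

    def complete(self, prompt: str) -> str:
        path = self._path_for(prompt)
        if path.exists():
            # Replay: return the recorded response, no network call.
            return json.loads(path.read_text())["response"]
        # Miss: call the real LLM once, record, then return.
        response = self.real_client.complete(prompt)
        path.write_text(json.dumps({"prompt": prompt, "response": response}))
        return response
```

Because the recording key is derived from the full input, two tests with different prompts can never collide on the same recording file.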
Replay discipline
When the prompt changes:
- The recordings are stale.
- Re-record explicitly.
- Review the new recordings for sanity.
- Commit the new recordings.
Without this discipline, a changed prompt silently replays stale recordings: tests pass, but they no longer reflect the new behaviour.
Reviewer ritual
PRs that re-record should include:
- The new recordings in the diff.
- A note explaining why the recordings were updated.
The reviewer scans the changed responses for unexpected patterns.
A real harness
A team's harness:
- pytest-vcr style: each test has a YAML recording file.
- First run records; subsequent runs replay.
- Recordings are sanitised (PII redacted, secrets removed).
- Clear separation between development tests (real calls) and CI tests (recorded).
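The development/CI split can be made explicit with a small mode switch. The environment-variable convention below (`CI`, `RECORD`, `LLM_LIVE`) is an assumption invented for this sketch, not part of any library:

```python
def client_mode(env: dict) -> str:
    """Decide how tests talk to the LLM (sketch).

    Convention assumed here: CI always replays; setting RECORD=1
    locally re-records after a prompt change; LLM_LIVE=1 forces real
    calls during development; the default is replay.
    """
    if env.get("CI"):
        return "replay"   # CI never makes real calls
    if env.get("RECORD") == "1":
        return "record"   # explicit re-record after a prompt change
    if env.get("LLM_LIVE") == "1":
        return "live"     # development against the real LLM
    return "replay"
```

Defaulting to replay means a developer who forgets to set anything gets fast, deterministic tests, and re-recording is always a deliberate act.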
Limits
The harness has limits:
- It tests the integration layer; the LLM's actual behaviour is frozen at recording time.
- Behavioural evals still require real calls, just less frequently.
- Recordings can drift from real model behaviour over time.
A team's strategy: integration tests use recordings (fast); behavioural eval uses real calls (slower, less frequent).
What we won't ship
Tests that always call the real LLM (too slow, too flaky).
Tests that always use recordings (miss real behavioural changes).
Recordings that aren't reviewed when re-recorded.
Recordings with PII or secrets. Always sanitise.
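A sanitisation pass runs before any recording is written to disk. This is a minimal sketch; the two patterns are illustrative placeholders, and a real deployment would maintain its own inventory of secret formats and PII detectors.

```python
import re

# Illustrative patterns only: an email address and a hypothetical
# "sk-..." API-key shape. Real sanitisers need a fuller inventory.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{8,}\b")


def sanitise(text: str) -> str:
    """Redact PII and secrets before a response is recorded."""
    text = EMAIL.sub("[EMAIL]", text)
    text = API_KEY.sub("[API_KEY]", text)
    return text
```

Running this at record time, rather than at review time, means unsanitised data never reaches the repo in the first place.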
Close
Determinism harnesses make LLM-dependent code testable cheaply. Record-replay is the pattern. Recordings stay current. Real-call evals run less often. The team's CI stays fast and comprehensive.
Related reading
- Mock LLMs in tests — companion topic.
- Integration tests for AI features — same testing context.
- The new test pyramid — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're building determinism harnesses, we'd love to hear about it. Get in touch.