A team's "eval set" was actually four different things mixed together: known-good cases, behaviour rubrics, snapshots, and adversarial cases. Mixing them meant pass/fail signals were ambiguous. Splitting them clarified.
The eval taxonomy: golden, behavioural, drift, safety. Each has its own purpose, its own metrics, its own cadence.
The four types
Golden. Specific input → specific expected output. High-confidence cases. Pass means the system handled this case correctly.
Behavioural. Input → required behaviour properties. The exact output may vary; the properties must hold. Pass means the system followed the rules.
Drift. Input → output compared to baseline. Pass means the output hasn't changed unexpectedly.
Safety. Adversarial input → expected refusal or routing. Pass means the system handled the attack correctly.
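In code, the four pass conditions have four different shapes. A minimal sketch, assuming a plain-Python harness; `run_model` and every other name here is a stand-in for illustration, not a real API:

```python
from typing import Callable

# Hedged sketch: `run_model` stands in for whatever produces the
# system's output. All names here are assumptions.

def run_model(prompt: str) -> str:
    return "stubbed output"  # replace with the system under test

def check_golden(prompt: str, expected: str) -> bool:
    # Golden: one known-correct answer, compared exactly.
    return run_model(prompt) == expected

def check_behavioural(prompt: str, properties: list[Callable[[str], bool]]) -> bool:
    # Behavioural: the exact output may vary; every required property must hold.
    output = run_model(prompt)
    return all(prop(output) for prop in properties)

def check_drift(prompt: str, baseline: str) -> bool:
    # Drift: compare against a stored reference output; any change needs review.
    return run_model(prompt) == baseline

def check_safety(prompt: str, is_refusal: Callable[[str], bool]) -> bool:
    # Safety: an adversarial prompt should be refused or routed, never answered.
    return is_refusal(run_model(prompt))
```

The shapes are the point: golden compares to an answer, behavioural checks predicates, drift compares to a baseline, safety checks the response class rather than the content.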
When each wins
- Golden: classification, extraction, deterministic transformations.
- Behavioural: prose generation, summaries, conversation.
- Drift: anything where unexpected output changes are costly.
- Safety: anywhere with adversarial users.
Most production systems need all four.
Reviewer ritual
PR review:
- Which eval types ran?
- What were the results per type?
- Are there gaps in coverage by type?
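One way to make those questions answerable at a glance is a per-type summary printed in CI. A minimal sketch, assuming results arrive as (eval_type, passed) pairs; that shape is an assumption, not any particular harness's format:

```python
from collections import defaultdict

EXPECTED_TYPES = {"golden", "behavioural", "drift", "safety"}

def summarise(results):
    """Print per-type pass rates and flag eval types with no coverage."""
    by_type = defaultdict(lambda: [0, 0])  # eval_type -> [passed, total]
    for eval_type, passed in results:
        by_type[eval_type][0] += int(passed)
        by_type[eval_type][1] += 1
    for eval_type in sorted(by_type):
        passed, total = by_type[eval_type]
        print(f"{eval_type}: {passed}/{total} passed")
    missing = EXPECTED_TYPES - by_type.keys()
    if missing:
        print(f"coverage gap: no cases ran for {', '.join(sorted(missing))}")

summarise([("golden", True), ("golden", False), ("safety", True)])
# golden: 1/2 passed
# safety: 1/1 passed
# coverage gap: no cases ran for behavioural, drift
```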
A real mix
A team's customer-support agent eval:
- Golden: 50 cases (FAQ-style; correct answer is known).
- Behavioural: 30 cases (free-form conversations; rubric scores).
- Drift: 40 reference outputs (snapshots covering the most significant output patterns).
- Safety: 60 adversarial cases (jailbreaks, injection).
Total: 180 cases across four eval types. Each type has its own pass-rate threshold and CI behaviour.
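What "its own pass-rate threshold and CI behaviour" might look like as config. A sketch with illustrative numbers; the actual thresholds and the block/warn split are the team's call, not figures from the source:

```python
# Hypothetical per-type gating. Thresholds are illustrative assumptions.
THRESHOLDS = {
    "golden":      {"min_pass_rate": 1.00, "on_fail": "block"},  # known answers: no slack
    "behavioural": {"min_pass_rate": 0.90, "on_fail": "block"},  # rubric scores tolerate variance
    "drift":       {"min_pass_rate": 0.95, "on_fail": "warn"},   # changes need review, not auto-block
    "safety":      {"min_pass_rate": 1.00, "on_fail": "block"},  # one jailbreak success is too many
}

def gate(eval_type: str, passed: int, total: int) -> str:
    """Return 'ok', 'warn', or 'block' for one eval type's CI result."""
    rule = THRESHOLDS[eval_type]
    rate = passed / total if total else 0.0
    return "ok" if rate >= rule["min_pass_rate"] else rule["on_fail"]

print(gate("safety", 59, 60))  # -> "block": 59/60 misses the 1.00 threshold
```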
How to start
Most teams start with golden. They add behavioural when the feature does prose generation. They add drift when production stability matters. They add safety when adversarial users appear.
This progression is normal. Most teams eventually end up with all four for any meaningful feature.
What we won't ship
- A single eval type that does every job. Different jobs need different evals.
- Safety evals skipped for user-facing AI.
- Drift evals without a clear baseline reference.
- Behavioural evals without calibrated rubrics.
Close
Eval taxonomy is the discipline of using the right eval for the right job. Golden, behavioural, drift, safety: each has its own purpose, and an eval suite is comprehensive only when each type catches what the others miss.
Related reading
- Building your first eval set — start here.
- What makes an eval good — quality framing.
- The new test pyramid — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're tightening eval taxonomy, we'd love to hear about it. Get in touch.