A team had 2,400 cases in their eval set. Half were duplicates. A quarter were ambiguous (humans disagreed on the right answer). Most of the rest were happy-path; the failures the team actually saw in production didn't appear at all.
Golden sets aren't measured by size. They're measured by signal. A 200-case set that's curated beats a 2,400-case set that isn't.
Curation rules
What earns a place in the golden set:
- Representativeness. Reflects production traffic distribution.
- Diversity. Spans the input space.
- Distinctness. Each case tests something specific.
- Verifiability. Humans agree on the right answer.
Cases that fail any of these rules get cut.
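One way to make these rules checkable is to record them on each case and filter at load time. A minimal sketch, assuming a simple per-case schema (the field names and the `curate` helper are illustrative, not an existing library):

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    case_id: str
    input_text: str
    expected_output: str
    category: str            # happy_path | edge | adversarial | regression
    source: str              # e.g. "production sample", "counter-example mining"
    reviewers_agreed: bool   # verifiability: did humans agree on the answer?
    tests_what: str          # distinctness: one sentence on what this case checks

def curate(cases: list[GoldenCase]) -> list[GoldenCase]:
    """Drop cases that fail the rules above: duplicates and ambiguous answers."""
    seen_inputs: set[str] = set()
    kept: list[GoldenCase] = []
    for case in cases:
        if case.input_text in seen_inputs:
            continue                      # distinctness: exact duplicates get cut
        if not case.reviewers_agreed:
            continue                      # verifiability: ambiguous cases get cut
        seen_inputs.add(case.input_text)
        kept.append(case)
    return kept
```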
The 200 that earn their place
A typical mature eval set has:
- 50-100 happy-path cases (representative of common production traffic).
- 30-50 edge cases (boundaries, unusual but valid inputs).
- 30-50 adversarial cases (attempts to break the system).
- 20-30 known-regression cases (specific bugs that were fixed; should stay fixed).
The set is sized to be reviewable. A 200-case set can be re-run in CI quickly. A 2,000-case set can't.
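If the set lives in version control, the composition above can be enforced as a cheap CI check. A sketch, assuming cases are stored as JSON lines with a `category` field (the file name and the exact bounds are assumptions):

```python
import json
from collections import Counter

# Target composition, mirroring the breakdown above (bounds are illustrative).
TARGETS = {
    "happy_path": (50, 100),
    "edge": (30, 50),
    "adversarial": (30, 50),
    "regression": (20, 30),
}

def check_composition(path: str = "golden_set.jsonl") -> None:
    with open(path) as f:
        counts = Counter(json.loads(line)["category"] for line in f if line.strip())
    for category, (low, high) in TARGETS.items():
        n = counts.get(category, 0)
        assert low <= n <= high, f"{category}: {n} cases, expected {low}-{high}"

if __name__ == "__main__":
    check_composition()
```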
Review ritual
The eval set is reviewed quarterly:
- Are cases still representative?
- Have new production patterns emerged that the set doesn't cover?
- Are stale cases being retired?
- Is the size still right?
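Part of that ritual can be pre-computed, so reviewers start from a shortlist rather than the whole set. A sketch that flags retirement candidates, assuming each case record carries `added_on` and `last_failed` dates (both fields are assumptions, not from the original set):

```python
from datetime import date, timedelta

def retirement_candidates(cases: list[dict], today: date | None = None) -> list[dict]:
    """Flag cases that haven't failed in a year and are old enough to re-examine.

    A flagged case isn't removed automatically; it's queued for the quarterly
    review to confirm whether it still earns its place.
    """
    today = today or date.today()
    stale_cutoff = today - timedelta(days=365)
    flagged = []
    for case in cases:
        added_on = date.fromisoformat(case["added_on"])
        last_failed = case.get("last_failed")
        never_or_long_ago = (
            last_failed is None or date.fromisoformat(last_failed) < stale_cutoff
        )
        if added_on < stale_cutoff and never_or_long_ago:
            flagged.append(case)
    return flagged
```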
Coverage
Coverage is measured against:
- Input dimensions (the set spans the input variability).
- Failure modes (each known failure mode has cases).
- Edge cases (boundaries are tested).
- Production distribution (the set roughly matches what's seen in production).
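The last point is the hardest to eyeball, so a rough computation helps: compare the set's label mix against a recent sample of production traffic. A minimal sketch (the 10% tolerance and the label-list format are assumptions):

```python
from collections import Counter

def coverage_gap(golden_labels: list[str], production_labels: list[str],
                 tolerance: float = 0.10) -> dict[str, float]:
    """Return categories where the golden set's share drifts from production's.

    Values are (golden share - production share); anything beyond the
    tolerance is a gap for the next review to close.
    """
    golden = Counter(golden_labels)
    prod = Counter(production_labels)
    gaps = {}
    for label in set(golden) | set(prod):
        g_share = golden[label] / max(len(golden_labels), 1)
        p_share = prod[label] / max(len(production_labels), 1)
        if abs(g_share - p_share) > tolerance:
            gaps[label] = round(g_share - p_share, 3)
    return gaps
```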
A real set
A team's customer-classifier golden set:
- 80 happy-path cases (representative ticket types).
- 40 edge cases (unusual phrasings, boundary conditions).
- 25 adversarial cases (prompt injection attempts).
- 15 known-regression cases (bugs that were fixed).
- Total: 160 cases.
Reviewable in 30 minutes. Re-runnable in 5. The team trusts the set's signal.
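"Re-runnable in 5" usually means the set is wired straight into the test runner. A sketch of what that can look like with pytest; the `classify` placeholder and the `golden_set.jsonl` path stand in for the team's actual code:

```python
import json
import pytest

def load_golden_set(path: str = "golden_set.jsonl") -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def classify(text: str) -> str:
    """Placeholder: wire this to the classifier under test."""
    raise NotImplementedError

# One test per golden case; a failure names exactly which case regressed.
@pytest.mark.parametrize("case", load_golden_set(), ids=lambda c: c["case_id"])
def test_golden_case(case: dict) -> None:
    assert classify(case["input_text"]) == case["expected_output"]
```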
What we won't ship
- Eval sets without curation.
- Eval sets that only grow (without retirement).
- Eval sets built solely from imagined cases.
- Skipping the quarterly review.
Close
Golden-set discipline is the curation work that makes evals trustworthy. 200 cases that earn their place beat 2,000 that don't. The set stays reviewable, representative, and distinct, and the signal stays high. Skip curation and the eval becomes noise.
Related reading
- Building your first eval set — start here.
- Counter-example mining — eval-set growth.
- The new test pyramid — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're curating eval sets, we'd love to hear about it. Get in touch.