A team's prompt was at 95% accuracy on the eval and stuck there. Iterating the prompt didn't help. Iterating the eval set — adding cases the prompt was likely to fail on — did. Within a quarter, accuracy was at 98%.
Finding the cases your prompt fails on is the highest-leverage eval-improvement work. Counter-example mining is the discipline.
Where prompts fail
Common patterns:
- Edge cases. Inputs the prompt's writer didn't anticipate.
- Ambiguity. Inputs that could be interpreted multiple ways.
- Boundary conditions. Maximum/minimum values, empty inputs, exotic characters.
- Compositional. Complex inputs combining edge cases.
- Real-world variation. Inputs that look different in production than in eval.
Each pattern is a category to mine.
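One way to make the categories operational is to tag every mined case with its pattern, so under-mined categories stay visible. A minimal sketch; the names and fields are illustrative, not a real framework:

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class FailurePattern(Enum):
    EDGE_CASE = "edge_case"
    AMBIGUITY = "ambiguity"
    BOUNDARY = "boundary"
    COMPOSITIONAL = "compositional"
    REAL_WORLD_VARIATION = "real_world_variation"

@dataclass
class MinedCase:
    input_text: str
    expected_output: str
    pattern: FailurePattern
    source_trace_id: str  # where in production this case came from

def cases_per_pattern(cases: list[MinedCase]) -> Counter:
    """Count mined cases per pattern; zeros reveal under-mined categories."""
    counts = Counter({p: 0 for p in FailurePattern})
    counts.update(case.pattern for case in cases)
    return counts
```

A zero next to a pattern is itself a finding: it usually means the mining filter is blind to that category, not that production never produces it.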
Mining workflow
The discipline:
- Sample production traffic.
- Filter to cases the model was uncertain about (low confidence scores, retries, escalations).
- Have humans review the filtered cases.
- Add the failures to the eval set.
- Iterate the prompt.
Production failures are the gold standard for eval-set growth.
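A minimal sketch of the sampling-and-filtering steps, assuming production traces are dicts with hypothetical `confidence`, `retried`, and `escalated_to_human` fields:

```python
import random

def mine_candidates(traces: list[dict], sample_size: int = 200,
                    confidence_threshold: float = 0.7) -> list[dict]:
    """Sample production traffic, then keep only the cases worth human review."""
    sample = random.sample(traces, min(sample_size, len(traces)))
    return [
        t for t in sample
        if t["confidence"] < confidence_threshold  # model was unsure
        or t["retried"]                            # user asked again
        or t["escalated_to_human"]                 # handed off entirely
    ]

# After human review, confirmed failures become eval cases, e.g.:
#   for t in reviewed_failures:
#       eval_set.add(input=t["input"], expected=t["corrected_output"])
# (eval_set.add is illustrative; use whatever your eval harness provides.)
```

The filter matters more than the sample size: random sampling alone mostly returns cases the model already handles.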
Reviewer ritual
Run this ritual on a schedule, weekly for high-stakes agents:
- Sample model failures from production.
- Have a reviewer triage them.
- Author new eval cases from confirmed failures.
- Retire old eval cases that no longer represent the workload (sketch below).
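The retirement step is the one teams skip. A sketch, assuming each eval case records a hypothetical `last_seen_in_production` timestamp for its pattern:

```python
from datetime import datetime, timedelta

def split_stale_cases(eval_cases: list[dict], max_age_days: int = 180):
    """Separate eval cases still seen in production from candidates to retire."""
    cutoff = datetime.now() - timedelta(days=max_age_days)
    active = [c for c in eval_cases if c["last_seen_in_production"] >= cutoff]
    stale = [c for c in eval_cases if c["last_seen_in_production"] < cutoff]
    return active, stale
```

Stale cases go to the reviewer for a keep-or-retire call; dropping them automatically risks losing regression coverage for rare but real patterns.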
A real mining run
The team from the opening had a classifier stuck at 95% accuracy. Mining surfaced:
- 12 cases of industry jargon the prompt didn't handle.
- 8 cases where the input was ambiguous.
- 5 cases of boundary conditions (very short, very long).
- 3 cases of compositional complexity.
Adding these 28 cases to the eval and iterating the prompt: 98% accuracy.
The mined cases also doubled as signals for prompt evolution: the prompt was updated to handle each pattern, not just each individual case.
Coverage
Coverage isn't just count. It's diversity:
- Edge cases per input dimension.
- Distribution of input types.
- Real-world distribution match.
- Adversarial cases.
A 1000-case eval that's mostly happy path covers less than a 200-case eval that spans these dimensions.
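One check that's easy to automate is the distribution match: compare input-type proportions in the eval set against production. A sketch, assuming cases carry a hypothetical `input_type` tag:

```python
from collections import Counter

def proportions(cases: list[dict]) -> dict[str, float]:
    counts = Counter(c["input_type"] for c in cases)
    total = sum(counts.values()) or 1  # guard against an empty set
    return {t: n / total for t, n in counts.items()}

def distribution_gap(eval_cases: list[dict],
                     production_cases: list[dict]) -> dict[str, float]:
    """Per-type gap: positive means production sees it more than the eval covers."""
    eval_p = proportions(eval_cases)
    prod_p = proportions(production_cases)
    all_types = set(eval_p) | set(prod_p)
    return {t: prod_p.get(t, 0.0) - eval_p.get(t, 0.0) for t in all_types}
```

Sort the gaps descending and mine the top types first; that's the cheapest way to direct the weekly ritual.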
What we won't ship
Eval sets that aren't growing.
Eval sets assembled only from imagined cases.
Iteration cycles that don't include eval-set growth.
Counter-example mining that doesn't influence the prompt.
Close
Counter-example mining is the eval discipline that makes prompts robust. Production failures inform the eval set. The eval set drives prompt iteration. The prompt improves on the cases that actually fail in production. Skip the mining and the prompt overfits to an imagined eval.
Related reading
- Red-teaming your own prompt — companion discipline.
- Building your first eval set — eval foundation.
We build AI-enabled software and help businesses put AI to work. If you're tightening eval discipline, we'd love to hear about it. Get in touch.