
Eval-driven prompt iteration

The eval is the spec. Prompt iteration converges to it.

Yash Shah · April 6, 2026 · 2 min read

Prompt iteration without an eval is vibes. Prompt iteration with an eval is engineering.

The eval defines the target. Each prompt change is measured against it. The prompt converges to the spec.

The loop

Eval set authored (or borrowed).
Initial prompt drafted.
Eval run. Score baseline.
Iterate prompt. Re-eval.
If score improves, keep. If not, revert.
Repeat until threshold met.
Ship.

Each iteration is data-driven. The team's intuition guides; the eval gates.
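The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `score`, `iterate`, `toy_model`, and the tiny spam/ham eval set are all hypothetical names and data, and `toy_model` is a keyword matcher standing in for a real model call.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)
Model = Callable[[str, str], str]  # (prompt, input text) -> predicted label


def score(prompt: str, examples: Sequence[Example], run_model: Model) -> float:
    """Fraction of examples where the prediction equals the expected label."""
    hits = sum(run_model(prompt, text) == expected for text, expected in examples)
    return hits / len(examples)


def iterate(baseline: str, candidates: Sequence[str], examples: Sequence[Example],
            run_model: Model, threshold: float):
    """Keep a candidate only if it beats the best score so far; stop at threshold."""
    best_prompt, best_score = baseline, score(baseline, examples, run_model)
    history = [(baseline, best_score, "baseline")]
    for candidate in candidates:
        if best_score >= threshold:
            break  # threshold met: best_prompt is the one to ship
        candidate_score = score(candidate, examples, run_model)
        decision = "keep" if candidate_score > best_score else "revert"
        history.append((candidate, candidate_score, decision))
        if decision == "keep":
            best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score, history


def toy_model(prompt: str, text: str) -> str:
    """Hypothetical stand-in for a model call: the prompt is a keyword list."""
    keywords = [k.strip() for k in prompt.split(",") if k.strip()]
    return "spam" if any(k in text for k in keywords) else "ham"


# Hypothetical four-example eval set, for illustration only.
EXAMPLES = [
    ("win a free prize", "spam"),
    ("free money now", "spam"),
    ("meeting at noon", "ham"),
    ("claim your prize", "spam"),
]

best_prompt, best_score, history = iterate(
    baseline="free",
    candidates=["noon", "free, prize", "free, prize, claim"],
    examples=EXAMPLES,
    run_model=toy_model,
    threshold=0.9,
)
for prompt, prompt_score, decision in history:
    print(f"{decision:8} {prompt_score:.2f}  {prompt!r}")
```

The third candidate is never evaluated: once the kept prompt clears the threshold, the loop stops and the best prompt ships. In this sketch the comparison is strict, so a tie counts as a revert; a team may reasonably choose otherwise.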

Reviewer ritual

A PR for a prompt change includes:

Before/after eval scores.
Per-cohort breakdowns.
Justification for the change.

If the score regressed but the engineer believes the change is an improvement (that is, the eval set is wrong), they make the case by proposing eval-set updates, not by appealing to subjective preference.
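One way the before/after table for such a PR might be produced is sketched below. The function names, the cohort labels ("short", "long"), and the data are hypothetical, and the sketch assumes both runs cover the same examples and cohorts.

```python
from collections import defaultdict


def accuracy_by_cohort(results):
    """results: list of (cohort, is_correct) pairs. Returns {cohort: accuracy}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for cohort, is_correct in results:
        totals[cohort] += 1
        hits[cohort] += int(is_correct)
    by_cohort = {cohort: hits[cohort] / totals[cohort] for cohort in totals}
    by_cohort["ALL"] = sum(hits.values()) / sum(totals.values())
    return by_cohort


def pr_report(before, after):
    """Rows of (cohort, before, after, delta), with the overall row first."""
    b, a = accuracy_by_cohort(before), accuracy_by_cohort(after)
    cohorts = ["ALL"] + sorted(c for c in b if c != "ALL")
    return [(c, b[c], a[c], a[c] - b[c]) for c in cohorts]


# Hypothetical per-example results from the old and the new prompt.
BEFORE = [("short", True), ("short", True), ("long", True), ("long", False)]
AFTER = [("short", True), ("short", False), ("long", True), ("long", True)]

report = pr_report(BEFORE, AFTER)
for cohort, old, new, delta in report:
    print(f"{cohort:6} {old:.2f} -> {new:.2f} ({delta:+.2f})")
```

In this toy data the overall score is unchanged while one cohort improves and the other regresses, which is the case the per-cohort breakdown exists to catch.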

A real workflow

A team's classifier improvement:

Day 1: 91% baseline.
Day 2: prompt change → 88% (revert).
Day 3: different prompt change → 93% (keep).
Day 4: refinement → 94% (keep).
Day 5: refinement → 94% (score unchanged; keep).
Day 6: ship.

Six days; data-driven; converged.

Without the eval, this would have been "engineer thought it was better; shipped; eventually noticed regression."

Trade-offs

Slow: each iteration costs engineer time plus the cost of an eval run.
Reliable: a change that passes is a measured improvement on the eval set, not an impression.

The slowness is the discipline. Skipping the discipline accumulates regressions.

Limits

Eval set must be representative. Otherwise the prompt over-fits the eval.
Eval set must grow. Otherwise it becomes stale.
Some improvements are visible only at production scale.
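One common guard against over-fitting the eval, not something this workflow prescribes, is to iterate on one part of the eval set and check the final prompt against a held-out part. The sketch below assumes each example has a stable string ID; `in_holdout` and `split` are hypothetical names. Hashing the ID keeps every example on the same side of the split as the eval set grows.

```python
import hashlib


def in_holdout(example_id: str, holdout_percent: int = 20) -> bool:
    """Deterministically assign an example to the holdout by hashing its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100 < holdout_percent


def split(example_ids, holdout_percent: int = 20):
    """Return (dev, holdout): iterate on dev, confirm on holdout before shipping."""
    dev, holdout = [], []
    for example_id in example_ids:
        target = holdout if in_holdout(example_id, holdout_percent) else dev
        target.append(example_id)
    return dev, holdout


# Hypothetical IDs; the second call simulates the eval set growing.
ids = [f"ex-{i}" for i in range(200)]
dev, holdout = split(ids)
grown_dev, grown_holdout = split(ids + [f"new-{i}" for i in range(50)])
print(len(dev), len(holdout))
```

A prompt whose score climbs on the dev part but not on the holdout is likely fitting the eval rather than the task.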

What we won't ship

Prompt changes without eval evidence.

Eval scores that don't convince the team the change is real.

Iteration that targets the eval set without considering production traffic.

Prompts that pass the eval but fail in production.
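Whether a score difference is "real" can be checked rather than debated. The post does not name a test; as one option, the sketch below applies an exact McNemar test to paired per-example results from the old and new prompt. The function name and the example data are hypothetical.

```python
from math import comb


def mcnemar_exact_p(before, after):
    """Two-sided exact McNemar p-value for paired per-example correctness."""
    if len(before) != len(after):
        raise ValueError("before and after must cover the same examples")
    fixed = sum((not b) and a for b, a in zip(before, after))   # wrong -> right
    broken = sum(b and (not a) for b, a in zip(before, after))  # right -> wrong
    n = fixed + broken  # only examples whose outcome changed carry information
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(fixed, broken) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical 100-example eval: 3 examples fixed, 2 broken (97% -> 98%).
small_before = [True] * 2 + [False] * 3 + [True] * 95
small_after = [False] * 2 + [True] * 3 + [True] * 95
p_small = mcnemar_exact_p(small_before, small_after)

# Hypothetical 100-example eval: 12 examples fixed, 1 broken (88% -> 99%).
large_before = [True] * 1 + [False] * 12 + [True] * 87
large_after = [False] * 1 + [True] * 12 + [True] * 87
p_large = mcnemar_exact_p(large_before, large_after)

print(f"+1 point:   p = {p_small:.3f}")
print(f"+11 points: p = {p_large:.4f}")
```

In the first case the one-point gain is indistinguishable from noise, so the score alone should not convince the team; in the second the p-value is small enough that chance is an unlikely explanation.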

Close

Eval-driven prompt iteration is the discipline that makes prompt engineering engineering. The eval is the spec. Each change is measured. Convergence is data-driven. Skip the discipline and prompts evolve by vibes.

