Jaypore Labs
Engineering

Per-feature evals vs. per-model evals

Different scopes need different evals. The choice shapes the discipline.

Yash Shah · March 11, 2026 · 2 min read

A team had a single eval set that covered everything. When they shipped a new feature, the eval got bigger. When they bumped the model, they ran the same eval. Neither scope was being tested cleanly.

Per-feature evals catch feature-specific issues. Per-model evals catch model-specific issues. Different scopes; different evals.

The two scopes

Per-feature evals. Focused on a specific product feature. Every case exercises that feature. Used for PR review and feature-specific regression detection.

Per-model evals. Focused on the team's general LLM capability. Cases span varied features and use cases. Used for model-bump decisions and broad quality monitoring.
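A minimal sketch of the two scopes as data, assuming a homegrown harness; Case, EvalSuite, and the suite contents below are hypothetical, not a real library:

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    prompt: str
    expected: str  # substring the response should contain; simplest possible check

@dataclass
class EvalSuite:
    name: str
    scope: str                 # "feature" or "model"
    cases: list[Case] = field(default_factory=list)

# Per-feature: every case exercises one feature.
summarize_eval = EvalSuite(
    name="summarize",
    scope="feature",
    cases=[Case("Summarize: The launch moved from Monday to Thursday.", "Thursday")],
)

# Per-model: cases deliberately span features and use cases.
model_eval = EvalSuite(
    name="model-baseline",
    scope="model",
    cases=[
        Case("Summarize: The launch moved from Monday to Thursday.", "Thursday"),
        Case("Extract the city from: 'Ship it to the Pune office.'", "Pune"),
        Case("Answer yes or no: is 17 prime?", "yes"),
    ],
)
```

The structure is identical; only the coverage of the cases differs. The scope field is what CI keys on.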

When each wins

  • Per-feature: when changes are feature-scoped. Most PRs.
  • Per-model: when changes are model-scoped. Provider bumps. Major prompt overhauls.

A team needs both. CI runs feature evals on PRs; the model-bump process runs per-model evals.
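The CI routing can be a small path-to-suite map. A sketch, assuming the PR diff exposes changed file paths and each feature lives under its own directory; FEATURE_DIRS, suites_for_pr, and the paths are all hypothetical:

```python
# Pick which eval suites a change should run (hypothetical CI glue).
FEATURE_DIRS = {
    "features/summarize/": "summarize",
    "features/search/": "search",
}

def suites_for_pr(changed_files: list[str]) -> set[str]:
    """Per-feature suites whose feature directory the PR touched."""
    hit = set()
    for path in changed_files:
        for prefix, suite in FEATURE_DIRS.items():
            if path.startswith(prefix):
                hit.add(suite)
    return hit

def suites_for_model_bump() -> set[str]:
    """A model bump always runs the broad per-model suite."""
    return {"model-baseline"}

# A PR touching features/search/ runs only the "search" eval:
assert suites_for_pr(["features/search/ranker.py"]) == {"search"}
```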

Reviewer ritual

During PR review, ask:

  • Which eval ran?
  • Was the right scope tested?
  • Are there cross-scope concerns?

A real mix

One team's setup:

  • 8 per-feature evals (one per shipped feature).
  • 1 per-model eval (300 cases spanning use cases).
  • Per-feature run on every PR touching that feature.
  • Per-model run quarterly + on model bumps.
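One way that mix could be expressed as a single schedule; the feature names and trigger labels below are invented for illustration:

```python
# Each suite declares its scope and when it runs; CI reads this table.
# All names here are hypothetical.
EVAL_SCHEDULE = {
    # 8 per-feature suites: run on any PR touching that feature.
    **{f: {"scope": "feature", "triggers": ["pr"]} for f in [
        "summarize", "search", "draft", "classify",
        "extract", "translate", "route", "redact",
    ]},
    # 1 per-model suite (~300 cases): quarterly plus every model bump.
    "model-baseline": {"scope": "model", "triggers": ["quarterly", "model-bump"]},
}

def suites_for(trigger: str) -> list[str]:
    """Suites whose declared triggers include this event."""
    return [name for name, cfg in EVAL_SCHEDULE.items()
            if trigger in cfg["triggers"]]

assert suites_for("model-bump") == ["model-baseline"]
```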

Per-feature catches feature regressions. Per-model catches model-quality shifts.

Trade-offs

  • More eval suites = more maintenance.
  • Single eval = less coverage of scope-specific issues.

Most teams need 5-15 per-feature evals plus a per-model eval. More than that becomes hard to maintain.

What we won't ship

  • A single eval suite stretched across diverse features.
  • Per-feature evals without per-model coverage.
  • Per-model evals without per-feature coverage.
  • Skipping the per-model eval run on a model bump.

Close

Per-feature and per-model evals serve different purposes. A team needs both: different cadences, different content. Skip either and the eval suite has blind spots.

We build AI-enabled software and help businesses put AI to work. If you're tightening eval scopes, we'd love to hear about it. Get in touch.

Tagged: Evals, Scope, Engineering, Output Testing, Architecture