A team's prose-generation feature produced varied outputs, so they couldn't grade against a single expected answer. They needed a rubric that captured what "good" meant without exact-match comparison.
Open-ended outputs are gradable when the rubric is rigorous.
The rubric discipline
A good rubric has:
- Specific dimensions (clarity, accuracy, tone, etc.).
- Clear definitions for each dimension.
- Examples per dimension (1-5 score with anchors).
- Scope (what's in scope; what's out).
Without these, raters disagree wildly. With them, agreement climbs.
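One way to keep these components honest is to represent the rubric as data and validate it before anyone grades with it. A minimal sketch, assuming a simple in-memory shape (the `Dimension` and `Rubric` names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    name: str
    definition: str          # clear definition for the dimension
    anchors: dict[int, str]  # score -> anchored example, e.g. {1: "Off-brand"}


@dataclass
class Rubric:
    dimensions: list[Dimension]
    in_scope: list[str]
    out_of_scope: list[str]

    def validate(self) -> None:
        """Fail loudly if any dimension lacks a definition or score anchors."""
        for d in self.dimensions:
            assert d.definition, f"{d.name}: missing definition"
            assert {1, 3, 5} <= set(d.anchors), f"{d.name}: missing 1/3/5 anchors"


# Example: one dimension from a customer-email rubric.
tone = Dimension(
    name="Tone",
    definition="Matches the brand voice expected in customer email.",
    anchors={1: "Off-brand", 3: "Acceptable", 5: "On-brand"},
)
rubric = Rubric(
    dimensions=[tone],
    in_scope=["customer-facing emails"],
    out_of_scope=["legal notices"],
)
rubric.validate()
```

Keeping the rubric in code rather than a doc means the validation step can run in CI alongside the evals themselves.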
Reviewer ritual
Rubric reviewed:
- After each iteration of the feature.
- When inter-rater agreement is low.
- When the output space changes.
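Triggering a review on "low agreement" requires actually measuring agreement. A minimal sketch using Cohen's kappa for two raters (an assumed choice of statistic; libraries like scikit-learn provide this, but the two-rater case is short enough to write directly):

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired scores"
    n = len(rater_a)
    # Observed agreement: fraction of items scored identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: chance overlap given each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    if expected == 1:          # both raters used a single category throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Two raters scoring four outputs on the 1-5 scale; they differ on one item.
print(cohens_kappa([1, 3, 5, 5], [1, 3, 5, 3]))  # ~0.636
```

A common (rough) reading: below ~0.4 the rubric needs work; above ~0.6 raters are substantially aligned.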
A real rubric
For a customer-email-generation feature:
| Dimension | 1 | 3 | 5 |
| --- | --- | --- | --- |
| Tone | Off-brand | Acceptable | On-brand |
| Clarity | Unclear | Acceptable | Crystal clear |
| Length | Wrong length | Acceptable | Optimal |
| Specificity | Generic | Acceptable | Specific to context |
| Helpfulness | Doesn't help | Helps somewhat | Directly addresses need |
Each dimension has expanded definitions and examples per score.
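Grading against a rubric like this reduces to collecting a 1-5 score per dimension and rolling them up. A minimal sketch, assuming an unweighted average as the roll-up (the team might weight dimensions differently):

```python
# Dimensions from the customer-email rubric above.
RUBRIC_DIMENSIONS = ["tone", "clarity", "length", "specificity", "helpfulness"]


def score_output(scores: dict[str, int]) -> float:
    """Combine one rater's 1-5 dimension scores into a single grade."""
    missing = set(RUBRIC_DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"unscored dimensions: {sorted(missing)}")
    for dim, s in scores.items():
        if not 1 <= s <= 5:
            raise ValueError(f"{dim}: score {s} outside 1-5")
    return sum(scores[d] for d in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)


# One generated email, one rater.
email_scores = {"tone": 5, "clarity": 4, "length": 3,
                "specificity": 4, "helpfulness": 5}
print(score_output(email_scores))  # 4.2
```

Keeping per-dimension scores around (rather than only the average) is what lets the team see *which* quality is slipping between iterations.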
Trade-offs
Rubric design:
- Slow to build initially.
- Pays off in agreement and signal.
- Needs maintenance as the feature evolves.
The team's investment in the rubric is investment in the feature's quality.
Limits
Some judgments are genuinely ambiguous. Rubrics can't fix this:
- "Was this funny?" — partly subjective.
- "Was this culturally appropriate?" — context-dependent.
- "Was this useful?" — depends on the user.
For these, the rubric provides structure but the team accepts disagreement.
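Accepting disagreement still benefits from knowing where it lives. A minimal sketch that flags dimensions where rater scores spread widely (the threshold of one standard deviation is an assumption, not a standard):

```python
from statistics import stdev


def flag_ambiguous(ratings: dict[str, list[int]], threshold: float = 1.0) -> list[str]:
    """Return dimensions whose rater scores spread beyond the threshold."""
    return [dim for dim, scores in ratings.items()
            if len(scores) > 1 and stdev(scores) > threshold]


# Three raters scoring one output on two dimensions.
ratings = {
    "clarity": [4, 4, 5],  # raters mostly agree
    "humour":  [1, 5, 3],  # genuinely ambiguous: scores span the scale
}
print(flag_ambiguous(ratings))  # ['humour']
```

Dimensions that stay flagged across many outputs are candidates for the "structure but accept disagreement" bucket rather than for more rubric-tightening.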
What we won't ship
- Open-ended evals without rubrics.
- Rubrics without examples.
- Rubrics that fail to achieve inter-rater agreement.
- Rubrics that aren't maintained.
Close
Judging open-ended output requires rigorous rubrics: specific dimensions, clear definitions, anchored examples. With those in place, the team's evals become reliable, and the feature improves measurably because the team can finally grade it.
Related reading
- LLM-as-judge: when to trust it — implementation.
- Behavioural assertions — same fuzzy-quality discipline.
- Eval taxonomy — surrounding context.
We build AI-enabled software and help businesses put AI to work. If you're rubric-grading open-ended outputs, we'd love to hear about it. Get in touch.