A team chasing the last few percentage points of accuracy on a math-reasoning feature spent a week iterating the prompt. The prompt got more elaborate; accuracy ticked up by 1%. They tried generating 3 candidates per request and selecting the majority answer. Accuracy jumped 8%.
Self-consistency — sample N candidates, pick the most common answer — is a blunt but effective technique for tasks where reasoning leads to a discrete answer.
The voting pattern
The pattern:
- Generate N samples (typically 3-5) at non-zero temperature.
- Each sample is a complete reasoning + answer.
- Tally answers across samples.
- Pick the most common answer (or weight by confidence if available).
Works for tasks with discrete answers — classification, math, multiple choice, structured extraction.
Doesn't work as well for prose generation where there's no "answer" to vote on.
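The steps above fit in a few lines. A minimal sketch, assuming a hypothetical `generate(prompt)` callable that returns the extracted discrete answer from one sampled completion (a stand-in for your model call, not a real API):

```python
from collections import Counter

def self_consistency(generate, prompt, n=3):
    # Sample n complete reasoning + answer candidates.
    answers = [generate(prompt) for _ in range(n)]
    # Tally answers and pick the most common one.
    tally = Counter(answers)
    answer, votes = tally.most_common(1)[0]
    return answer, votes, tally

# Usage with a stub standing in for the model:
samples = iter(["42", "41", "42"])
answer, votes, _ = self_consistency(lambda p: next(samples), "2 * 21?", n=3)
# answer == "42", votes == 2
```

Returning the full tally alongside the winner matters later: a 2-1 split is a signal, not just noise.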
Cost shape
The cost is N times a single call. For N=3, triple the cost.
Worth it when:
- Accuracy gain matters more than cost.
- Task has discrete answers to vote on.
- Latency budget can absorb it (the N calls can run in parallel).
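Parallelism is what keeps the latency hit close to a single call even though cost is N calls. A sketch using a thread pool, again assuming a hypothetical `generate(prompt)` callable:

```python
from concurrent.futures import ThreadPoolExecutor

def sample_parallel(generate, prompt, n=3):
    # Cost is still n model calls, but wall-clock latency stays
    # close to one call when the calls run concurrently.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: generate(prompt), range(n)))
```

Threads are fine here because the calls are I/O-bound; an async client would work equally well.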
Failure modes
- Tied votes. What's the tie-break rule? Define it.
- Mode collapse. All samples produce the same wrong answer. Voting doesn't help; the underlying model is wrong.
- Long-tail variance. Many distinct answers, no clear majority. The task may be too uncertain for self-consistency to fix.
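The tie-break rule deserves to be explicit in code, not implicit in whatever `max` happens to return. One reasonable rule (an illustration, not the only choice): fall back to the earliest-sampled answer among the tied leaders, which is deterministic and cheap.

```python
from collections import Counter

def vote_with_tiebreak(answers):
    # Majority vote; on a tie, prefer the earliest-sampled answer
    # among the tied leaders (deterministic tie-break rule).
    tally = Counter(answers)
    top = max(tally.values())
    leaders = {a for a, c in tally.items() if c == top}
    if len(leaders) == 1:
        return leaders.pop()
    return next(a for a in answers if a in leaders)
```

Other defensible rules: re-sample one more candidate, or route ties to a fallback model. The point is to pick one and write it down.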
Reviewer signal
Cases where self-consistency disagreed (for example, N=3 produced a 2-1 split) are the team's signal:
- Was the majority right?
- Was the minority right?
- Both wrong?
These edge cases inform prompt iteration and eval-set growth.
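Surfacing those cases can be automated: keep the per-case sample answers and filter to the ones with any disagreement. A sketch with a hypothetical batch shape of `(case_id, answers)` pairs:

```python
from collections import Counter

def disagreement_cases(batch):
    # Filter a batch of (case_id, answers) pairs down to the cases
    # where the sampled answers disagreed -- the ones worth a human
    # look, and candidates for the eval set.
    return [
        (case_id, Counter(answers))
        for case_id, answers in batch
        if len(set(answers)) > 1
    ]
```

Feeding the flagged cases into the eval set turns the voting mechanism into a cheap disagreement detector.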
A real benchmark
A team's reasoning task:
- Base accuracy: 78%.
- N=3 self-consistency: 86%.
- N=5 self-consistency: 88%.
- N=10 self-consistency: 89%.
Diminishing returns set in past N=5. The team shipped N=3, the best cost-quality balance for their use case.
What we won't ship
Self-consistency on prose tasks without a voting mechanism that handles the prose.
N=10+ for marginal accuracy gain when N=3 is already at the goal.
Self-consistency in latency-critical paths where N parallel calls don't fit the budget.
Skipping the eval to confirm the gain. The gain might be smaller than expected on your specific task.
Close
Self-consistency is brute force that often beats clever. For tasks with discrete answers, sample N and vote. The accuracy gain is usually worth the cost. Skip the cleverness; ship the votes.
Related reading
- Judge pattern for confidence — alternative approach.
- Temperature, top-p, the production tradeoff — variance management.
We build AI-enabled software and help businesses put AI to work. If you're using self-consistency, we'd love to hear about it. Get in touch.