Jaypore Labs
Engineering

Regression cohorts: catching what evals miss

Real-world regressions cluster in input categories evals don't always cover. The discipline is mining production failures and adding them to the eval set as cohorts.

Yash Shah · March 26, 2026 · 3 min read

A team's eval was at 99%. Customer complaints rose anyway. Investigation: the complaints were about a category of input the eval didn't cover. The eval was clean; the production failures were on cases the eval set didn't represent.

Regression cohorts close the gap. When production produces a failure mode, that failure mode joins the eval set as a cohort. Future regressions in that area get caught.

Cohort design

A cohort is a subset of eval cases focused on a known failure pattern:

  • Cohort: "very long inputs."
  • Cohort: "industry jargon."
  • Cohort: "ambiguous queries."
  • Cohort: "specific edge case from incident #1234."

Each cohort has its own pass rate. The agent must pass all cohorts.
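A minimal sketch of this structure, assuming a simple exact-match eval. The `Cohort` class, the toy classifier, and the per-cohort bars are all illustrative, not a real framework; the point is that pass rates are tracked per cohort and every cohort must clear its own bar.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """A named subset of eval cases targeting one known failure pattern."""
    name: str
    cases: list                 # (input, expected) pairs
    min_pass_rate: float = 0.95  # each cohort has its own bar


def run_cohorts(cohorts, predict):
    """Return per-cohort pass rates plus the cohorts that miss their bar."""
    rates = {}
    for cohort in cohorts:
        passed = sum(1 for x, expected in cohort.cases if predict(x) == expected)
        rates[cohort.name] = passed / len(cohort.cases)
    failing = [c.name for c in cohorts if rates[c.name] < c.min_pass_rate]
    return rates, failing


# Usage: a toy classifier against two of the cohorts named above.
cohorts = [
    Cohort("very long inputs", [("x" * 5000, "long")], min_pass_rate=1.0),
    Cohort("ambiguous queries", [("it broke", "ambiguous")], min_pass_rate=1.0),
]
predict = lambda x: "long" if len(x) > 1000 else "ambiguous"
rates, failing = run_cohorts(cohorts, predict)
# The agent ships only if `failing` is empty.
```

Note the aggregate pass rate never appears: a 99% headline number can hide a cohort at 60%, which is exactly the blind spot this structure removes.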

Trigger conditions

Cohorts emerge from:

  • Production incidents. A specific bug becomes a cohort.
  • Customer complaints. A pattern of complaints becomes a cohort.
  • Drift detection. A drift pattern becomes a cohort.

Each addition to the eval set is documented: why this cohort was added.
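That documentation can live next to the cohort itself. A sketch of a provenance record, with hypothetical field names and a hypothetical reference ID; the shape simply forces "why was this added?" to be answered at the moment of addition:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CohortProvenance:
    """Why a cohort exists; written when the cohort is added, never after."""
    cohort: str
    source: str      # "incident", "complaint", or "drift"
    reference: str   # incident ID, ticket, or drift report (hypothetical below)
    added_on: date
    rationale: str


record = CohortProvenance(
    cohort="complaint-about-prior-interaction",
    source="complaint",
    reference="feedback-session-2026-Q1",  # hypothetical reference ID
    added_on=date(2026, 3, 1),
    rationale="Users unhappy with how the classifier handled complaints "
              "about a previous interaction; eval had no such cases.",
)
```

Freezing the record (`frozen=True`) is deliberate: provenance is an append-only log, not a field to be edited when the story gets inconvenient.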

Reviewer ritual

Quarterly review:

  • Are all cohorts still relevant?
  • Are pass rates being maintained?
  • Are any cohorts redundant?

Cohorts can retire when the underlying issue is structurally fixed.
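The quarterly review can be partly mechanized. A sketch with assumed thresholds (a 0.95 bar, and a four-quarter clean streak after a structural fix before a cohort becomes a retirement candidate); both numbers are illustrative policy, not prescriptions:

```python
def review_cohorts(rates, clean_quarters, bar=0.95, retire_after=4):
    """Flag cohorts below the bar, and cohorts that may be ready to retire.

    rates: cohort name -> current pass rate
    clean_quarters: cohort name -> consecutive quarters at 100% since the
                    underlying issue was structurally fixed
    """
    failing = [c for c, r in rates.items() if r < bar]
    retire = [c for c in rates
              if clean_quarters.get(c, 0) >= retire_after and c not in failing]
    return failing, retire


# Usage with made-up review data for three cohorts.
failing, retire = review_cohorts(
    {"very long inputs": 0.97, "incident #1234": 1.0, "industry jargon": 0.90},
    {"incident #1234": 5},
)
```

Retirement still needs a human judgment that the fix was structural, not lucky; the script only surfaces candidates for the "are any cohorts redundant?" question.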

A real catch

A team's customer-classifier eval sat at 99%. A customer-feedback session revealed users were unhappy with how the classifier handled "complaint about a previous interaction" inputs. The eval had none of these cases.

The team added a cohort: 30 cases of "complaint-about-prior-interaction" inputs. The classifier failed 40% of them. The team iterated the prompt; cohort pass rate rose to 92%. Customer complaints dropped.

The cohort approach surfaced the failure pattern; the eval-driven iteration fixed it.

Coverage

Cohort coverage is the team's defence against blind spots:

  • Each known failure pattern → cohort.
  • Each high-stakes use case → cohort.
  • Each compliance-relevant scenario → cohort.

The team's eval set isn't a flat list; it's a structured set of cohorts.

What we won't ship

Eval set as a flat list without cohort structure.

Pass rate as the only headline metric.

Adding cohorts without retiring obsolete ones.

Skipping the post-incident cohort addition. The post-mortem isn't done until the cohort lands.

Close

Regression cohorts are the discipline of catching what evals miss. Each cohort represents a known pattern. Each addition prevents recurrence. The team's eval set evolves toward fewer blind spots. Skip cohorts and the eval is misleading.

We build AI-enabled software and help businesses put AI to work. If you're tightening regression discipline, we'd love to hear about it. Get in touch.

Tagged: Testing, AI Engineering, Engineering, Testing for AI, Regressions