Test-data management for AI: synthetic vs. real

Test data for AI features balances realism, privacy, and reproducibility. Hybrid approaches usually win.

Yash Shah · March 19, 2026 · 2 min read

Test data for AI features is hard. Real production data captures real complexity but has privacy and compliance issues. Synthetic data is safe but often misses what makes production data hard.

The hybrid approach usually wins.

The hybrid approach

A typical mix:

Synthetic baseline. 60-80% of the test set. Generated cases covering the happy path and basic edge cases.
Sanitised real data. 20-30% of the test set. Real production cases with PII redacted.
Adversarial-by-hand. 5-10% of the test set. Specific attacks the team has authored.

Synthetic data gives volume. Sanitised real data gives realism. Adversarial cases close specific gaps. The ranges are guides, and the shares a team picks have to sum to 100% (65/25/10, for example).
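The mix can be made explicit in code rather than left to convention. A minimal sketch, assuming the three sources are already loaded as lists; the function name `build_test_set`, the default 65/25/10 split, and the `source` tag on each case are illustrative choices, not a prescribed layout.

```python
import random

def build_test_set(synthetic, sanitised, adversarial, total,
                   shares=(0.65, 0.25, 0.10), seed=0):
    """Draw a fixed-size test set from three pools according to `shares`."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")

    rng = random.Random(seed)  # local RNG, so the same seed gives the same draw
    n_syn = round(total * shares[0])
    n_real = round(total * shares[1])
    n_adv = total - n_syn - n_real  # remainder, so the counts always add up to total

    pools = [
        ("synthetic", synthetic, n_syn),
        ("sanitised", sanitised, n_real),
        ("adversarial", adversarial, n_adv),
    ]
    cases = []
    for source, pool, n in pools:
        if n > len(pool):
            raise ValueError(f"{source} pool has {len(pool)} cases, need {n}")
        # Tag each case with its source so failures can be broken down later.
        cases.extend({"source": source, "input": item} for item in rng.sample(pool, n))
    return cases
```

Tagging every case with its source is the useful part: a failure report can then show whether regressions come from synthetic, sanitised, or adversarial cases.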

Privacy

Real data in tests requires care:

PII redacted (names, emails, phones, IDs).
Sensitive content masked (medical, financial, legal).
Records aggregated to prevent re-identification.

The team needs a sanitisation pipeline. Not "we'll be careful manually", but actual tooling.
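As an illustration of what "actual tooling" can start from, here is a minimal sketch of a redaction pass. It assumes plain-text inputs and covers only emails and phone numbers with deliberately simple regular expressions; the patterns and the `[EMAIL]` / `[PHONE]` placeholders are assumptions for this example. Names, IDs, and medical or financial content need more than regexes (entity recognition, field-level rules, human review of samples).

```python
import re

# Deliberately simple patterns: illustrative, not exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace emails and phone-like digit runs with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

def sanitise_cases(cases):
    """Redact every case and fail loudly if an email pattern still matches."""
    cleaned = [redact(case) for case in cases]
    leaks = [c for c in cleaned if EMAIL.search(c)]
    if leaks:
        raise ValueError(f"{len(leaks)} case(s) still contain an email address")
    return cleaned
```

The leak check after redaction matters as much as the redaction itself: a pipeline that verifies its own output is what separates tooling from good intentions.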

Reproducibility

Test data should be reproducible:

Versioned in the repo or a tracked artifact store.
Same data across runs.
Updates documented.

A team that pulls fresh production data on every test run gets flaky tests and failures it cannot trace back to a data change.
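One way to enforce "same data across runs" is to pin a content hash of the test set and fail fast when it drifts. A minimal sketch, assuming the cases are JSON-serialisable; `dataset_fingerprint` and `assert_pinned` are hypothetical helper names, and where the pinned hash lives (repo file, artifact metadata) is left open.

```python
import hashlib
import json

def dataset_fingerprint(cases):
    """SHA-256 over a canonical JSON encoding of the cases."""
    # sort_keys and fixed separators make the encoding independent of dict key order.
    canonical = json.dumps(cases, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_pinned(cases, expected):
    """Raise if the loaded data does not match the pinned fingerprint."""
    actual = dataset_fingerprint(cases)
    if actual != expected:
        raise RuntimeError(
            f"test data changed: expected {expected[:12]}, got {actual[:12]}"
        )
```

Calling `assert_pinned` at the start of a test run turns a silent data change into an explicit failure, and updating the pinned hash becomes the documented update.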

A real strategy

A team's test-data setup:

Synthetic generator for typical inputs (regenerated quarterly).
Sanitised production sample (refreshed monthly with a new sanitisation pass).
Hand-authored adversarial cases (kept in repo).
Total: 800 cases across the three sources.

Each regeneration or refresh is stored as a new version, so the team can re-run any test against any version of the data.
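For the synthetic source, regeneration and reproducibility can coexist if the generator is seeded from a version tag. A minimal sketch, assuming a template-based generator; the templates, topics, and the `2026-Q1` style tag are invented for illustration. Python only guarantees a stable sequence for `random()` itself across releases, so storing the generated output as a versioned artifact is still the safer record.

```python
import hashlib
import random

# Hypothetical templates and topics, for illustration only.
TEMPLATES = [
    "Summarise: {topic}",
    "Translate to French: {topic}",
    "List three facts about {topic}",
]
TOPICS = ["invoices", "shipping delays", "password resets", "refund policy"]

def generate_synthetic(version, n):
    """Generate n synthetic cases; the same version tag yields the same cases."""
    # Built-in hash() is salted per process for strings, so derive the seed from SHA-256.
    digest = hashlib.sha256(version.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [
        {
            "id": f"{version}-{i:04d}",
            "input": rng.choice(TEMPLATES).format(topic=rng.choice(TOPICS)),
        }
        for i in range(n)
    ]
```

A quarterly regeneration then means bumping the tag, for example from `2026-Q1` to `2026-Q2`, while the earlier set stays reachable under its old tag.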

Trade-offs

Synthetic: safe and high-volume, but may miss real complexity.
Sanitised real: realistic, but carries compliance overhead.
Hand-authored: targeted, but slow to grow.

Each has a place.

What we won't ship

Tests with raw production data and PII.

Synthetic data that doesn't reflect real complexity.

Test data that isn't versioned.

Sanitisation done manually without tooling.
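Some of these rules can be checked mechanically before a dataset is accepted. A minimal sketch of such a gate, assuming a hypothetical manifest dict with `version`, `sanitiser`, and `cases` fields (none of these names come from a real tool). It covers only the version, tooling, and email-leak checks; whether synthetic data reflects real complexity still needs human judgement.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def ship_blockers(dataset):
    """Return the reasons a dataset should not ship; an empty list means it passes."""
    problems = []
    if not dataset.get("version"):
        problems.append("dataset has no version")
    if not dataset.get("sanitiser"):
        problems.append("no sanitisation tool recorded")
    for case in dataset.get("cases", []):
        # Only emails are checked here; a real gate would cover more PII types.
        if EMAIL.search(case.get("input", "")):
            problems.append(f"case {case.get('id')} contains an email address")
    return problems
```

Returning a list of reasons instead of a boolean keeps the failure message actionable when the gate runs in CI.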

Close

Test data for AI is engineering. Synthetic for volume, sanitised real for realism, adversarial for specifics. Privacy by design. Versioning by default. The team that balances these three sources tests comprehensively without legal or operational debt.
