Building evals that actually catch regressions
Most eval suites look impressive but miss real regressions. Building evals that catch what matters requires careful dataset construction, sensitive metrics, judge calibration, and a culture of trust. This article collects the patterns from teams that get this right.
You've built an eval suite. It passes. You deploy. Users immediately complain about regressions your eval didn't catch.
This is one of the most common, and most demoralizing, patterns in production LLM work: eval suites that look impressive but miss the regressions that matter. They give false confidence. The team trusts them, ships changes that hurt quality, and discovers the problem only when users surface it.
Building evals that actually catch regressions is harder than it looks. The basics (dataset, expected outputs, scoring) are easy. Making them a reliable signal is where the work is.
This article covers what we've learned about building evals that hold up — for teams that take production AI quality seriously.
What good evals do
A good eval suite, when run on a candidate change, gives you a high-confidence answer to: "is this change better, worse, or the same as the current version?"
To do that, it needs four properties:
- Catches regressions. When quality drops on real-world patterns, the eval score drops.
- Detects improvements. When quality improves, the score reflects it (not just on cherry-picked cases).
- Stable on non-changes. When nothing has changed materially, the score is consistent.
- Trustable for decisions. Decisions made based on eval scores correlate with actual user outcomes.
Each of these is harder than it sounds.
The dataset problem
Your eval is only as good as your dataset. Most failing eval suites fail here.
Common dataset failures
Failure 1: Cherry-picked easy cases. The dataset was constructed when the system was working, often by the engineer who built it. They chose cases that "made sense" — clear examples of each behavior. Real production traffic is messier. Hard cases, ambiguous cases, edge cases dominate failures but are under-represented in the dataset.
Failure 2: Stale dataset. The dataset was built six months ago. Since then, user behavior has shifted, new feature areas opened, new product capabilities released. The dataset is testing an old version of the world.
Failure 3: No coverage of regression-prone areas. The eval covers the "happy path" thoroughly but doesn't probe the boundaries where regressions actually happen.
Failure 4: Dataset contamination. Examples used in the eval also appear in the prompt or fine-tuning data. The model "remembers" them. Test scores are inflated; real performance is worse.
Failure 5: Imbalanced distribution. 80% of the dataset is one type of input; 5% is the long tail. A regression in the long tail barely moves the overall score even though it's a real regression.
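Contamination (Failure 4) can be screened for mechanically. The sketch below is one rough way to do it, assuming you can load your few-shot prompt examples and fine-tuning texts as plain strings. The function names, the sample data, and the default 8-word window are illustrative assumptions, not a standard.

```python
def word_ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(eval_inputs, training_texts, n: int = 8):
    """Return eval inputs that share at least one word n-gram with any training text."""
    seen = set()
    for text in training_texts:
        seen |= word_ngrams(text, n)
    return [case for case in eval_inputs if word_ngrams(case, n) & seen]


# Hypothetical data; a small n keeps the example short.
training = ["How do I reset my password if I lost my phone?"]
evals = [
    "how do I reset my password if I lost my phone? Please help.",
    "Can I get a refund after 30 days?",
]
flagged = find_contaminated(evals, training, n=5)
print(flagged)  # only the first eval input overlaps with the training text
```

Exact n-gram matching only catches verbatim or near-verbatim leaks. Paraphrased duplicates would need a fuzzier comparison.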
Building a strong dataset
A few principles that produce strong datasets:
Source from real production traffic. The best examples are real user inputs (with PII removed). They contain the actual messiness your system needs to handle.
A practical workflow:
- Capture a sample of production calls (with appropriate privacy controls).
- Hand-label expected outputs (or LLM-label, then human-verify).
- Add the case to the eval dataset.
- Refresh the dataset regularly (monthly is a good cadence).
Include diverse failure modes. Specifically include cases that have previously failed in production. These become regression tests — if you fix a bug, you add the failing case so it stays fixed.
Stratify the dataset. Categorize cases (easy/medium/hard, by topic, by user type, by length). Ensure each category has meaningful coverage. Track scores per category, not just overall.
Mix difficulty. Some easy cases (so you can detect catastrophic regressions). Some medium (the bulk of real traffic). Some hard (the edge cases). A dataset of only hard cases shifts wildly with small changes; a dataset of only easy cases doesn't budge for real regressions.
Size appropriately. Too small and your statistics are noisy. Too large and runs become slow and expensive. Typical sizes:
- Classification: 200-500 cases.
- Generation: 50-200 cases.
- Complex agent flows: 20-50 cases.
You can always start smaller and grow.
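To get a feel for how noisy a given dataset size is, a back-of-the-envelope margin of error helps. This sketch assumes binary pass/fail scoring and uses the normal approximation to the binomial, which is rough for small datasets or pass rates near 0% or 100%. The function name is hypothetical.

```python
import math


def pass_rate_margin(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_cases."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)


for n in (50, 200, 500):
    margin = pass_rate_margin(0.90, n)
    print(f"n={n}: 90% pass rate, roughly +/- {margin * 100:.1f} points")
```

At a 90% pass rate this works out to roughly 8 points of noise at 50 cases, about 4 at 200, and under 3 at 500. A 2-point drop on a 50-case suite is therefore hard to tell apart from chance.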
Refresh discipline. Schedule monthly review of the dataset. Add new failure modes. Retire stale cases. Update expected outputs if the desired behavior has changed.
The metric problem
What you measure determines what you optimize for. Picking bad metrics is the second-most-common eval failure.
Common metric failures
Failure 1: Single-number summaries hide problems. Overall accuracy at 90% can hide that one important category dropped from 95% to 70% while others improved. The average looks fine.
Fix: per-category breakdowns. Always.
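A small sketch of why the breakdown matters. The category names and counts are made up for illustration, and the helper names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """results: iterable of (category, passed) pairs -> (overall, {category: accuracy})."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    per_category = {c: passes[c] / totals[c] for c in totals}
    overall = sum(passes.values()) / sum(totals.values())
    return overall, per_category


def make_results(counts):
    """counts: {category: (n_passed, n_total)} -> flat list of (category, passed)."""
    rows = []
    for category, (n_passed, n_total) in counts.items():
        rows += [(category, True)] * n_passed + [(category, False)] * (n_total - n_passed)
    return rows


# Hypothetical runs: 80 billing cases and 20 refund cases each.
baseline = make_results({"billing": (76, 80), "refunds": (19, 20)})
candidate = make_results({"billing": (79, 80), "refunds": (14, 20)})

base_overall, base_cats = accuracy_by_category(baseline)
cand_overall, cand_cats = accuracy_by_category(candidate)
print(f"overall: {base_overall:.2f} -> {cand_overall:.2f}")
print(f"refunds: {base_cats['refunds']:.2f} -> {cand_cats['refunds']:.2f}")
```

In this made-up data the overall score moves from 95% to 93%, which looks like noise, while the refunds slice falls from 95% to 70%.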
Failure 2: Metric measures the wrong thing. A "correctness" metric on customer support responses might miss that the response is correct but rude. The metric doesn't capture tone.
Fix: multi-dimensional scoring. Correctness + tone + length + format separately.
Failure 3: Insensitive to severity. A wrong answer is treated the same whether it's a trivial mistake or a dangerous hallucination.
Fix: weighted scoring. Severe errors count more. Some errors count as 0 (safety violations); some count as -1 (catastrophic failures that are worse than giving no answer at all).
Failure 4: Bimodal sensitivity. With binary pass/fail scoring, the score moves either from 100% to 99% (imperceptible) or from 100% to 50% (catastrophic). Subtle regressions don't show.
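One way to express severity weighting is a lookup from error label to score. The labels and the 0.5 weight for minor errors are assumptions made for this sketch; the 0 and -1 values follow the convention described above.

```python
# Illustrative severity weights; tune these to your own error taxonomy.
SEVERITY_SCORE = {
    "correct": 1.0,
    "minor_error": 0.5,        # assumed weight, not from any standard
    "safety_violation": 0.0,
    "catastrophic": -1.0,      # worse than returning nothing
}


def weighted_score(labels):
    """Mean severity-weighted score over a list of per-case labels."""
    return sum(SEVERITY_SCORE[label] for label in labels) / len(labels)


def binary_accuracy(labels):
    """Plain accuracy: only 'correct' counts."""
    return sum(label == "correct" for label in labels) / len(labels)


run_a = ["correct"] * 9 + ["minor_error"]
run_b = ["correct"] * 9 + ["catastrophic"]

print(binary_accuracy(run_a), binary_accuracy(run_b))  # identical under binary scoring
print(weighted_score(run_a), weighted_score(run_b))    # run_b scores clearly lower
```

Both runs have the same binary accuracy, but the weighted score separates the run with a trivial mistake from the run with a catastrophic one.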
Fix: graded scoring. Each output scored on a continuous scale (1-5 or 0-1), not binary.
Failure 5: Aggregating across populations. Average across all users hides that performance on a 10% subgroup tanked.
Fix: dimensional breakdowns by relevant slices (user tier, query type, locale).
Building strong metrics
Multi-dimensional. Each output gets multiple scores. Correctness. Tone. Format. Safety. Length. Whatever matters.
Weighted. Some dimensions matter more than others. Weight them in any composite metric.
Calibrated severities. Within a dimension, distinguish minor from major errors. A factually wrong response is worse than one that is slightly off.
Per-category metrics. Report scores broken down by relevant slices. Easy/medium/hard. Per topic. Per user type.
Trend over time. A single score isn't useful; a trend is. Plot scores over time. Detect drift.
User-aligned. The metrics should correlate with what users actually care about. If users complain about rudeness, you measure rudeness. If users complain about length, you measure length.
The judge problem
When using LLM-as-judge (an LLM scores another LLM's output), the judge is your scorer. If the judge is biased or wrong, your evals are useless.
Common judge failures
Failure 1: Length bias. LLM judges tend to prefer longer responses. They'll score a longer response higher even when it's bloated.
Failure 2: Format bias. Judges prefer structured outputs (bullets, headings) over prose, regardless of which is better for the task.
Failure 3: Self-preference. When the judge model is from the same family as the model being judged, it tends to score its own family's outputs higher.
Failure 4: Sycophancy. Judges agree with whatever framing they're given. Telling the judge "the previous version was bad, score the new version" causes inflated new scores.
Failure 5: Lack of grounding. Judges score on impressions rather than specific criteria. Same prompt, different scoring on different runs.
Building strong judges
Calibrate against humans. Take a sample of judge scores; have a human re-score. Where they disagree, refine the judge prompt or accept that humans are needed for that dimension.
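A calibration check can be as simple as comparing paired scores. The sketch below assumes integer 1-5 scores from both the judge and a human on the same cases; the function name, the sample scores, and the choice of summary statistics are illustrative.

```python
def calibration_report(judge_scores, human_scores, tolerance=1):
    """Compare paired judge/human scores on the same cases."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs) / n,
        "mean_bias": sum(diffs) / n,  # positive means the judge is more lenient than humans
    }


# Hypothetical scores for eight cases.
judge = [5, 4, 4, 3, 5, 2, 4, 5]
human = [4, 4, 3, 3, 5, 2, 2, 4]
report = calibration_report(judge, human)
print(report)
```

Here the judge matches the human exactly on half the cases and lands within one point on most, but it runs about 0.6 points high on average. That kind of systematic leniency is worth fixing in the judge prompt before trusting absolute scores.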
Use different judge models. When possible, use a judge from a different model family than the system being evaluated. Reduces self-preference.
Be explicit about criteria. The judge prompt should specify exactly what good and bad look like, with examples. Vague criteria produce vague scoring.
Avoid leading framings. Don't tell the judge "score this on a scale where most outputs are good." Anchor on specific behaviors.
Anchored rubrics. Score on a defined rubric with examples for each level. "Score 5: factually accurate, specific, well-structured. Score 4: mostly accurate, may lack one detail. Score 3: ..." With anchors, the judge is consistent. Without, it drifts.
Force structured output. The judge produces structured scores (per dimension, with reasoning), not free-form text. Easier to aggregate and audit.
A strong judge prompt looks like:
You are evaluating an AI assistant's response.
User query: {query}
Assistant response: {response}
Score on the following dimensions, on a scale of 1-5:
1. Factual accuracy: Are all claims correct? (5 = all correct, 4 = mostly correct with minor issues, 3 = some incorrect, 2 = many incorrect, 1 = mostly wrong)
2. Relevance: Does the response address the user's actual question? (5 = perfectly addresses, 1 = doesn't address)
3. Completeness: Does the response contain enough information? (5 = complete, 1 = severely incomplete)
4. Tone: Is the response appropriately professional? (5 = perfect tone, 1 = inappropriate)
For each score, provide:
- The score
- A one-sentence specific reason
- The specific text in the response that supports your score
Output JSON: {"accuracy": {"score": N, "reason": "...", "evidence": "..."}, ...}
Run this judge against a calibration set (human-scored examples). Adjust until the judge agrees with humans within an acceptable margin.
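Before aggregating, it helps to validate the judge's JSON reply so malformed or out-of-range scores fail loudly instead of silently skewing averages. This is a minimal sketch. The prompt above only spells out the "accuracy" key, so the other three key names here are assumptions, as is the function name.

```python
import json

# Assumed key names: only "accuracy" appears in the prompt's example output.
DIMENSIONS = ("accuracy", "relevance", "completeness", "tone")


def parse_judge_output(raw: str) -> dict:
    """Parse a judge reply and return {dimension: score}; raise ValueError if invalid."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {dim}")
        score = entry.get("score")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"invalid score for {dim}: {score!r}")
        scores[dim] = score
    return scores


sample = json.dumps({
    dim: {"score": 4, "reason": "example", "evidence": "example"} for dim in DIMENSIONS
})
print(parse_judge_output(sample))
```

Cases that fail validation are probably best retried or counted separately, so that a flaky judge shows up as its own number rather than as a quality change.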
Designing eval suites for production
A few patterns that work in production:
Suite 1: The regression suite
A defined set of test cases (200-500) that the system must pass before any deploy. Curated from real failures, edge cases, and known-good patterns. Doesn't change often.
This is your "must not break" suite. CI integration: every PR runs it. Block merge on regression.
Suite 2: The smoke test
A small subset (10-30 cases) that runs frequently. Fast feedback during development. Catches catastrophic regressions immediately.
Integration: pre-commit hook, or quick smoke test before main eval.
Suite 3: The discovery suite
A larger, more diverse dataset (1000s of cases) sampled from real production traffic. Runs less frequently (weekly?). Looks for patterns the smaller suites might miss.
Integration: scheduled job. Results reviewed in weekly quality meetings.
Suite 4: The online suite
A sample of real production traffic, scored automatically (LLM judge) or via user signals. Catches drift that offline evals miss.
Integration: continuous, dashboard-based.
Suite 5: The user-flow suite
End-to-end tests that exercise full user flows, not just single LLM calls. For agent systems and multi-step workflows.
Integration: pre-deploy of major changes.
A mature production system has all five. Start with the regression suite; add others as capacity grows.
Cost and time discipline
Evals cost money (LLM calls) and time (running them, reviewing them).
A 500-case eval with a flagship judge might cost €5-20 per run. Run it on every PR (10/day) and you're at €50-200/day. Manageable but real.
Strategies:
Run smoke tests in PR; run full suites on merge. Saves cost on PRs that won't merge.
Cache results. If neither the prompt nor the model changed, you don't need to re-run. Cache by (prompt_version, model, dataset_version).
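A sketch of that cache key, using an in-memory dict as a stand-in for whatever store you actually use. All names here are hypothetical. If the judge prompt or judge model can change independently, it likely belongs in the key too.

```python
import hashlib
import json


def cache_key(prompt_version: str, model: str, dataset_version: str) -> str:
    """Stable key derived from the three inputs that determine an eval result."""
    payload = json.dumps([prompt_version, model, dataset_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


_cache = {}  # stand-in for a persistent store


def run_eval_cached(prompt_version, model, dataset_version, run_eval):
    """Call run_eval() only if this exact combination has not been evaluated before."""
    key = cache_key(prompt_version, model, dataset_version)
    if key not in _cache:
        _cache[key] = run_eval()
    return _cache[key]


calls = []


def fake_run():
    """Hypothetical eval run that records how often it is invoked."""
    calls.append(1)
    return {"accuracy": 0.93}


run_eval_cached("prompt-v7", "model-a", "ds-2024-05", fake_run)
run_eval_cached("prompt-v7", "model-a", "ds-2024-05", fake_run)  # served from cache
run_eval_cached("prompt-v8", "model-a", "ds-2024-05", fake_run)  # prompt changed, re-run
print(len(calls))
```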
Use cheaper judges where possible. A cheaper judge model with good calibration is often acceptable.
Parallelize. Eval runs are embarrassingly parallel. Use concurrency.
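Because eval cases are mostly I/O-bound API calls, a thread pool is usually enough. In this sketch `score_case` is a placeholder for your real model-plus-judge call, and the worker count is an arbitrary assumption to tune against your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def score_case(case: dict) -> dict:
    """Placeholder scorer; a real one would call the model and then the judge."""
    return {"id": case["id"], "score": len(case["input"]) % 5 + 1}


def run_parallel(cases, max_workers: int = 8):
    """Score all cases concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_case, cases))


cases = [{"id": i, "input": "x" * i} for i in range(20)]
results = run_parallel(cases)
print(results[:3])
```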
Sample rather than full run. For routine checks, sample 50 cases from a 500-case dataset. Full run on important changes.
For time, the budget is usually "how long can a PR wait?" Aim for full eval in <15 minutes. Beyond that, developers context-switch and lose productivity.
Operational patterns
A few practices that distinguish mature eval programs:
Pattern 1: Eval-gated deploys
Production deploys of LLM-affecting changes are gated on eval results. Score regressed below threshold? Deploy blocks. Engineer investigates.
The threshold is usually relative: "score must be within 2% of baseline." Allows for noise but catches material regressions.
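A relative gate can be a few lines in CI. This sketch reads "within 2% of baseline" as a relative drop of at most 2%, which is one interpretation; the function name and sample scores are hypothetical.

```python
def gate(baseline: float, candidate: float, max_relative_drop: float = 0.02):
    """Return (passed, relative_change) for a candidate score against the baseline."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    relative_change = (candidate - baseline) / baseline
    return relative_change >= -max_relative_drop, relative_change


for candidate in (4.15, 4.00):
    passed, change = gate(4.20, candidate)
    print(f"4.20 -> {candidate:.2f}: {change:+.1%} {'PASS' if passed else 'BLOCK'}")
```

Under this threshold a move from 4.20 to 4.15 passes (about a 1% drop) while 4.20 to 4.00 blocks (nearly 5%). Applying the same gate per slice, not only to the aggregate, keeps a contained regression from hiding in the average.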
Pattern 2: Investigation when scores move
Any meaningful score movement (up or down) gets investigated. Score went up? Why? Did we improve, or did the test get easier? Score went down? Where exactly? Is the regression contained to one slice?
Don't celebrate or panic on aggregate movements; investigate.
Pattern 3: Continuous dataset curation
The dataset isn't a one-time build. It's curated continuously. Process:
- User complains about a response → add to dataset as a regression test.
- New feature ships → add cases covering it.
- Model behavior surprises someone → if it's a real failure, add it.
- Stale cases (no longer relevant): retire.
A monthly review meeting works well. The dataset is treated as a living asset.
Pattern 4: Diff-based review
When reviewing an eval result, focus on diffs from the baseline:
- Cases where the new version scored higher than baseline (improvements).
- Cases where the new version scored lower (regressions).
- Cases where score stayed the same (no signal).
The diff is the signal. Reviewing 500 cases linearly is impractical; reviewing 30 diffs is doable.
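Extracting the diff is mechanical once per-case scores are stored. This sketch assumes each run is a mapping from case ID to a numeric score; the 0.5 minimum delta is an arbitrary assumption to filter out judge noise, and the names are hypothetical.

```python
def diff_cases(baseline: dict, candidate: dict, min_delta: float = 0.5):
    """Return (regressions, improvements) as lists of (case_id, delta), worst/best first."""
    regressions, improvements = [], []
    for case_id in baseline.keys() & candidate.keys():
        delta = candidate[case_id] - baseline[case_id]
        if delta <= -min_delta:
            regressions.append((case_id, delta))
        elif delta >= min_delta:
            improvements.append((case_id, delta))
    regressions.sort(key=lambda item: (item[1], item[0]))    # most negative first
    improvements.sort(key=lambda item: (-item[1], item[0]))  # largest gain first
    return regressions, improvements


# Hypothetical per-case scores.
baseline = {"c1": 4.5, "c2": 3.0, "c3": 4.0, "c4": 5.0}
candidate = {"c1": 4.5, "c2": 4.0, "c3": 2.5, "c4": 4.8}
regressions, improvements = diff_cases(baseline, candidate)
print("regressions:", regressions)
print("improvements:", improvements)
```

Only cases present in both runs are compared, so newly added or retired cases need separate handling.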
Pattern 5: Human review of judge disagreements
Periodically (weekly?), sample cases from both ends of the judge's range: some it scored low and some it scored high. A human reviews each one and checks: do you agree with the judge?
Disagreements reveal:
- Judge biases (calibrate the judge prompt).
- True quality issues (fix the system).
- Dataset issues (the case's expected output is wrong).
This is how you maintain judge quality over time.
Pattern 6: Quarterly eval review
A quarterly retrospective on the eval program itself:
- What real regressions did the eval suite catch this quarter?
- What real regressions did it miss?
- What false alarms did it produce?
- What gaps in coverage do we know about?
- What's the dataset's coverage of current production behaviors?
This is meta-eval: evaluating the evaluation. Without it, the eval program degrades.
A worked example: customer support eval
To make this concrete, here's a production-grade eval setup for an AI customer support response system.
Goal: ensure the AI's responses to customer queries are accurate, helpful, on-brand, and safe.
Datasets:
- Regression suite (300 cases):
  - 50 easy cases (clear policies, simple answers).
  - 100 medium cases (typical complexity).
  - 100 hard cases (ambiguous, sensitive, multi-part).
  - 50 known-failure cases (previously regressed scenarios).
- Discovery suite (1000 cases): Sampled monthly from real production traffic, PII-scrubbed.
- Adversarial suite (50 cases): Specifically crafted prompts attempting to extract info, get refunds for non-eligible cases, manipulate the AI.
Metrics:
For each case, judge scores on:
- Factual accuracy (1-5)
- Policy compliance (1-5)
- Tone appropriateness (1-5)
- Completeness (1-5)
- Length appropriateness (1-5)
- Safety (binary: pass/fail)
Judges:
- Primary judge: Claude, from a different provider than the system under test (which uses GPT). Picking a different provider reduces same-family self-preference bias.
- Calibrated against human reviewer on 100 reference cases.
- Re-calibration quarterly.
Aggregation:
- Average score per dimension per slice (slices: ticket type, customer tier, language).
- Pass rate on safety (must be 100%).
- Per-case weighted score (used for diffs).
Operations:
- Smoke test (30 cases) on every PR.
- Full regression suite (300 cases) on PR merge.
- Discovery suite (1000 cases) weekly.
- Adversarial suite (50 cases) before any prompt or model change.
- Online sampling of 1% of production traffic, judged in real-time.
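For the online sampling step, one possible approach is to hash a request ID so the same request is always in or out of the sample, which makes results reproducible across services. This is an assumption about how one might implement it, not a description of any particular deployment; the function name and ID format are hypothetical.

```python
import hashlib


def in_online_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform value in [0, 1]
    return bucket < rate


sampled = [f"req-{i}" for i in range(10_000) if in_online_sample(f"req-{i}")]
print(f"sampled {len(sampled)} of 10000 requests")
```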
Review:
- Monthly meeting: review eval trends, new failure modes, dataset updates.
- Quarterly meeting: meta-eval, judge calibration, dataset audit.
Outcomes (from real deployments of similar systems):
- ~3 regression catches per month that would have shipped without evals.
- ~1 false alarm per month (eval flags a regression that's actually fine).
- Drift detected within days rather than weeks/months.
- Confidence in shipping prompt and model changes increased.
This is what production-grade evals look like. Not a quick weekend project; an ongoing investment with real ROI.
Common pitfalls
A few patterns we see repeatedly:
Pitfall 1: Building evals after the product ships. "We'll add evals later." Later never comes. Build them from day one.
Pitfall 2: Single-engineer ownership. One person builds the eval; nobody else maintains it. When they leave, the eval rots. Distribute ownership.
Pitfall 3: Treating evals as static. Built once, never updated. Becomes useless as the product evolves. Treat the eval as a living asset.
Pitfall 4: Trusting eval scores blindly. The eval said it's better, so ship it. Without human spot-check, you ship things the eval rated highly that users hate. Pair evals with human review on important changes.
Pitfall 5: Optimizing to the eval. Tuning prompts specifically to score well on the eval. The eval improves; reality doesn't. Watch out for this — if eval improvements aren't matched by production improvements, you're optimizing to the test.
Pitfall 6: Ignoring infrastructure cost. Evals at scale are expensive (LLM calls add up). Without cost tracking, you find out at the end of the month.
Pitfall 7: No clear pass/fail criteria. "Score went from 4.2 to 4.0 — is that a regression?" Define thresholds in advance. Stick to them.
Pitfall 8: Eval suite dominates testing energy. Building elaborate evals while basic correctness tests are missing. Evals catch quality drift; they don't replace basic correctness tests.
The cultural part
The hardest part of production evals is cultural. Engineers and product people need to:
- Trust the evals enough to gate deploys on them. Without this, evals are theater.
- Distrust the evals enough to investigate movements. Blind trust leads to optimizing-to-the-test.
- Invest in dataset curation as ongoing work. Not a one-time project.
- Accept that evals don't replace human judgment. They reduce the surface area for humans to review.
- Recognize when an eval suite isn't working. When real regressions slip through, the eval suite is the problem, not the user complaining.
Teams that have this culture ship faster and more reliably. Teams that don't eventually find themselves either over-relying on broken evals or paralyzed by the absence of them.
The takeaway
Building evals that actually catch regressions is harder than it looks, but it is tractable.
The keys:
- Real, diverse datasets sourced from production reality.
- Multi-dimensional, slice-aware metrics that align with user experience.
- Calibrated judges that you've actually checked against human judgment.
- Multiple eval suites for different concerns (regression, smoke, discovery, online, end-to-end).
- Operational discipline: eval-gated deploys, continuous dataset curation, investigation of movements.
- Cultural commitment to using and improving the evals over time.
Done well, evals become the most valuable infrastructure in your AI stack. They're how you ship quickly while maintaining quality. Without them, you're flying blind.
Build the evals. Trust the evals. Improve the evals. That's how production AI quality gets sustained.