Evals for non-engineers: know if your AI workflow is getting better or worse
Evals, the systematic measurement of AI output quality, are usually treated as an engineering concern. But every team running AI workflows needs them, and the basics are accessible without code. This is the how-to.
Outcome: Measure whether an AI workflow is improving by using examples, rubrics, and regression checks.
Here's a pattern we see constantly. A team builds an AI workflow — content drafting, customer support classification, sales email generation, whatever. It works well in week 1. The team is delighted. They roll it out.
Three months later, something feels off. Output quality seems worse. Customers complain. Reps stop using it. Nobody knows when it changed, or why.
The cause is almost always the same: nobody was measuring. The workflow that worked in week 1 might have degraded gradually, or the underlying model changed, or the prompts drifted, or the input distribution shifted. Without measurement, you don't catch it until users complain — and by then, you've lost trust.
The fix is evals — systematic measurement of AI output quality. Evals are usually treated as an engineering concern, but every team running AI workflows needs them. And the basics are accessible without code.
This article covers what evals are, why they matter, and how to set them up for any AI workflow without being an engineer.
Evals are not reporting theater. A useful eval creates a decision: ship, hold, roll back, or investigate. If a score cannot change what the team does, simplify the eval until it can.
What evals are (and aren't)
An eval is a way to measure how good your AI's output is, systematically, against examples you control.
The components:
- A dataset. A set of inputs (the things your AI processes).
- Expected behavior. What you want the AI to do with those inputs.
- A scoring method. How you measure if the AI did it right.
- A run/report. Process the dataset, score each output, summarize the result.
The point is to have a repeatable way to ask: "is the AI doing what I want, at the quality I expect?" — and to spot when the answer changes.
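For readers who do want to see the four components in code, here is a minimal Python sketch of how they fit together. The names `EvalCase`, `run_eval`, `exact_match`, and `toy_workflow` are hypothetical, and `toy_workflow` is only a stand-in for whatever AI workflow you actually run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One row of the dataset: an input plus the behavior we expect."""
    case_id: int
    input_text: str
    expected: str


def exact_match(expected: str, actual: str) -> int:
    """Scoring method: 1 point if the output matches, 0 otherwise."""
    return 1 if expected == actual else 0


def run_eval(cases: list[EvalCase],
             workflow: Callable[[str], str],
             score: Callable[[str, str], int]) -> dict:
    """Run/report: process every case, score each output, summarize."""
    rows = []
    for case in cases:
        actual = workflow(case.input_text)
        rows.append({"id": case.case_id, "actual": actual,
                     "score": score(case.expected, actual)})
    return {"rows": rows,
            "passed": sum(r["score"] for r in rows),
            "total": len(rows)}


def toy_workflow(text: str) -> str:
    """Hypothetical stand-in for the real AI workflow."""
    return "billing" if "cancel" in text.lower() else "account-access"


cases = [
    EvalCase(1, "My password isn't working", "account-access"),
    EvalCase(2, "I want to cancel my subscription", "billing"),
    EvalCase(3, "Your latest update broke my workflow", "bug"),
]
report = run_eval(cases, toy_workflow, exact_match)
print(f"{report['passed']}/{report['total']} correct")
```

The same loop works whether the scoring function is an exact match, a property check, or a call to a judge model; only `score` changes.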
Evals are not:
- One-off testing during initial build.
- Spot-checking when something seems off.
- User feedback (which is helpful, but is reactive, slow, and biased).
- Vibes ("the output feels right").
Real evals run on a schedule, against a defined set of inputs, with consistent scoring. They give you signal even when no one complains.
Why "engineers' tools" are wrong for most teams
If you Google "LLM evals", you get articles about tools like Promptfoo, LangSmith, Braintrust, Helicone. They're great. They're also designed for engineers shipping LLM-powered products at scale.
For most teams running AI workflows — marketing, sales, ops, support — these tools are overkill. You need something simpler: a way to measure your specific workflow without learning a tooling stack.
The good news: with a spreadsheet and an LLM, you can do this. Not at the sophistication of LangSmith, but enough to catch most quality issues.
The four eval patterns
There are four common eval patterns. Each fits a different type of workflow.
Pattern 1: Exact match
Use when there's a single correct answer.
Example workflow: classifying customer support tickets into 8 categories.
Dataset: 50 tickets with their correct category. Scoring: AI's answer either matches the correct category (1 point) or doesn't (0 points). Output: % correct.
This works well for classification, extraction, simple structured outputs.
Pattern 2: Reference comparison
Use when there's a known good answer to compare against.
Example workflow: drafting product descriptions.
Dataset: 30 products with "reference" descriptions you wrote. Scoring: how close is the AI's description to the reference, on dimensions like accuracy, tone, completeness?
You can score this manually (a human reads both and scores 1-5) or with an LLM judge (next pattern). Reference comparison is the gold standard for content workflows.
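If a human reviewer scores each AI description against its reference, the per-dimension averages are easy to compute. The sketch below assumes a 1-5 score on the three dimensions named above; the product IDs and scores are made-up illustration data, and `dimension_averages` and `weakest_dimension` are hypothetical helper names.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "tone", "completeness")

# Hypothetical reviewer scores (1-5) comparing each AI description
# with the reference description for the same product.
reviews = [
    {"product": "P-001", "accuracy": 5, "tone": 4, "completeness": 4},
    {"product": "P-002", "accuracy": 4, "tone": 3, "completeness": 5},
    {"product": "P-003", "accuracy": 3, "tone": 4, "completeness": 3},
]


def dimension_averages(rows):
    """Average score per dimension, rounded to two decimals."""
    return {dim: round(mean(r[dim] for r in rows), 2) for dim in DIMENSIONS}


def weakest_dimension(averages):
    """The dimension with the lowest average, i.e. where to look first."""
    return min(averages, key=averages.get)


averages = dimension_averages(reviews)
print(averages, "weakest:", weakest_dimension(averages))
```

Reporting the weakest dimension alongside the averages keeps the score tied to an action: it tells you which part of the prompt or examples to revisit first.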
Pattern 3: LLM-as-judge
Use when the output has many valid forms but quality is judgeable.
Example workflow: generating personalized sales emails.
Dataset: 30 prospect profiles. Scoring: an LLM acts as a judge — given the input and the output, score on dimensions like specificity, professionalism, length, voice match.
LLM-as-judge is powerful but requires careful prompt design. A common pattern:
You are evaluating a sales email for quality. Score it on these dimensions:
1. Specificity (1-5): Does it reference specific facts about the prospect, not generic flattery?
2. Professionalism (1-5): Does it sound like a peer rather than spam?
3. Length appropriateness (1-5): Is it concise (40-80 words)?
4. Voice match (1-5): Does it match our voice (direct, no buzzwords)?
For each dimension, give the score and a one-sentence reason.
Output JSON: {"specificity": {"score": N, "reason": "..."}, ...}
Prospect profile: [input]
Email to evaluate: [output]
The judge LLM provides consistency that human reviewers can't match (one judge model, one prompt, applied uniformly). Calibrate it against human judgment to make sure it agrees with you on the patterns that matter.
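Judge models occasionally return malformed JSON or out-of-range scores, so it helps to validate each reply before recording it. The following is a sketch under stated assumptions: only the `specificity` key appears in the prompt above, so the other three key names are guesses at how the elided part of the format would be spelled, and `parse_judge_reply` is a hypothetical helper.

```python
import json

# "specificity" comes from the judge prompt; the other key names are assumed.
DIMENSIONS = ("specificity", "professionalism", "length", "voice_match")


def parse_judge_reply(raw: str) -> dict[str, int]:
    """Return {dimension: score}; raise ValueError on a malformed reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    scores = {}
    for dim in DIMENSIONS:
        entry = data.get(dim)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"missing score for {dim!r}")
        score = entry["score"]
        # bool is a subclass of int in Python, so rule it out explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"score for {dim!r} is not an integer")
        if not 1 <= score <= 5:
            raise ValueError(f"score for {dim!r} is outside 1-5")
        scores[dim] = score
    return scores


# Made-up reply in the format the prompt asks for.
sample_reply = json.dumps({
    "specificity": {"score": 4, "reason": "Mentions the recent funding round."},
    "professionalism": {"score": 5, "reason": "Reads like a peer."},
    "length": {"score": 3, "reason": "Slightly over 80 words."},
    "voice_match": {"score": 4, "reason": "Direct, one buzzword."},
})
scores = parse_judge_reply(sample_reply)
print(scores)
```

Rejecting a bad reply outright, rather than guessing a score, keeps silent parsing errors from showing up later as a fake quality drop.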
Pattern 4: Property check
Use when you can express what "good" means as specific testable properties.
Example workflow: generating product titles for an e-commerce store.
Dataset: 50 products. Properties to check on each output:
- Length is 30-70 characters.
- Includes the brand name.
- Includes at least one of the product's key attributes.
- Doesn't use forbidden marketing words ("amazing", "best", "revolutionary").
Each property is a yes/no test. Score = % of properties passing across all outputs.
Property checks are great for structured constraints that should always hold. They run quickly and catch specific drift.
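The four properties above translate almost directly into code. This is a minimal sketch, not a definitive implementation: the brand, attributes, and titles are invented examples, and `check_title` and `property_pass_rate` are hypothetical names.

```python
import re

FORBIDDEN_WORDS = ("amazing", "best", "revolutionary")


def check_title(title: str, brand: str, attributes: list[str]) -> dict[str, bool]:
    """Run each property as a yes/no test on one generated title."""
    lowered = title.lower()
    # Split into whole words so "bestseller" is not flagged for "best".
    words = re.findall(r"[a-z']+", lowered)
    return {
        "length_30_to_70": 30 <= len(title) <= 70,
        "has_brand": brand.lower() in lowered,
        "has_key_attribute": any(a.lower() in lowered for a in attributes),
        "no_forbidden_words": not any(w in words for w in FORBIDDEN_WORDS),
    }


def property_pass_rate(results: list[dict[str, bool]]) -> float:
    """Score = % of properties passing across all outputs."""
    checks = [passed for result in results for passed in result.values()]
    return 100 * sum(checks) / len(checks)


# Invented outputs for a hypothetical "Acme" store.
results = [
    check_title("Acme Trail Runner Waterproof Hiking Boots",
                "Acme", ["waterproof", "leather"]),
    check_title("The Best Boots Ever", "Acme", ["waterproof", "leather"]),
]
print(f"{property_pass_rate(results):.0f}% of property checks passed")
```

Keeping each property as its own named entry makes it obvious which constraint started failing when the overall rate drops.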
How to set up your first eval
A practical, non-engineer-friendly setup:
Step 1: Pick the workflow
Choose one workflow. Don't try to eval everything at once. Pick the one you most worry about quality on, or the one that's most consequential.
Example: "the AI that classifies inbound customer support tickets by topic."
Step 2: Build the dataset
Create a list of 20-50 representative examples. Include:
- Easy cases (clearly category A).
- Hard cases (could be A or B).
- Edge cases (don't fit any category cleanly).
- Common variations (different phrasing of the same intent).
Capture them in a spreadsheet or Google Sheet:
| ID | Input | Expected Output |
|----|-------|-----------------|
| 1 | "My password isn't working" | "account-access" |
| 2 | "I want to cancel my subscription" | "billing" |
| 3 | "Your latest update broke my workflow" | "bug" |
| ... | ... | ... |
This dataset is your eval set. It shouldn't change often — its purpose is to be a stable reference.
Step 3: Define scoring
For each example, what counts as a correct answer? Be precise.
For classification: exact match on category. For content: a 1-5 score on each of 2-4 named dimensions. For extraction: each field correct/incorrect.
Write down the scoring rubric. Stick to it.
Step 4: Run the workflow on the dataset
Run your AI workflow on each example in the dataset. Capture the output in a new column.
For classification, you can do this in a spreadsheet with an LLM add-on for Google Sheets, or by manual copy-and-paste.
For more complex workflows, dump the inputs into a tool like Promptfoo or just run a batch job once a week.
| ID | Input | Expected | Actual |
|----|-------|----------|--------|
| 1 | ... | "account-access" | "account-access" |
| 2 | ... | "billing" | "billing" |
| 3 | ... | "bug" | "feature-request" |
| ... | ... | ... | ... |
Step 5: Score
For exact match: add a column "match" with 1 if Expected = Actual, 0 otherwise. Sum the column and divide by the number of examples: that's your accuracy.
For LLM-as-judge: run a judge prompt for each output. Capture scores.
For property check: run each property as a separate test. Aggregate.
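If the spreadsheet outgrows manual formulas, the exact-match scoring can be done on a CSV export instead. The sketch below assumes the export has `id`, `expected`, and `actual` columns (hypothetical column names) and uses the three example rows shown in Step 4.

```python
import csv
import io

# Stand-in for a CSV export of the eval spreadsheet.
SHEET_CSV = """id,expected,actual
1,account-access,account-access
2,billing,billing
3,bug,feature-request
"""


def score_rows(csv_text: str) -> list[dict]:
    """Add a 'match' column: 1 if expected equals actual, 0 otherwise."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        same = row["expected"].strip() == row["actual"].strip()
        row["match"] = 1 if same else 0
    return rows


def accuracy_percent(rows: list[dict]) -> float:
    """Matches divided by number of examples, as a percentage."""
    return 100 * sum(r["match"] for r in rows) / len(rows)


rows = score_rows(SHEET_CSV)
failures = [r["id"] for r in rows if r["match"] == 0]
print(f"accuracy: {accuracy_percent(rows):.1f}%, failing IDs: {failures}")
```

Listing the failing IDs next to the percentage saves a step later, because the first question after any drop is which cases failed.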
Minimum viable scorecard
For a first eval, track fewer dimensions but make each one actionable.
| Dimension | Question | Pass threshold | Action if below threshold |
| --- | --- | --- | --- |
| Correctness | Did the workflow produce the right answer or classification? | 90% | Inspect failures before release |
| Safety | Did it avoid prohibited content, unsupported claims, or risky actions? | 100% | Block release |
| Format | Did it return the expected structure? | 95% | Fix prompt/schema before release |
| Usefulness | Would a user reasonably accept this output? | 4/5 average | Revise examples or instructions |
| Regression | Did known past failures stay fixed? | 100% | Block release |
The scorecard should name an owner and a release rule. "Below 90% correctness means product owner review" is stronger than "track correctness."
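One way to make the scorecard actionable is to encode each threshold together with its action, so a run produces a to-do list rather than just numbers. This is a sketch: the thresholds and actions are copied from the scorecard, while the measured values and the name `actions_needed` are invented for illustration.

```python
# dimension -> (pass threshold, action if below threshold)
# Percentages for most rows; usefulness is an average on a 1-5 scale.
SCORECARD = {
    "correctness": (90.0, "Inspect failures before release"),
    "safety": (100.0, "Block release"),
    "format": (95.0, "Fix prompt/schema before release"),
    "usefulness": (4.0, "Revise examples or instructions"),
    "regression": (100.0, "Block release"),
}


def actions_needed(measured: dict[str, float]) -> dict[str, str]:
    """Return the action for every dimension that is below its threshold."""
    actions = {}
    for dimension, (threshold, action) in SCORECARD.items():
        if measured[dimension] < threshold:
            actions[dimension] = action
    return actions


# Hypothetical results from one eval run.
measured = {
    "correctness": 88.0,
    "safety": 100.0,
    "format": 96.0,
    "usefulness": 4.2,
    "regression": 100.0,
}
needed = actions_needed(measured)
print(needed or "All thresholds met")
```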
Step 6: Summarize
A summary table like:
| Eval Date | Score | Notes |
|-----------|-------|-------|
| 2026-05-01 | 47/50 (94%) | Baseline. 3 errors: tickets 8, 23, 41. |
| 2026-05-08 | 46/50 (92%) | Stable. 4 errors. |
| 2026-05-15 | 44/50 (88%) | Dropped. New errors on tickets 12, 35. |
Over time, this gives you a quality trajectory. Drops trigger investigation.
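Deciding what counts as a "drop" is easier if the rule is written down. The sketch below flags any run that falls a fixed number of percentage points below the baseline; the 5-point tolerance is an assumption chosen for illustration, not a recommendation from this article, and `flag_drops` is a hypothetical name. The run data comes from the summary table.

```python
# (eval date, correct, total) from the summary table.
runs = [
    ("2026-05-01", 47, 50),
    ("2026-05-08", 46, 50),
    ("2026-05-15", 44, 50),
]

DROP_TOLERANCE_POINTS = 5.0  # assumed threshold; agree on your own


def flag_drops(runs, tolerance):
    """Return (date, percent) for runs that fall too far below the baseline.

    The first run in the list is treated as the baseline.
    """
    _, base_correct, base_total = runs[0]
    baseline_pct = 100 * base_correct / base_total
    flagged = []
    for date, correct, total in runs[1:]:
        pct = 100 * correct / total
        if baseline_pct - pct >= tolerance:
            flagged.append((date, pct))
    return flagged


flagged = flag_drops(runs, DROP_TOLERANCE_POINTS)
for date, pct in flagged:
    print(f"{date}: {pct:.0f}% is below tolerance, investigate")
```

Comparing against the baseline rather than the previous week catches slow decay, where each weekly step looks small on its own.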
Step 7: Schedule
Run the eval on a regular schedule. Weekly is plenty for most workflows. After any change to the prompts or model, run the eval before deploying.
A 30-minute weekly habit. Set a recurring calendar block. Don't skip.
The companion scorecard linked from this article is designed for this first weekly run.
Add a release gate
Evals matter most when they sit in front of change. For any AI workflow that touches customers, operational records, or team decisions, use a small release gate:
- Baseline. Current production workflow has a recorded score.
- Candidate. New prompt, model, tool, or workflow step is run against the same eval set.
- Comparison. The candidate must preserve safety and regression scores, and must not reduce the primary quality score beyond the agreed tolerance.
- Decision. Ship, hold, revise, or roll back. Record the reason.
- Post-release check. Re-run on a small sample of real cases after launch.
This does not need to be automated on day one. A spreadsheet with a named approver is enough if it consistently prevents unmeasured changes from going live.
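If the gate is later automated, the comparison step can be sketched as a small function. Everything here is illustrative: the score names (`quality`, `safety`, `regression`), the 2-point tolerance, and the function name `release_decision` are assumptions, and the sketch only returns "ship" or "hold", leaving "revise" and "roll back" to the human approver.

```python
def release_decision(baseline: dict, candidate: dict,
                     tolerance: float = 2.0) -> tuple[str, str]:
    """Compare a candidate run with the baseline on the same eval set.

    Returns (decision, reason) so the reason can be recorded.
    """
    if candidate["safety"] < baseline["safety"]:
        return "hold", "safety score regressed"
    if candidate["regression"] < baseline["regression"]:
        return "hold", "a previously fixed failure came back"
    drop = baseline["quality"] - candidate["quality"]
    if drop > tolerance:
        return "hold", f"quality dropped {drop:.1f} points (tolerance {tolerance})"
    return "ship", "safety and regression preserved, quality within tolerance"


# Hypothetical scores, all as percentages.
baseline = {"quality": 94.0, "safety": 100.0, "regression": 100.0}
candidate_a = {"quality": 93.0, "safety": 100.0, "regression": 100.0}
candidate_b = {"quality": 96.0, "safety": 98.0, "regression": 100.0}

print(release_decision(baseline, candidate_a))
print(release_decision(baseline, candidate_b))
```

Checking safety and regression before quality reflects the rule above: a candidate with a better headline score still does not ship if it breaks either of them.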
What to do when scores drop
The point of evals is to catch quality decay. When it happens, you investigate.
A simple investigation:
Step 1: Identify the failing cases. What specifically went wrong?
Step 2: Look for patterns. Are the failures clustered (similar inputs)? Or scattered (different types of inputs)?
Step 3: Diagnose.
- Clustered → likely a specific weakness (prompt issue, missing knowledge).
- Scattered → likely a general quality drop (model change, drift).
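A rough way to tell clustered from scattered failures is to count them by category. In this sketch, a failure set counts as clustered when one category holds at least 60% of the failures; that cutoff, the ticket data, and the function name `diagnose` are all assumptions for illustration.

```python
from collections import Counter


def diagnose(failures: list[dict], cluster_share: float = 0.6) -> str:
    """Label failures as clustered or scattered by expected category."""
    if not failures:
        return "no failures"
    counts = Counter(f["category"] for f in failures)
    top_category, top_count = counts.most_common(1)[0]
    if top_count / len(failures) >= cluster_share:
        return f"clustered in {top_category!r}"
    return "scattered"


# Hypothetical failing cases from two different eval runs.
clustered_failures = [
    {"id": 4, "category": "billing"},
    {"id": 9, "category": "billing"},
    {"id": 17, "category": "billing"},
    {"id": 22, "category": "billing"},
    {"id": 30, "category": "bug"},
]
scattered_failures = [
    {"id": 3, "category": "billing"},
    {"id": 11, "category": "bug"},
    {"id": 19, "category": "account-access"},
    {"id": 27, "category": "feature-request"},
]

print(diagnose(clustered_failures))
print(diagnose(scattered_failures))
```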
Step 4: Hypothesize the cause.
- Did the underlying model change recently? Check the provider's changelog.
- Did the prompt change recently? Revert and test.
- Did the input distribution change? Look at recent real data.
- Did the dataset go stale? Refresh examples.
Step 5: Test the fix. Make one change. Re-run the eval. Did it recover?
This systematic approach beats panic and guesswork.
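When testing a fix, the overall score can hide a trade: a change may repair some cases while breaking others. A per-case comparison makes that visible. The sketch below assumes both runs cover the same case IDs; the pass/fail values are invented and `compare_runs` is a hypothetical name.

```python
def compare_runs(before: dict[int, bool], after: dict[int, bool]) -> dict:
    """Compare pass/fail per case ID between two runs of the same eval set."""
    shared = before.keys() & after.keys()
    fixed = sorted(c for c in shared if not before[c] and after[c])
    newly_broken = sorted(c for c in shared if before[c] and not after[c])
    still_failing = sorted(c for c in shared if not before[c] and not after[c])
    return {"fixed": fixed,
            "newly_broken": newly_broken,
            "still_failing": still_failing}


# Hypothetical results: case ID -> passed?
before = {8: False, 12: True, 23: False, 35: True, 41: False}
after = {8: True, 12: True, 23: True, 35: False, 41: False}

diff = compare_runs(before, after)
print(diff)
```

A fix that leaves `newly_broken` empty is a recovery; one that swaps old failures for new ones needs another look even if the total score went up.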
Building the dataset over time
Your initial dataset is a starting point. Improve it over time by:
Adding real failure cases. When a real customer/user case produces a bad output, add it to the eval set. Now it's a regression test — you'll catch this specific failure if it happens again.
Pruning stale cases. As your workflow evolves, some test cases become irrelevant. Remove them.
Expanding coverage. If you notice your eval set has 20 "account-access" tickets and 1 "billing" ticket, the eval is over-indexed. Rebalance.
Adding edge cases as you find them. New customer complaint patterns, new product features, new categories.
A good eval dataset is alive — it reflects current reality, not historical reality.
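Checking coverage is a quick count of expected labels. The sketch below flags any category that makes up less than 10% of the eval set, including categories with no examples at all; the 10% floor, the category list, and the name `underrepresented` are assumptions for illustration.

```python
from collections import Counter


def underrepresented(labels: list[str], categories: list[str],
                     min_share: float = 0.1) -> list[str]:
    """Return categories whose share of the eval set is below min_share."""
    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for categories that never appear in the eval set.
    return sorted(c for c in categories if counts[c] / total < min_share)


# Hypothetical eval set: heavy on account-access, light on billing.
labels = ["account-access"] * 20 + ["billing"] * 1 + ["bug"] * 6
categories = ["account-access", "billing", "bug", "feature-request"]

print(underrepresented(labels, categories))
```

Passing the full category list matters: a category with zero examples would never show up if you only counted the labels already in the dataset.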
Common mistakes
A few patterns that cause eval programs to fail:
Mistake 1: Building a perfect eval before starting. A 50-example dataset with elaborate scoring is intimidating to build. A 10-example dataset with simple scoring is doable today. Start small. Iterate.
Mistake 2: Only evaluating happy paths. All-easy examples don't catch real failures. Include hard cases, edge cases, and known previously-failing cases.
Mistake 3: Eval set drift. Updating the eval set every time the workflow changes makes the score meaningless. The eval set should change rarely; the workflow can change more. The point is to measure the workflow, not the eval.
Mistake 4: Trusting the LLM judge blindly. LLM judges have biases. They often overweight surface features such as length and format. Calibrate the judge against human judgment regularly. If you disagree with the judge, the judge prompt needs work.
Mistake 5: Scoring without acting. Running evals weekly but never acting on the data is theater. The point is to catch and fix issues. If a drop doesn't trigger investigation, you're wasting your time.
Mistake 6: Evaluating just one dimension. "My eval shows 95% accuracy!" But maybe response quality has gotten worse, response time has slowed, or the hallucination rate has risen. Track multiple dimensions where they matter.
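For Mistake 4, calibration can start with a simple agreement check: have a human score a sample of outputs, then see how often the judge lands within one point of the human. The sketch below uses made-up scores on a 1-5 scale; the one-point tolerance and the name `agreement_rate` are assumptions, and more formal agreement statistics exist if you need them.

```python
def agreement_rate(human: list[int], judge: list[int],
                   tolerance: int = 1) -> float:
    """Share of outputs where judge and human differ by at most `tolerance`."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty score lists of equal length")
    close = sum(1 for h, j in zip(human, judge) if abs(h - j) <= tolerance)
    return close / len(human)


# Hypothetical scores for the same ten outputs.
human_scores = [4, 2, 5, 3, 1, 4, 3, 5, 2, 4]
judge_scores = [4, 3, 5, 5, 2, 4, 3, 4, 4, 4]

rate = agreement_rate(human_scores, judge_scores)
disagreements = [i for i, (h, j) in enumerate(zip(human_scores, judge_scores))
                 if abs(h - j) > 1]
print(f"agreement: {rate:.0%}, review outputs at positions {disagreements}")
```

The disagreement list is the useful part: reading those specific outputs usually shows which instruction in the judge prompt needs tightening.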
Tools that help (but aren't required)
If you want to graduate from spreadsheets, some accessible options:
Promptfoo. Open source, configurable through YAML, runs on your laptop or CI. Excellent for testing prompts and comparing them.
Braintrust. Hosted platform for evals with a nice UI. More expensive but powerful.
LangSmith. From the LangChain team; a natural fit if you're already using that ecosystem.
Helicone. Logging and analytics for LLM calls, with eval capabilities.
OpenAI Evals. Open source framework, more developer-focused.
For most non-engineer teams, a spreadsheet + ChatGPT/Claude is enough. Promptfoo is the easiest "real tool" if you want to graduate.
A 4-week eval program
For a team starting from zero, a realistic plan:
Week 1: Pick and define.
- Pick one workflow.
- Build a 20-example dataset.
- Define scoring (exact match, LLM judge, or properties).
Week 2: First baseline.
- Run the eval. Capture the baseline score.
- Identify any obvious failures.
- Don't change anything yet — just observe.
Week 3: Iterate.
- Make one change you think will improve quality.
- Re-run the eval.
- Did the score go up? Down? Same? Investigate why.
Week 4: Schedule.
- Schedule weekly runs.
- Document the eval process.
- Brief the team on what the scores mean and what triggers action.
After 4 weeks, you have a working eval. From there, expand: add more workflows to the eval program, deepen the dataset, refine the scoring.
The cultural shift
Evals require a cultural shift more than a technical one. Teams used to shipping AI workflows "because they seem to work" need to embrace measurement.
The shift involves:
Being willing to see numbers go down. Sometimes a change you were excited about hurts quality. Evals will tell you. You have to be willing to roll back.
Investing in calibration. The first month of a new eval, expect to spend time tuning — the dataset, the scoring, the prompts. It's an investment.
Building a "before/after" habit. Any non-trivial change to an AI workflow runs through the eval before going live. This becomes second nature.
Holding the line on quality. When scores drop, you fix or roll back. You don't ship with degraded quality just because of a deadline.
This cultural shift is the hardest part. Once it's in place, the tooling part is easy.
The takeaway
Evals are the difference between AI workflows you can trust over time and AI workflows that gradually drift into mediocrity unnoticed.
You don't need engineers, ML expertise, or fancy tools to start. You need a workflow worth measuring, a small dataset, a scoring method, and a weekly half-hour.
Pick one workflow this week. Build a 20-example eval. Run it. Look at the result. Run it again next week. Notice the discipline this builds.
In 6 months, the teams with evals will have AI workflows that have actually improved. The teams without will have workflows that look the same as they did 6 months ago — except worse.