Test cases: build the set that samples reality · The Builder's Stack

Define what good meansGrade the outputs

This is the week your product gets a real eval set, so you open a blank file and start inventing test inputs. Ten minutes later you have a tidy list: complete sentences, one question each, feature names spelled correctly. Then you open your live transcripts, and the first real conversation reads nothing like your list: a forwarded email thread pasted whole, two questions in one line, the product misspelled, a switch into another language halfway through. The list tests the product you imagined, and the transcripts test the one you shipped.

Why the demo is not the product argued that quality lives in the input distribution, and The quality bar: decide what good means pinned down what a pass means. This chapter builds the piece between them: where the cases come from, how you keep them representative, and how many you need.

Where the cases come from

A case is an input plus an expectation, and the inputs have a priority order.

Your live transcripts are not a sample of what your product faces. They are the real distribution of inputs your product gets, so they are the first place every case should come from.

Real usage first. The best hour you can spend on evals is reading your own transcripts with a copy buffer open, pulling out the inputs that went wrong, nearly went wrong, or arrived phrased in ways you would never have thought to invent.
Synthetic second. Your real cases will have gaps, maybe nothing from voice yet, nothing in Spanish, nothing from a furious user. Write inputs for the groups real usage has not covered yet and tag them as synthetic, because a synthetic case is a guess about reality rather than a sample of it, and you replace it as soon as a real one arrives.
Adversarial always. Hostile inputs show up whether you planned for them or not. The dealership chatbot that agreed to sell a Tahoe for one dollar met its hostile input in production, not in testing, so write your own ahead of time: the injection buried in pasted text, the abusive message, and the request that tries to pull a binding commitment out of your product.

Sort your cases by input type, user type, and stakes

Left alone, every case set we have seen fills up with easy inputs: short, polite, well-formed questions in your main language, the transcripts that are pleasant to read and quick to label. A set like that gets a high score and tells you almost nothing, and sorting your cases into groups is how you keep the hard ones in.

Axis	What it varies	Example slices
Input type	the form of what arrives	chat message, pasted email thread, voice transcript, form field, uploaded file
User type	who is asking	new vs returning, fluent vs not, calm vs already angry
Stakes	what a wrong answer costs	reword a sentence, quote a policy, touch money, health, or anything irreversible

Tag each case on all three axes, then count how many you have in each group rather than counting the total. An empty group is either a job to write a synthetic case or a deliberate, written decision not to cover it, and the high-stakes groups deserve more cases than their share of traffic would suggest, because a wrong answer there costs the most. If your set has too few of the inputs where a mistake is expensive, it measures the wrong thing, and a bigger total count does not fix that.

Keep two sets: a stable one and a changing one

Not every case should change at the same rate, so keep the set in two parts that each do a different job.

The golden set is your stable baseline. It stays small, fixed, and hand-checked: you have personally confirmed that each input is realistic and each expectation is correct. This is the set The regression gate: no change ships blind runs on every change, and it does that job by staying the same from week to week, because you can only compare scores over time if the test itself has not changed underneath them. So treat any edit to it as a decision you write down, since changing a case quietly also changes the baseline that all your old scores were measured against.
The living set takes in everything new. Every incident, odd transcript, thumbs-down, and support escalation becomes a candidate case here, with its expectation written while the context is fresh. The living set is allowed to change often, and once a case has kept the same expectation across several runs, you move it into the golden set.

Write the expectation next to the input

A case without an expectation is just a prompt, not a test. Write what a pass looks like next to the input, right when you collect it, while you still remember why the case was worth keeping. Your quality bar gives you the general standard; the expectation applies it to this one input and names the clearest thing the output could do that would count as a fail.

input:  "you charged me twice for order [ORDER-ID]?? need this fixed TODAY"
slices: chat | returning customer | high stakes (money)
pass:   acknowledges the duplicate-charge claim, confirms the order
        reference, states the verification step before any refund talk
fail:   commits to a refund amount or timeline before verification
source: real transcript, anonymized | added 2026-06

The format is meant to be dull, because the expectation is the part a grader will check and two trained people can agree on. When a judge model's verdict later disagrees with a human's, the sentences next to the input are what you hold both of them to.

How many cases you need

Twenty real cases with clear expectations beat zero, and they also beat a thousand unlabeled transcripts dumped in a folder. The thousand feel rigorous, but nobody hand-checks a thousand expectations, and a score measured against labels nobody verified is a number, not a real measurement. A small, labeled set is also cheap enough to run on every change, and running on every change is what catches a problem while it is still contained in a single code change.

So start small this week and let everyday production usage grow the set for you. Once you watch for them, live users send you new candidate cases all the time, and Production signals: evals after the ship covers how to collect them. A year from now, most of your cases will have come straight from your users; your job is to keep collecting them, one group at a time.

Try it now

Pull two weeks of real usage. Export the transcripts from wherever they live, logs, support tool, or database. If pulling them is hard, that difficulty is a finding on its own, and Monitoring, how you know it broke (and what it costs) is how you fix it.

Collect fifteen real cases across at least three groups. Copy each input only, never the surrounding conversation, and strip names, emails, and account numbers as you go. Tag every case with its input type, user type, and stakes, and keep reading transcripts until you have filled at least three groups.

Write the expected behavior next to each one. One or two sentences per case saying what a pass looks like, plus the one thing the output could do that would make it a fail. If you cannot write that, it means the bar for that group is not yet defined, so set the case aside and define the bar first.

Add five hostile cases you would hate to fail. The injection inside a pasted document, the request for a binding discount, the legal threat, the demand that your product speak for the company. Claude Code can script the export and the step that removes personal details, but choose the cases yourself, because deciding what counts as a fair picture of reality is a judgment your whole eval inherits.

Chapter Summary

A case is one input plus the expectation of what a good answer to it looks like.
Your real transcripts are the actual mix of inputs your product faces, so start every case set from them.
Add synthetic cases only to fill the gaps real usage has not reached yet, and replace each one once a real example arrives.
Always write hostile cases ahead of time, because injections, abuse, and traps for a binding commitment show up in production whether you planned for them or not.
Tag every case by input type, user type, and stakes, then count the cases in each group so the easy inputs do not crowd out the hard ones.
Give the high-stakes groups extra cases, since a wrong answer there costs the most.
Keep two sets: a small, fixed golden set you have hand-checked, and a living set that takes in every new failure and feeds the best cases back into the golden set.
Write the expected pass and the clearest fail right next to each input, while you still remember why the case mattered.
A small set of labeled cases you trust beats a huge pile of transcripts nobody has checked, and it is cheap enough to run on every change.
Next, you need something to read each output and call it pass or fail, which is the job of Graders: deterministic, judges, and humans.

Sources

Press reporting on a Chevrolet dealership chatbot agreeing to sell a Tahoe for one dollar (December 2023).
OpenAI, Evals, an open-source framework for evaluating language model applications (2023).
Anthropic, published guidance on writing useful evaluations (2023 onward).

Marks this chapter complete on your course map. Reaching the end does this for you.