Your support assistant looked great in the demo. It pulled up an order, quoted the refund policy in the right tone, and turned a messy ticket thread into a tidy summary. Whoever drove the demo typed each question, every answer came back clean, and the room agreed to ship. Weeks after launch, a customer pastes in a furious, rambling email containing order numbers from a live account and a cancelled one, and asks whether the discount in an attached screenshot still applies. The reply is warm, fluent, and wrong, and the customer posts both halves of the conversation.
Nothing in that demo was faked, but a sample only answers the question it was drawn to answer. A demo samples the happy path you already believed in. Production samples the whole distribution: the ordinary inputs, the malformed ones, and the adversarial tail.
An eval is how you sample production before production samples you, and this part teaches you to build one.
What a demo actually measures
Every input in a demo was chosen by somebody, and that somebody usually built the feature. During development you ran the system informally over and over, rephrasing and retrying whenever an output disappointed you. That loop trained you on which phrasings the system handles, so by demo day you steer around the weak spots without noticing you are steering. The demo measures the imagination and the trained reflexes of its author, not the product.
Production draws from a far wider pool with structure worth naming.
- Ordinary inputs. Typos, missing context, several questions in one message, a pasted email thread, the same job phrased every way imaginable. This is most of your volume and where quality quietly erodes.
- Malformed inputs. An empty message, the wrong language, an entire report pasted into a box built for a sentence, JSON where prose was expected. These are not hostile, only unfiltered.
- The adversarial tail. A small number of users who treat your text box as a puzzle to win, probing for whatever your product can be made to say or do.
The tail arrives uninvited. In December 2023, a visitor to a Chevrolet dealership's website prompted its chat assistant into agreeing to sell a new Tahoe for one dollar. The reply described the deal as a legally binding offer. Whatever the dealership tested before launch, it was not that conversation, and the screenshots traveled much further than the feature ever did. A demo never includes that customer, because the person writing a demo is not trying to lose.
What an eval actually is
The word eval carries a research aura it has not earned. Underneath sits a plain procedure.
- A question worth asking. The question is not "is the model good" but "does the product do the job you sold." Does the assistant resolve a billing question without inventing a policy? Does the summary keep every number intact? Measurement practice calls this validity, whether you are measuring the right thing, and it is the part teams skip most often.
- Cases. Inputs paired with a known expectation of a correct outcome, drawn from the distribution above rather than from the author's head: real logs, real tickets, the strangers and the tail included.
- A grader. The rule that decides pass or fail, whether an exact-match check, a checklist, a model applying a rubric, or a person. Whatever form it takes, the same output must receive the same verdict every time; that property is reliability, and without it your scores are noise.
- A bar. The score that means shippable, written down before the run, so nobody bends the definition of good to fit the result.
An A/B test measures live outcomes after you have shipped to users; an eval measures before anyone is exposed. Monitoring, how you know it broke (and what it costs) likewise reports what already happened. The eval is the only instrument in your kit that runs while changing course is still cheap.
Why "it seems good" fails
Most teams shipping a first AI feature do have a quality check, and it is a person looking at outputs and nodding. The failure is specific. As a measurement, "it seems good" has no stated question, so nobody can say afterward what was tested. It has no case set, so coverage is whatever someone typed that afternoon. Its grader is one person with every incentive to pass their own work, and its bar is a feeling that forms after seeing the output, which means the bar moves. Together those gaps leave you holding a mood instead of a result.
CNET ran the no-bar version of this in public. In January 2023, readers and reporters found errors in finance articles the outlet had drafted with an AI tool, and after review it issued corrections on a substantial share of them. Every one had seemed good to someone before it went out. The eval still happened, because the eval always happens; it simply ran in production, with readers as the graders and the publication's credibility as the cost.
The path through this part
This part sits in The Practice level beside the human factors work. The thread running through every chapter is validity first, deciding you are measuring the right thing, then reliability, making the measurement consistent enough to trust.
The quality bar: decide what good means starts there, defining good for your product before anything gets scored. Cases: build the set that samples reality turns logs, tickets, and the adversarial tail into a set that earns trust. Graders: deterministic, judges, and humans covers how a verdict gets decided and what each option costs. The regression gate: no change ships blind wires the eval into your shipping habit, so a prompt tweak cannot quietly break what worked last week. Production signals: evals after the ship connects the pre-ship instrument to what real users do. Goodhart's trap: when the metric becomes the target covers how good measures go bad once people optimize for them, and the capstone, Stand up your eval and make it the bar, has you build the whole instrument for your own product.
Try it now
This drill takes about 15 minutes and runs on your own product.
Pull ten real inputs. Open your transcripts or logs and copy the ten most recent user inputs, verbatim, skipping nothing because it looks weird; the weird ones are the point. If Claude Code can reach your logs or database, ask it to extract the last ten user messages exactly as received. If you have no users yet, take the last ten inputs you typed during your own testing and proceed anyway, knowing the sample shares an author with the product.
Mark what your demo never tried. Compare each input with what your demo and day-to-day testing actually used, and mark every one whose length, messiness, or kind of ask was never exercised.
Write one sentence. State what the marked rows say about what you have actually tested versus what you have assumed. Keep the ten inputs close, because they become the first rows of your case set later in this part.
Chapter Summary
- A demo only tests the inputs its author already thought of, so it measures your imagination, not the product.
- Production sends the full range: ordinary messy inputs, malformed ones, and a small adversarial tail that arrives without warning.
- An eval is how you test the product against that full range before real users do.
- Every eval has four parts: a clear question, real cases, a grader that gives the same verdict every time, and a bar set before the run.
- "It seems good" is not a real test: it has no fixed question, no fixed cases, a grader judging their own work, and a bar that moves to fit the result.
- The eval always happens. Your only choice is whether it runs before you ship or in front of users.
- The ten inputs you just pulled are the first rows of your case set.
- Next up is The quality bar: decide what good means, because no measurement can be better than its definition of good.
Sources
- Business Insider and other press reporting on a Chevrolet dealership chatbot agreeing to sell a Tahoe for one dollar as a legally binding offer (December 2023).
- CNET editors' note and The Verge reporting on corrections to AI-drafted finance articles (January 2023).
- AERA, APA, and NCME, Standards for Educational and Psychological Testing (2014), on validity and reliability.