Stand up your eval and make it the bar · The Builder's Stack

When metrics get gamedWhen AI takes action

A teammate asks whether last week's model upgrade left the product better or worse, and nobody in the room can answer, even though every piece of the answer exists. The bar from one drill is in a doc, the cases from another are in a file, a grader script sits on a branch, and a floor score is buried in a commit message. Each piece taught you something the week you built it, but a pile of pieces does not catch anything.

The bar, the cases, the graders, and the gate protect your product only once they run together as one instrument, on a schedule, against what you actually ship.

This chapter is the assembly. The whole instrument fits on one page, which ships as The Quality Bar, a fillable PDF in the artifacts library. Fill it as you read, keep the finished file beside the product, and give it a standing weekly run. The walk below fills it for a worked example, a customer-support drafting assistant that writes the reply a support agent edits and sends; your product runs through the same zones in the drill.

Write the bar and load the cases

The top zone holds the product decision from The quality bar: decide what good means, the job in one sentence, the load-bearing dimension, and tagged statements. The assistant's job sentence reads "a support agent hires this product to turn an open ticket into a reply they can send after one careful read." The load-bearing dimension is correctness, because an agent clearing a queue sends whatever looks right, and a confident wrong draft becomes a sent wrong reply. The bar becomes concrete in five statements.

Must-never: the draft offers a refund, credit, or deadline the order record does not support.
Must-never: the draft contains any detail from another customer's account.
Must-always: every factual claim names its source, the order field or the policy page it came from.
Must-always: the draft addresses every question in the customer's message, or names the one it cannot answer.
Target: the agent sends the draft unchanged or with light edits.

The case ledger beneath it carries twenty-five rows built by the sourcing rules of Cases: build the set that samples reality, each an input with its expectation and tags for input type, user type, and stakes. The golden fifteen hold still, covering the top intents (order status, duplicate charge, cancellation), the worst past failures, and the adversarial rows you would be embarrassed to fail, the pasted thread with a buried instruction, the demand for a binding discount. The living ten churn, absorbing last month's escalations and thumbs-downs, with a case promoted to golden once its expectation holds steady across runs.

Assign the graders and set the pass/fail gate

Every statement gets the cheapest grader that can be trusted with it, assigned up the ladder from Graders: deterministic, judges, and humans.

Deterministic checks take the mechanical statements. For the assistant, that means three scripts: the source block parses, the never-contains pass comes back clean (currency amounts absent from the order record, "legally binding", competitor names), and the draft fits the inbox length bound.
A model judge takes completeness and voice. It gets the rubric plus a graded pass, fail, and borderline, compares pairwise with order swapped, and returns verdicts the harness can parse. A person reads a sample every run, because an unaudited judge drifts while the dashboard stays smooth.
Humans keep a fixed weekly hour. It covers a few dozen recent drafts graded blind, then compared with the machine columns. When two trained readers disagree, the rubric is underspecified, and the fix is an edit to the spec, not an argument about the output.

The gate line wires the sheet into the change habit from The regression gate: no change ships blind. The golden fifteen run in minutes inside the merge check, on every behavior-shifting diff, prompt edits and retrieval tweaks included. The full set, judge and human spot-checks included, runs before each release and on every model upgrade, with versions pinned so upgrades arrive on your schedule. Yesterday's score is the floor, and a change that lands below it does not ship, however good the example that inspired it looked.

Feed it new cases and guard the metric you chase

The next zone sets up the loop that keeps feeding the eval real cases, the habit from Production signals: evals after the ship. The assistant's sheet names three signals to watch: the share of each draft the agent sends without edits, reopens on the same thread within days, and thumbs-downs on sent replies. A weekly half hour reads them, and one written rule keeps the set honest: any failure that reached a customer becomes a living-set case the same week, with its expectation recorded while the context is still fresh.

The bottom of the sheet pairs your headline metric with a guard against gaming it, the habit from Goodhart's trap: when the metric becomes the target. The target statement, drafts sent with light edits, is about to become a number people optimize, and Goodhart's law warns that a measure turned into a target stops measuring. Chased on its own, it drifts toward short, vague drafts that are easy to send and resolve nothing. So the sheet pairs it with the reopen rate as a counter-metric, and the two travel together in every report, because a send rate that climbs while reopens climb is gaming, not quality.

The first run usually finds a real bug

Expect the first end-to-end run to pay for the work right away. In the worked example, the first pass fails three golden cases on one statement: drafts replying to pasted threads have stopped carrying their source lines, a regression that traces back to a concise-tone prompt edit shipped weeks earlier on the strength of one happy example. No demo ever pasted a thread, so no demo could have caught it, which is exactly the blind spot Why the demo is not the product opened this part with.

Try it now

The stand-up is the drill, so budget about an hour if earlier drills left you their pieces, longer if you are filling gaps, and run it on the AI feature you ship today.

Fill the sheet. Work top to bottom: job sentence, load-bearing dimension, five tagged statements, the case ledger with slices and expectations, grader assignments, gate cadence, the production signals you read each week, and the metric you chase paired with its counter-metric. Every blank you cannot fill is work the part already assigned, so do it now rather than route around it.

Run the set end to end. Claude Code is the harness: point it at the cases file, the deterministic checks, the judge rubric with its worked examples, and your product's endpoint or prompt, then ask for a per-case verdict table with a one-line reason per row. Grade five outputs yourself, blind, before reading its columns, so the human rung is present from the first run.

Record the floor. Write the total on the sheet's gate line and commit it beside the cases. That number is what tomorrow's prompt tweak and next quarter's model upgrade must clear.

Write down the first catch. Somewhere in the verdicts is a failure your demo never would have shown you: a slice that quietly underperforms, a must-always that does not always hold, a judge verdict you disagree with. Name it on the sheet, turn it into a protected case or a rubric example, and put the weekly run on the calendar before you close the file.

Chapter Summary

This chapter assembles the four pieces you built in earlier drills into one working eval on a single page.
The bar, the cases, the graders, and the gate only protect the product once they run together, on a schedule, against what you ship.
The top of the sheet holds the product decision: the job in one sentence, the load-bearing dimension, and the tagged statements.
The case ledger holds a stable golden set plus a churning living set that absorbs each month's escalations and thumbs-downs.
Give every statement the cheapest grader that can be trusted with it: scripts for mechanical checks, a model judge for completeness and voice, and a fixed weekly human hour.
Wire in the gate so yesterday's score is the floor, and a change that lands below it does not ship, however good the example that inspired it looked.
Keep feeding the eval real cases, and pair the metric you chase with a counter-metric so a rising number cannot hide gaming.
Expect the first run to find a real bug your demo never would have shown you, then turn that failure into a protected case.
A bar you file and forget describes a product that no longer exists, so the only sheet that protects anything carries a fresh tick in its weekly run. The condensed rules live in the Playbook, and the sheet itself waits with the course's other tools in the artifacts library.
This eval judges the answers your product gives; its companion review of the experience around them is Run the human factors audit, worth the same standing weekly slot.

Sources

Zheng, L., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
Goodhart, C. A. E. (1975). Problems of monetary management: the UK experience. Papers in Monetary Economics, Reserve Bank of Australia.
Strathern, M. (1997). "Improving ratings": audit in the British university system. European Review, 5(3).
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1).
Anthropic, published guidance on writing useful evaluations (2023 onward).

Marks this chapter complete on your course map. Reaching the end does this for you.