The regression gate: no change ships blind · The Builder's Stack

A customer complained that your assistant's refund answers ramble, so one afternoon you add one sentence to the system prompt about keeping replies short, rerun the complaint that started it, and read an answer half the length and twice as clear. It is visibly better, and every instinct says merge it. What no single output can show you is the case further down your set, where the assistant must quote a refund deadline in full and the same instruction now trims the sentence that made the answer correct. The example proves the change can help, and only the set can show what the same change costs.

By this point in the part you own the instrument. "The quality bar: decide what good means" gave you a written definition of good, "Cases: build the set that samples reality" filled it with rows drawn from production, and "Graders: deterministic, judges, and humans" made the verdicts repeatable. The set pays off once you turn it into a ship rule, and that rule is the regression gate.

Run the set before any change can merge

Run your eval set on any change that could shift how the product behaves, and run it before the change merges, not after it ships. The bar is the score your current version already gets, and a new change has to match it or beat it.

So a change that scores below that bar does not ship, however good it looked in the one example that inspired it.
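As a minimal sketch of that rule, assume each run reduces to a count of passed cases over the same set. The function name `passes_gate` and the numbers are hypothetical and are not taken from any eval framework.

```python
def passes_gate(candidate_passed: int, floor_passed: int) -> bool:
    """Return True only if the candidate matches or beats the current bar.

    Both arguments are counts of passing cases on the same eval set.
    """
    return candidate_passed >= floor_passed


# Hypothetical numbers: the shipped version passes 46 of 50 cases.
floor_passed = 46
print(passes_gate(47, floor_passed))  # a change that beats the bar
print(passes_gate(46, floor_passed))  # a change that matches it
print(passes_gate(44, floor_passed))  # a change that falls below it
```

Comparing counts instead of percentages only works while both runs use the same set, so a real harness would also want to confirm the case list has not changed between the two runs.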

What counts as a behavior-shifting change is broader than instinct suggests.

A prompt edit. Even the smallest kind counts: the tone request, the added "be concise."
A model swap or upgrade. The swap counts whether you chose it or your provider shipped it for you.
A sampling change. Temperature and its neighbors widen or narrow the spread of outputs, and that spread is your product's behavior.
A retrieval tweak. Chunking, ranking, and how many passages reach the context all qualify, because the model answers from whatever retrieval hands it.
A tool description. Those sentences determine when each tool gets called, so rewording one redistributes actions across every conversation.

Each one looks small in the diff but lands everywhere, because every conversation flows through the same prompt, model, and retrieval path.
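One way to make that breadth enforceable is to fingerprint every setting that shapes behavior and require an eval run whenever the fingerprint changes. The sketch below assumes all of those settings can be gathered into one object. The `BehaviorConfig` fields, the model identifier, and the tool description are hypothetical placeholders for whatever your product actually uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class BehaviorConfig:
    # Hypothetical fields: list whatever shapes behavior in your product.
    system_prompt: str
    model: str
    temperature: float
    retrieval_top_k: int
    tool_descriptions: tuple[str, ...]


def fingerprint(config: BehaviorConfig) -> str:
    """Hash every behavior-shaping setting into one comparable string."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


shipped = BehaviorConfig(
    system_prompt="You answer refund questions.",
    model="example-model-2024-01-15",  # hypothetical identifier
    temperature=0.2,
    retrieval_top_k=4,
    tool_descriptions=("lookup_order: fetch an order by its ID",),
)

# A one-sentence prompt edit is enough to change the fingerprint.
edited = replace(shipped, system_prompt=shipped.system_prompt + " Be concise.")
needs_eval_run = fingerprint(edited) != fingerprint(shipped)
print(needs_eval_run)
```

A merge check could compare the fingerprint on the branch with the one recorded for the last gated run and refuse to proceed when they differ and no new run exists.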

The gate does not freeze the product. When a change clears the bar and raises the score, the higher score becomes the new bar. The bar works as a floor that only moves up, so quality climbs over time instead of drifting.

Treat a prompt change like a code change

In "Review what the AI built" we made the case that nothing the AI writes merges unread, because unread changes are where failures hide. Prompts deserve the same discipline and rarely get it. A prompt belongs in version control, changes through a reviewed diff, and uses the eval run as its test suite. Teams that would never merge unreviewed code will still hot-edit a prompt in a provider dashboard, because the edit looks like copywriting rather than engineering. A code change usually touches one path through your product, while a system prompt change touches every conversation at once, so a casual prompt edit is the single highest-leverage unreviewed change you can make.

Your provider can change the model without warning

Your product can change even when your code does not, because hosted models get updated. If your code calls a moving alias rather than a pinned version, an upgrade arrives whenever your provider ships one, with no diff anywhere to warn you.

Pin the version where the stakes justify it. Call a dated model identifier instead of a moving alias, so upgrades happen on your schedule.
Treat every upgrade as a change like any other. The new model runs the full set, and yesterday's bar still holds. Being newer, or stronger on public benchmarks, says nothing about the specific job your product does.
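A configuration check can flag a moving alias before it reaches production. The sketch below rests on a loud assumption: that pinned identifiers end in a date such as `2024-01-15` or `20240115`. Naming schemes differ by provider, so treat the pattern as a heuristic to adapt after reading your provider's versioning documentation. The model names shown are hypothetical.

```python
import re

# Heuristic assumption: a pinned identifier ends in a date suffix.
# Adjust the pattern to your provider's real naming scheme.
DATED_SUFFIX = re.compile(r"-\d{4}-?\d{2}-?\d{2}$")


def is_pinned(model_id: str) -> bool:
    """Guess whether a model identifier names a dated snapshot."""
    return DATED_SUFFIX.search(model_id) is not None


# Hypothetical identifiers, for illustration only.
for model_id in ("example-model-2024-01-15", "example-model-latest"):
    status = "pinned" if is_pinned(model_id) else "moving alias"
    print(f"{model_id}: {status}")
```

Running a check like this in the merge pipeline turns "we meant to pin it" into something a reviewer can see fail.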

DPD, the parcel carrier, showed what a missing gate looks like. In January 2024 a system update went out, and a customer then prompted its delivery chatbot into swearing and into composing a poem mocking its own company. The screenshots spread across social media, and DPD disabled the AI element of the chat. A regression gate is exactly what was missing here. The update changed behavior, that change met its first hostile user in production, and the bad result played out in public. Running the adversarial rows already in your set would have caught the same result in private, at the cost of one test cycle.

Run the golden set on every change, the full set on releases

Running judges and human review across the entire living set on every one-line edit would make the gate slow, and slow gates get skipped. The cadence that holds up has a fast lane and a thorough one.

The golden set, on every change. This is the small subset of must-pass rows: your top intents, your worst past failures, and your sharpest adversarial probes, all graded by your cheapest reliable graders. It runs in minutes, costs next to nothing, and lives inside the merge check, which makes it the default path.
The full living set, on releases and upgrades. Every row you have banked runs before release candidates, major prompt rewrites, and every model upgrade, with judge rubrics and human spot-checks included.
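The two lanes can be sketched as a selection step in front of the harness. This assumes each case carries a flag marking it as golden. The `Case` type and the trigger names are hypothetical and stand in for however your pipeline labels a run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    golden: bool  # True for must-pass rows


# Hypothetical trigger names for runs that warrant the full living set.
FULL_RUN_TRIGGERS = {"release", "model_upgrade", "prompt_rewrite"}


def select_cases(cases: list[Case], trigger: str) -> list[Case]:
    """Pick the golden subset for ordinary changes, every row otherwise."""
    if trigger in FULL_RUN_TRIGGERS:
        return list(cases)
    return [case for case in cases if case.golden]


cases = [
    Case("refund-deadline", golden=True),
    Case("adversarial-swearing", golden=True),
    Case("rare-shipping-question", golden=False),
]
print([c.case_id for c in select_cases(cases, "change")])
print([c.case_id for c in select_cases(cases, "model_upgrade")])
```

Defaulting unknown triggers to the golden set keeps the fast lane fast. A stricter team might default to the full set instead, at the cost of slower merges.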

Log each run's score beside the change that produced it. A floor only protects you if you know where it stood yesterday, and a score history turns "the product feels worse this month" into a question you can answer.
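A score history can be as plain as one JSON line per run. The sketch below assumes a local append-only file and a string that identifies the change, such as a commit hash. The function names and record fields are hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_run(log_path: Path, change_id: str, passed: int, total: int) -> dict:
    """Append one eval run to a JSON Lines history file and return the record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "change_id": change_id,  # e.g. a commit hash (assumption)
        "passed": passed,
        "total": total,
    }
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_history(log_path: Path) -> list[dict]:
    """Load every logged run, oldest first."""
    with log_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Calling `log_run(Path("eval_runs.jsonl"), "abc123", 46, 50)` after each gated run, with the file kept in the repo, gives you a history where the run that lowered the score sits next to the change that caused it.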

Try it now

This drill takes about half an hour and runs on your own product with the cases and graders you built earlier in the part. A ten-row starter set is enough if yours is still in progress.

Set the floor. Run the set against your product exactly as it ships today, recording a verdict per case and a total. Claude Code can be the harness, pointed at your cases file and your endpoint or prompt and asked for a per-case verdict table using your graders. That total is your floor, and it belongs in the repo next to the cases.

Make one deliberate improvement. Pick the prompt change you have been meaning to make, a tone fix or a sharper refusal rule, and apply only that change, so the comparison stays clean.

Re-run and diff the verdicts. Compare case by case, not total to total. List the rows that flipped in each direction, then find the case that got worse, because one nearly always does. An instruction that tightens one behavior tends to loosen another, and the per-case diff is the only place that trade becomes visible.

Decide the trade in writing. If the gains beat the losses and the total holds the floor, ship, and record one line on what improved, what regressed, and why the trade is right. If the total fell, revert and keep the finding as a new case to protect. Either way you have run your first gate, and your best passing score is now tomorrow's floor.
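The re-run-and-diff steps of the drill can be sketched as follows, assuming each run yields a mapping from case ID to a pass or fail verdict. The function names and case IDs are hypothetical. The code only reports flips and checks the total, and the judgment about whether the trade is right stays with you.

```python
def diff_verdicts(
    before: dict[str, bool], after: dict[str, bool]
) -> dict[str, list[str]]:
    """List the cases whose verdict flipped between two runs of the same set."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [c for c in shared if not before[c] and after[c]],
        "regressed": [c for c in shared if before[c] and not after[c]],
    }


def total_holds_floor(before: dict[str, bool], after: dict[str, bool]) -> bool:
    """Check that the new run passes at least as many cases as the floor run."""
    return sum(after.values()) >= sum(before.values())


# Hypothetical verdicts before and after adding a "keep replies short" line.
before = {"refund-ramble": False, "refund-deadline": True, "greeting": True}
after = {"refund-ramble": True, "refund-deadline": False, "greeting": True}

flips = diff_verdicts(before, after)
print(flips["fixed"])
print(flips["regressed"])
print(total_holds_floor(before, after))
```

In this example the totals tie, so a total-to-total comparison would wave the change through. The per-case diff is what surfaces the regressed refund-deadline row and forces the written decision.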

Chapter Summary

The regression gate turns your eval set from a report you read into a rule that decides what ships.
Run the set on any change that could shift how the product behaves, before it merges, not after it ships.
That includes more than prompt edits: model swaps and upgrades, sampling changes, retrieval tweaks, and tool descriptions all qualify.
Yesterday's best passing score is the floor, and a change that scores below it does not ship, no matter how good its one example looked.
A prompt change deserves the same care as a code change: keep it in version control, review the diff, and let the eval run be its test.
Your provider can update a hosted model without warning, so pin a dated model version where the stakes are high and run the full set on every upgrade.
Run the small golden set on every change so the gate stays fast, and run the full set on releases and upgrades.
Log each run's score next to the change that caused it, so "the product feels worse lately" becomes a question you can actually answer.
The gate cannot catch an input you never recorded, and production produces new ones every day, which is where "Production signals: evals after the ship" picks the work back up.

Sources

BBC News and Guardian reporting on DPD disabling its delivery chatbot's AI element after a system update led to swearing and a poem criticizing the company (January 2024).
OpenAI Evals, the open-sourced evaluation framework and registry (2023).
Anthropic's published guidance on writing useful evaluations for LLM-based products (2023 onward).
OpenAI and Anthropic model versioning and deprecation documentation on pinned snapshots versus moving aliases (2023 onward).
