The dashboard says your assistant has never been better. After weeks of prompt tuning, the judge's helpfulness score has climbed with every release, and this morning's run is the best on record. The same day, a cancellation email arrives from an early customer: the assistant used to answer in a sentence or two, and now every reply opens by restating the question, wades through caveats, and closes with an offer to elaborate. The transcripts confirm it: replies have run longer every week, the sentence the user needed sits deeper in each one, and the score you tuned against has been rising on exactly the padding that drove them away.
Economists named this failure long before AI products inherited it.
Goodhart's law: when a measure becomes a target, it stops being a good measure.
Your score was an honest signal while it only observed; the moment it started deciding what ships, it came under a pressure that observation never faces. No eval is exempt, including the one this part has you building and the ones we run ourselves.
Why a good measure goes bad
Nobody on your team cheated; the selection loop produced the failure on its own. "The regression gate: no change ships blind" gave your eval the deciding vote on every change, exactly as it should. From then on, every edit that raises the score survives and every edit that lowers it dies, whether or not users benefit. Iterate honestly under that rule for a month and the product drifts toward whatever the measure rewards, blind spots and biases included.
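The drift can be reproduced in a toy simulation. The sketch below is illustrative only: `user_value`, `judge_score`, the 40-word target, and the per-word bonus are all invented for the example, not measurements from any real judge. It applies the gate's rule (keep a change only if the score rises) to a judge that carries a small length bias.

```python
import random

TARGET_WORDS = 40  # assumption: the reply length that serves the user best


def user_value(words: int) -> float:
    """Hypothetical true quality: highest at the target length, lower as padding grows."""
    return 1.0 - abs(words - TARGET_WORDS) / 200.0


def judge_score(words: int) -> float:
    """Hypothetical judge: sees true quality plus a small bonus per word (verbosity bias)."""
    return user_value(words) + 0.008 * words


def run_selection_loop(steps: int, seed: int = 0) -> list[int]:
    """Propose random length changes and keep only those that raise the judge score."""
    rng = random.Random(seed)
    words = TARGET_WORDS
    history = [words]
    for _ in range(steps):
        candidate = max(5, words + rng.choice([-5, 5]))
        if judge_score(candidate) > judge_score(words):  # the gate: score must go up
            words = candidate
        history.append(words)
    return history


history = run_selection_loop(steps=100)
start, end = history[0], history[-1]
print(f"reply length: {start} -> {end} words")
print(f"judge score:  {judge_score(start):.2f} -> {judge_score(end):.2f}")
print(f"user value:   {user_value(start):.2f} -> {user_value(end):.2f}")
```

No step in the loop is dishonest. Each accepted change genuinely raised the score, and the sum of those changes is a longer reply that serves the user worse.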
How the trap arrives in an AI product
- Tuning for the judge instead of the user. "Graders: deterministic, judges, and humans" warned that judge models carry verbosity bias, scoring longer answers higher than equally correct shorter ones. Let that judge arbitrate a month of prompt changes and the bias becomes a direction: padded variants score a touch better, winners keep their padding, and the cancellation email's verbosity inflation arrives one approved diff at a time.
- Overfitting the golden set. Every red case gets a targeted fix. After enough cycles the fixes aim at those exact inputs rather than the behavior they were sampled to represent, while production moves on: new phrasings, new user types, new jobs. The set you built in "Cases: build the set that samples reality" was a sample of reality on the day you harvested it; treated as a list of puzzles to clear, it hardens into a curriculum your pipeline is hand-fitted to, and a perfect score there says the product passes the test, nothing more.
- The industry-scale cautionary tale. On public benchmarks, test questions leak into training data, so a model can score well on items it was effectively trained on, and results stop separating capability from contamination. The best-known leaderboards saturate, with scores bunched near the ceiling, and a benchmark that no longer discriminates mostly rewards being tuned for itself. The more weight a number bears, the harder it gets farmed, and a private eval earns no exemption by being private.
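One way to see whether the first form of the trap applies to your own judge is to audit it for length preference. The sketch below assumes you have collected pairs of answers that a human reviewer confirmed as equally correct, one short and one long, and had the judge score both. The `ScoredPair` type and the scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPair:
    """Judge scores for two answers a human rated as equally correct."""
    short_score: float
    long_score: float


def longer_win_rate(pairs: list[ScoredPair]) -> float:
    """Fraction of non-tied pairs in which the judge preferred the longer answer."""
    decided = [p for p in pairs if p.long_score != p.short_score]
    if not decided:
        return 0.5  # no evidence either way
    return sum(p.long_score > p.short_score for p in decided) / len(decided)


# Illustrative data only, not real judge output.
pairs = [
    ScoredPair(short_score=4.0, long_score=4.5),
    ScoredPair(short_score=4.0, long_score=4.0),  # tie, ignored
    ScoredPair(short_score=3.5, long_score=4.0),
    ScoredPair(short_score=4.5, long_score=4.0),
    ScoredPair(short_score=3.0, long_score=4.0),
]

rate = longer_win_rate(pairs)
print(f"judge preferred the longer answer in {rate:.0%} of decided pairs")
```

A rate near 50% suggests no length preference. A rate well above it, on a sample large enough to trust, suggests the judge will pull a month of tuning toward padding.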
What you don't measure is where it breaks
In February 2024, Google paused Gemini's image generation of people after historically inaccurate depictions were widely shared. A launch of that size sits behind more evaluation than most teams will ever run, which is what makes the lesson portable: the failure evidently arrived on a dimension without a blocking test in front of it. Optimizing pushes all the effort onto the dimensions you score, and the ones you don't score drift to whatever the tuning happens to produce.
So the trap has a second form: a wall of green checks looks like proof the product is fine, when all it proves is that the product passes the checks you happened to build. Whatever you leave unmeasured is exactly where the next failure shows up.
Defenses that keep the measure honest
None of these need new tooling; each costs you the comfort of a single number that only goes up.
- Hold out cases the optimization never sees. Seal a slice of your cases away from the tuning loop: never consulted for a prompt change, never opened in a debugging session, run only on release candidates. The reading that matters is the gap: a working set that climbs while the hold-out stays flat says you were fitting the set, not improving the product.
- Rotate so the set keeps surprising you. On a schedule, retire cases the product has passed for months and promote fresh harvests from the production loop you set up in "Production signals: evals after the ship". Rotation costs some week-to-week comparability, so make it a recorded, versioned change and expect the score to step when the set turns over.
- Pair every metric with a counter-metric. For each statement you optimize, add the statement its gaming would damage, and track them together: helpful and brief, thorough and fast, resolves the issue and invents no policy doing it. A gamed metric rises alone while a genuine improvement lifts the pair, turning the likeliest exploit into a visible regression.
- Schedule a north-star read that no metric mediates. Block a recurring hour to read raw production transcripts end to end, unfiltered by failure flags and unscored by any rubric, and write down whether you would stand behind each exchange. Every other reading arrives through an instrument that can be gamed; this one comes straight off the product, and it catches verbosity inflation, tone drift, and the unmeasured dimension while they are still cheap to fix.
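The hold-out defense is the easiest of the four to mechanize. The following is a minimal sketch, assuming each case has a stable string ID and each eval run yields a pass/fail result per ID. The function names, the 20% fraction, and the sample data are hypothetical. Hashing the ID keeps the split stable across runs without storing a separate list.

```python
import hashlib


def is_holdout(case_id: str, fraction: float = 0.2) -> bool:
    """Assign a case to the sealed hold-out by hashing its ID (stable across runs)."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def pass_rate(results: dict[str, bool], ids: list[str]) -> float:
    if not ids:
        raise ValueError("cannot compute a pass rate over zero cases")
    return sum(results[i] for i in ids) / len(ids)


def gains(before: dict[str, bool], after: dict[str, bool], case_ids: list[str]) -> tuple[float, float]:
    """Return (working-set gain, hold-out gain) between two eval runs."""
    working = [c for c in case_ids if not is_holdout(c)]
    sealed = [c for c in case_ids if is_holdout(c)]
    working_gain = pass_rate(after, working) - pass_rate(before, working)
    sealed_gain = pass_rate(after, sealed) - pass_rate(before, sealed)
    return working_gain, sealed_gain


# Illustrative data: tuning fixed every working case and changed nothing on the hold-out.
case_ids = [f"case-{i:03d}" for i in range(200)]
before = {c: i % 2 == 0 for i, c in enumerate(case_ids)}
after = {c: (before[c] if is_holdout(c) else True) for c in case_ids}

working_gain, sealed_gain = gains(before, after, case_ids)
print(f"working set: {working_gain:+.2f}   hold-out: {sealed_gain:+.2f}")
```

A working-set gain that the hold-out does not share is the signature of fitting the set. In practice the sealed results would be computed only on release candidates, so that the hold-out never informs a prompt change.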
The gauge can be reliable and still meaningless
Run the trap through this part's vocabulary and the diagnosis becomes precise. Nothing about Goodhart breaks reliability: the judge returns the same verdict on the same output, and the instrument stays in perfect working order. What breaks is validity. The number no longer tracks the job you wrote down in "The quality bar: decide what good means"; the gauge reads beautifully and means nothing. Validity is not something you settle once at design time. It erodes under the same optimization pressure the eval exists to steer. An eval is not an artifact you finish but an instrument you maintain, and the defenses above are the maintenance schedule.
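The distinction can be made concrete with two agreement checks. The sketch below uses raw agreement as a crude stand-in for both properties: the judge against its own rerun for reliability, and the judge against a human read of the same outputs for validity. The verdict lists are invented for illustration, and raw agreement is a simplification of how either property would be assessed formally.

```python
def agreement(a: list[bool], b: list[bool]) -> float:
    """Fraction of positions where two verdict lists agree."""
    if len(a) != len(b) or not a:
        raise ValueError("verdict lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Illustrative verdicts on the same six outputs (True = passes the bar).
judge_run_1 = [True, True, True, False, True, True]
judge_run_2 = [True, True, True, False, True, True]   # same judge, rerun
human_read = [True, False, False, False, True, False]  # reviewer judging the actual job

reliability = agreement(judge_run_1, judge_run_2)
validity = agreement(judge_run_1, human_read)
print(f"judge vs itself: {reliability:.2f}   judge vs human read: {validity:.2f}")
```

In this made-up sample the judge agrees with itself on every output and with the human on only half of them, which is the pattern described above: a consistent reading that has stopped tracking the job.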
Try it now
The drill takes about fifteen minutes and runs on the quality bar you wrote for your product.
Find the most gameable statement. Reread your bar one statement at a time, asking of each: if a system optimized this statement with no regard for the user, what is the cheapest way to satisfy it? Pick the statement with the cheapest exploit. For a sharper adversary, paste the bar into Claude Code and ask it to propose an exploit for every statement.
Write the exploit in one sentence. State how the statement gets satisfied while the user loses. "Cites a source for every claim" is satisfied by citing sources that do not support the claims; "answers the whole question" is satisfied at several times the needed length.
Add the counter-metric to the bar. Write the statement the exploit would trip and add it at the same level, keeping the pair adjacent so they are read together. If you cannot write one, the original statement is probably too vague to grade, which is a bar bug to fix first.
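Once the pair is on the bar, reading it together can be automated. This is a minimal sketch under stated assumptions: both scores are scaled so that higher is better (so "brief" is a brevity score, not a word count), and the names `MetricPair`, `classify_change`, and the 0.01 tolerance are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricPair:
    metric: str   # the statement being optimized
    counter: str  # the statement its cheapest exploit would damage


def classify_change(
    before: dict[str, float],
    after: dict[str, float],
    pair: MetricPair,
    tolerance: float = 0.01,
) -> str:
    """Read a metric and its counter-metric together (higher is better for both)."""
    metric_delta = after[pair.metric] - before[pair.metric]
    counter_delta = after[pair.counter] - before[pair.counter]
    if metric_delta > tolerance and counter_delta < -tolerance:
        return "possible gaming"  # the metric rose alone
    if metric_delta < -tolerance or counter_delta < -tolerance:
        return "regression"
    if metric_delta > tolerance or counter_delta > tolerance:
        return "improvement"
    return "no change"


pair = MetricPair(metric="helpful", counter="brief")
baseline = {"helpful": 0.80, "brief": 0.90}
padded = {"helpful": 0.88, "brief": 0.70}    # illustrative: longer replies, higher judge score
tightened = {"helpful": 0.85, "brief": 0.91}  # illustrative: better and no longer

print(classify_change(baseline, padded, pair))
print(classify_change(baseline, tightened, pair))
```

The design choice is that a rise in the optimized metric never counts as an improvement on its own. It only does so when the counter-metric holds or rises with it.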
Chapter Summary
- Goodhart's law: once a measure becomes the target you optimize, it stops being a good measure.
- The danger starts the moment your eval gets the deciding vote, because every change that raises the score now survives, whether or not users benefit.
- It shows up three ways: tuning to please the judge model, overfitting your fixed case set, and, at industry scale, public benchmarks getting gamed and saturating.
- Optimizing pushes all the effort onto the qualities you score, so the ones you don't score drift wherever the tuning takes them.
- A wall of green checks only proves the product passes the checks you built. Whatever you never measured is where the next failure shows up.
- Defend the measure with four habits: hold out cases the tuning never sees, rotate the set over time, pair each metric with a counter-metric, and regularly read raw transcripts with no score in between.
- In this part's terms, Goodhart is a validity failure with perfect reliability: the number stays consistent, it just no longer tracks the job.
- An eval is not something you finish; it is an instrument you keep maintaining.
- Next up is "Stand up your eval and make it the bar", which assembles every piece into the instrument your product ships against.
Sources
- Goodhart, C. A. E. (1975). Problems of monetary management: the UK experience. Papers in Monetary Economics, Reserve Bank of Australia.
- Strathern, M. (1997). "Improving ratings": audit in the British university system. European Review, 5(3).
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
- Research and press reporting on benchmark test-set contamination and leaderboard saturation (2023 to 2024).
- Google product updates and press reporting on the pause of Gemini's image generation of people (February 2024).
- Carmines, E. G., & Zeller, R. A. (1979). Reliability and Validity Assessment. Sage Publications.