Production signals: evals after the ship · The Builder's Stack

Two weeks after launch, the dashboards have nothing to report: error rate near zero, latency flat, the regression gate green all week. Then you read one transcript from late one night. A user asked whether their plan covered an upgrade, and the assistant cited a clause that does not exist. The user rephrased and got a different wrong answer, tried the exact wording from your pricing page, got a third, and left. Your infrastructure logged three completed requests and marked each one a success.

The chapter "Monitoring, how you know it broke (and what it costs)" warned of exactly this gap, a request that returns success while answering badly, and left the proper measurement of quality to this level. This chapter keeps that promise. "The regression gate: no change ships blind" guards the days you change something; production signals cover the other days, when the world moves against a product you left alone.

Instrument the signals users already send

Quality signals come in two kinds, split by who does the labeling work.

Explicit signals are votes: the thumbs under an answer, the report link, the feedback box. Ship them, since every click hands you a labeled transcript at almost no cost. But most users never vote, and those who do cluster at the extremes while the middle stays silent.

Implicit signals are behavior, produced whether or not you ask; you only have to log them.

The retry. The same user asks the same question again within minutes, rephrased or padded with context, which tells you the first answer did not do the job.
The mid-answer abandon. The user leaves while the output is still arriving, so the opening lines were enough to decide the rest was not worth the wait.
The edit to an accepted draft. How much of a reply survives the user's rewrite is a quality score the user computes for you.
The escalation. The user asks for a human, or a person quietly redoes the agent's work, and each one is a real request your product handled and got wrong.
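As one way to log the first of these, here is a minimal sketch of retry detection. The `Question` record, the five-minute window, and the 0.6 similarity threshold are all assumptions made for illustration; tune them against transcripts you have read rather than trusting the defaults.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from difflib import SequenceMatcher


@dataclass
class Question:
    # Hypothetical record shape; use whatever your logs already hold.
    user_id: str
    asked_at: datetime
    text: str


def is_retry(prev: Question, curr: Question,
             window: timedelta = timedelta(minutes=5),
             min_similarity: float = 0.6) -> bool:
    """True when curr looks like the same user re-asking prev within the window."""
    if prev.user_id != curr.user_id:
        return False
    gap = curr.asked_at - prev.asked_at
    if gap < timedelta(0) or gap > window:
        return False
    # Character-level similarity is a crude stand-in for "same question, rephrased".
    ratio = SequenceMatcher(None, prev.text.lower(), curr.text.lower()).ratio()
    return ratio >= min_similarity


def count_retries(questions: list[Question]) -> int:
    """Count consecutive question pairs, per user, that look like retries."""
    ordered = sorted(questions, key=lambda q: (q.user_id, q.asked_at))
    return sum(is_retry(a, b) for a, b in zip(ordered, ordered[1:]))
```

String similarity will miss a rephrase that shares few words and will flag some honest follow-ups, so treat the count as a pointer to transcripts worth reading and not as a verdict.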

The well-known public example is GitHub's research on Copilot: its researchers studied the acceptance rate of suggestions across thousands of developers and found that this one behavioral measure tracked how productive those developers reported feeling. Nobody was asked to grade anything; the measurement came out of an action people were taking anyway, logged at scale over time.

Read transcripts on a schedule

Signals say where to look; only the transcript says what happened. So make review a standing practice rather than a response to incidents, a half hour each week: pull recent conversations, weight the sample toward what your signals flagged, keep a few unflagged ones to learn what the signals miss, and read each against the bar from "The quality bar: decide what good means".

Transcripts are the most personal data you hold, so the review runs under rules.

Minimize. Read in the production tool where you can; when you must copy, take only the exchange the failure needs, never the user's whole history.
Anonymize. Names, emails, and account ids become stand-ins the moment text leaves production, before it reaches a notes file, a ticket, or a judge prompt.
Respect your own privacy page. If it promises a retention window, review copies age out on the same clock; if it never says people may read conversations to improve the product, add that sentence first. "The legal basics every public build needs" covers that page.

Sort failures into a handful of buckets

The first review sessions produce loose notes that say the answer made up a clause, struck an odd tone, or ignored the second question. Notes like that cannot be counted, tracked over time, or handed to a teammate, so a hundred of them give you a vague impression and nothing you can act on. The fix is a failure taxonomy, a handful of buckets. A starter set for assistant-style products:

Wrong facts. The output states something false about your product, your policies, or the world.
Missed the ask. Accurate sentences that do not do the job: the second question skipped, the deadline dropped.
Policy breach. A must-never crossed, such as an invented commitment or advice your product must not give.
Broken form. Usable content in unusable form: wrong format, length, language, or structure.
Off voice. The job done in a register you would not sign.

Force each failure into exactly one bucket, adding a new one only when several failures fit nowhere. The payoff is reliability: two reviewers should bucket the same transcript the same way, and when they do not, the definitions need sharper words, the same repair your rubrics got in "Graders: deterministic, judges, and humans".
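To check whether two reviewers do bucket the same way, a small comparison is enough. This sketch encodes the starter set as an enum and reports the agreement rate plus which bucket pairs get confused; the function name and the dictionary-of-labels input are assumptions.

```python
from collections import Counter
from enum import Enum


class Bucket(Enum):
    WRONG_FACTS = "wrong facts"
    MISSED_THE_ASK = "missed the ask"
    POLICY_BREACH = "policy breach"
    BROKEN_FORM = "broken form"
    OFF_VOICE = "off voice"


def compare_reviewers(labels_a: dict[str, Bucket], labels_b: dict[str, Bucket]):
    """Return (agreement rate, Counter of confused bucket pairs) over shared transcripts."""
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("the reviewers have no transcripts in common")
    disagreements: Counter = Counter()
    for transcript_id in shared:
        a, b = labels_a[transcript_id], labels_b[transcript_id]
        if a != b:
            # Sort the pair so (x, y) and (y, x) count as the same confusion.
            disagreements[tuple(sorted((a.value, b.value)))] += 1
    agreement = 1 - sum(disagreements.values()) / len(shared)
    return agreement, disagreements
```

The pair that shows up most often in the counter names the two definitions that most need sharper words.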

Turn every failure into a test case within the week

If you read failures but never turn them into tests, you are just keeping a diary, so the loop closes with a standing rule.

Every real production failure becomes an eval case within the week, with the input minimized and anonymized as you collect it and the expected result written while you still remember what the user needed.

File each one in the living set from "Cases: build the set that samples reality". From there the gate takes over and reruns the new case on every change, so the same failure cannot ship twice without a red row announcing it.

The deadline matters. An expectation you write months later is a guess about what happened, and a backlog of unconverted failures is where we have watched this loop die. Converting a case costs only a few minutes, so there is no honest excuse to let one wait.
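What a converted case holds can be sketched as a small record. The field names and the JSON-lines serialization are assumptions; use whatever format your living set already has. The seven-day limit is this chapter's within-the-week rule.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta


@dataclass
class EvalCase:
    # Hypothetical fields; map them onto your own living-set format.
    case_id: str
    failed_on: date      # the day the failure happened in production
    bucket: str          # exactly one taxonomy bucket
    input_text: str      # minimized and anonymized
    expectation: str     # what a passing answer must do

    def to_json_line(self) -> str:
        record = asdict(self)
        record["failed_on"] = self.failed_on.isoformat()
        return json.dumps(record, ensure_ascii=False)


def conversion_overdue(failed_on: date, today: date,
                       limit: timedelta = timedelta(days=7)) -> bool:
    """True when a failure has waited longer than the limit without becoming a case."""
    return today - failed_on > limit
```

Running `conversion_overdue` over your open review notes once a week is one cheap way to see the backlog before it grows.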

Track quality over time, not as a single score

Everything feeding this loop keeps changing, which is why the work never reduces to one good number.

Models change. Providers update hosted models and retire pinned versions on their own schedule, so the engine under your product changes on dates you do not choose.
Users change. Launch traffic looks nothing like month-six traffic, so the mix of inputs your cases were drawn from drifts away from the cases you have.
Content changes. Docs, policies, and the world your answers describe get rewritten, so an answer correct in March is stale by June even though nothing in your code changed.

So plot every reading: golden-set score per run, each signal's weekly rate, bucket counts per month. Look at the trend rather than any single point. A single bad day is usually noise, but a retry rate climbing gently for weeks under a green dashboard means nothing broke and the product is getting worse anyway. The chart shows you what no single day can.
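If you want a number to sit beside the chart, a least-squares slope over the weekly readings is one option. This is a sketch with a made-up function name, and a slope is no substitute for looking at the plot; it only separates a steady climb from a single spike.

```python
def slope_per_week(rates: list[float]) -> float:
    """Least-squares slope of evenly spaced weekly readings (change per week)."""
    n = len(rates)
    if n < 2:
        raise ValueError("need at least two weekly readings")
    mean_x = (n - 1) / 2
    mean_y = sum(rates) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(rates))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance
```

A retry rate that rises by a fifth of a point every week gives a clearly positive slope, while one bad week in the middle of a flat series gives a slope of about zero, which matches the difference between drift and noise described above.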

Try it now

This drill takes about an hour on your live product and stands up the whole loop in miniature.

Wire one implicit signal you do not have today. Pick the one closest to your riskiest moment: retries for question answering, edit survival for anything that drafts, escalations for an agent. Have Claude Code add it as an event carrying a timestamp, a session id, and the outcome, storing no new copy of conversation content, so the signal stays cheap and your privacy page stays true.

Review twenty recent transcripts against the taxonomy. Take consecutive conversations rather than ones you remember, mark each pass or fail against your bar, and give every failure exactly one bucket, sharpening a definition whenever you hesitate. The tally is the first point on your time series.

Convert the three worst failures into eval cases. Rank them by stakes rather than embarrassment. Record each one's minimized, anonymized input and the expectation a pass must meet, in your living-set format. Run them against today's build: a fail confirms the bug is live, and a pass is still worth keeping, because a failure that happens only sometimes is still a failure. Either way, the cases ride the next gate run.
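For the first two steps of the drill, a sketch of the shapes involved: an event that carries only a signal name, a session id, an outcome, and a timestamp, and a tally of twenty review marks. All names here are hypothetical, and the event deliberately has no field for conversation content.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class SignalEvent:
    signal: str       # e.g. "retry", "edit_survival", "escalation"
    session_id: str
    outcome: str
    occurred_at: str  # ISO 8601 timestamp in UTC


def make_event(signal: str, session_id: str, outcome: str,
               now: Optional[datetime] = None) -> SignalEvent:
    """Build an event; `now` is injectable so the function is easy to test."""
    moment = now or datetime.now(timezone.utc)
    return SignalEvent(signal, session_id, outcome, moment.isoformat())


def to_log_line(event: SignalEvent) -> str:
    return json.dumps(asdict(event), sort_keys=True)


def tally(review_marks: list[Optional[str]]) -> dict:
    """Summarize review marks, where None is a pass and a string is the failure's bucket."""
    buckets = Counter(mark for mark in review_marks if mark is not None)
    failed = sum(buckets.values())
    reviewed = len(review_marks)
    return {
        "reviewed": reviewed,
        "failed": failed,
        "fail_rate": failed / reviewed if reviewed else 0.0,
        "buckets": dict(buckets),
    }
```

The dictionary `tally` returns is the first point on your time series; store it with the date of the review so next week's point has something to stand next to.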

Chapter Summary

The eval you built before launch only stays honest while live production keeps feeding it new evidence.
Green dashboards can hide bad answers, because a request can return success while the answer it gave was wrong.
Quality signals come in two kinds: explicit votes like thumbs and reports, and implicit behavior like retries, abandons, edits, and escalations that you only have to log.
Treat every signal as a rough proxy until you read the transcripts behind it and learn what share are real failures against your bar.
Read transcripts on a fixed schedule, weighting toward what the signals flagged, and stay inside the privacy promises your product makes.
Sort failures into a few buckets so they can be counted, and force each failure into exactly one bucket.
Turn every real failure into an eval case within the week, while you still remember what the user needed, so the same failure cannot ship twice.
Plot every reading and watch the trend, since models, users, and content all keep changing under a product you left alone.
Next is "Goodhart's trap: when the metric becomes the target", which covers how to keep your score honest once it starts deciding what ships.

Sources

Ziegler, A., et al. (2022). Productivity assessment of neural code completion. ACM SIGPLAN International Symposium on Machine Programming.
Ziegler, A., et al. (2024). Measuring GitHub Copilot's impact on productivity. Communications of the ACM.
OpenAI, Evals, an open-source framework for evaluating language model applications (2023).
Anthropic, published guidance on writing useful evaluations (2023 onward).
