AI-Native PM

The quality bar: decide what good means

Your support bot answers a refund question, and two of you read the same reply. The refund math is right, the policy quote is accurate, and the closing paragraph apologizes three times in a register no human on your team would use with a customer. One of you marks it shippable because the facts check out; the other blocks it because no customer should receive it. This is not a quality argument; you are discovering that nobody decided what good means for this product, and any eval you build before deciding will automate the disagreement instead of resolving it.

"Why the demo is not the product" made the case for judging your feature against real inputs instead of a rehearsed demo. Before you collect a single case, you need the thing every case is scored against, and that thing is a product decision before it is a technical artifact. It fits on one page, needs no code, and the cases, graders, and gates in the rest of this part exist to enforce it.

The job decides what good means

An output is never good on its own. It is good only relative to the job your product was hired to do, so the bar starts with naming that job in one sentence.

  • A support bot is hired to resolve the customer's question correctly without a human in the loop.
  • A meeting summarizer is hired to produce decisions and action items someone can forward without editing.
  • A contract assistant is hired to flag the clauses that put the signer at risk before signature.

One model can sit behind all three products and be held to three different definitions of good, so no vendor or leaderboard can hand you your bar. The job sentence also settles ownership: the bar belongs to whoever owns the product, not whoever wrote the prompt. Prompts get rewritten and models get swapped; the bar has to survive both, because it is the instrument that judges the rewrite and the swap. If only the prompt's author defines success, the definition drifts toward whatever the current prompt already does well, and the eval ends up rewarding the code you have instead of testing the job.

Name the dimension your product lives on

Break "good" down into separate qualities you can judge one at a time.

  • Correct. The facts, numbers, and actions in the output are right.
  • Complete. It does the whole job: every part of the question, every required field, the one action item that mattered.
  • Safe. It stays inside your policies and the law, even when the input is hostile.
  • On-voice. It reads like your product, in register and terminology.
  • Fast. It arrives within the patience the moment allows, which differs between a chat reply and an overnight digest.

Every team wants all five, but they trade against each other: more complete usually means slower, and safer often means less complete. The discipline is to name the one load-bearing dimension, the one whose failure ends the user's relationship with the product. A support bot lives on correctness, because a wrong refund answer costs money and trust while a stiff sentence is survivable. A writing assistant lives on voice, because a draft that sounds like someone else is useless even when every fact checks out. Naming one dimension does not abandon the other four; it decides which one gets the strictest statements, the most cases, and the tie-breaking vote when they collide.

Write the bar at three levels

With the job and dimension on the page, write the bar as short statements, each tagged with one of three levels.

  • Must-never. One occurrence is an instant fail that blocks the release regardless of everything else: the reply quotes a price that is not in the catalog, reveals another customer's data, invents a policy. These statements encode the harms from your riskiest flows.
  • Must-always. Every output carries these or it is a bug: the answer cites the passage it came from, stays under the length limit, offers the human handoff where policy requires one.
  • Target. This is the level you raise over time and where quality competes: the reply resolves the issue without escalation, the draft needs no edits before sending. Track the pass rate and ratchet it upward release by release; a single target miss blocks nothing.

We keep the whole bar on one page, laid out like this.

Each statement on that page eventually becomes a check, which is the work of "Graders: deterministic, judges, and humans." If you cannot picture how you would grade a statement, that is the first sign its wording is not finished.
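To make the three levels concrete, here is a minimal sketch of how graded results might roll up into a release decision. It is an illustration under assumptions, not a prescribed implementation: the names `Statement`, `Level`, `BAR`, and `summarize`, the statement ids, and the shape of the results are all hypothetical, and the statement wording is borrowed from the examples above.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MUST_NEVER = "must-never"
    MUST_ALWAYS = "must-always"
    TARGET = "target"


@dataclass(frozen=True)
class Statement:
    id: str
    level: Level
    text: str


# Hypothetical bar for a support bot; ids and wording are illustrative only.
BAR = [
    Statement("price-in-catalog", Level.MUST_NEVER,
              "The reply quotes a price that is not in the catalog."),
    Statement("cites-passage", Level.MUST_ALWAYS,
              "The answer cites the passage it came from."),
    Statement("no-escalation", Level.TARGET,
              "The reply resolves the issue without escalation."),
]


def summarize(bar, results):
    """Roll per-output verdicts up into a release report.

    results: one dict per output, mapping statement id -> True if the
    output passed that statement. For a must-never statement, passing
    means the harm did not occur.
    """
    report = {"blocked": False, "bugs": [], "target_pass_rate": {}}
    for s in bar:
        verdicts = [r[s.id] for r in results]
        failures = verdicts.count(False)
        if s.level is Level.MUST_NEVER and failures:
            # One occurrence blocks the release regardless of everything else.
            report["blocked"] = True
        elif s.level is Level.MUST_ALWAYS and failures:
            # Any miss is recorded as a bug to fix.
            report["bugs"].append(s.id)
        elif s.level is Level.TARGET:
            # Targets are tracked as a rate; a single miss blocks nothing.
            rate = verdicts.count(True) / len(verdicts) if verdicts else 0.0
            report["target_pass_rate"][s.id] = rate
    return report


# Two graded outputs: the second misses a must-always and a target.
results = [
    {"price-in-catalog": True, "cites-passage": True, "no-escalation": True},
    {"price-in-catalog": True, "cites-passage": False, "no-escalation": False},
]
report = summarize(BAR, results)
```

In this sketch the release is not blocked, `cites-passage` is logged as a bug, and the target pass rate for `no-escalation` is the number you would try to raise release by release.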

Write statements two readers score the same way

Two ideas from measurement run through this whole part: validity is whether you measure the right thing, and reliability is whether you measure it the same way every time. You need both, and having one does not give you the other. A length check is perfectly consistent and says nothing about whether customers got correct answers, while "the reply is helpful" aims at the job and returns a different verdict from every reader. Your job sentence and load-bearing dimension carry validity; the wording of each statement carries reliability, and reliability has a cheap test.

Hand the same outputs and spec to two careful readers. When verdicts differ, the tempting move is to debate who read the output correctly; the correct move is to treat the statement as underspecified and fix the words, not the people.

  • Underspecified: "Replies are professional." Two readers split on the same breezy reply.
  • Scorable: "Replies contain no slang or exclamation marks and address the customer by name."
  • Underspecified: "The summary is accurate."
  • Scorable: "Every decision in the summary appears in the transcript, and every action item in the transcript appears in the summary."
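One sign that the scorable versions are finished is that they translate almost directly into checks. The sketch below shows that translation for the two scorable statements above. Everything in it is an assumption made for illustration: the function names, the tiny `SLANG` list (a real one would come from your style guide), and the premise that decisions and action items have already been extracted and normalized into comparable strings.

```python
import re

# Hypothetical slang list for illustration only.
SLANG = {"gonna", "wanna", "lol", "yeah", "nope"}


def reply_is_professional(reply: str, customer_name: str) -> bool:
    """No slang, no exclamation marks, and the customer is addressed by name."""
    words = set(re.findall(r"[a-z']+", reply.lower()))
    has_slang = bool(words & SLANG)
    has_exclamation = "!" in reply
    # Whole-word, case-insensitive match on the customer's name.
    name_pattern = rf"\b{re.escape(customer_name)}\b"
    addresses_by_name = re.search(name_pattern, reply, re.IGNORECASE) is not None
    return not has_slang and not has_exclamation and addresses_by_name


def summary_is_accurate(summary_decisions, summary_actions,
                        transcript_decisions, transcript_actions) -> bool:
    """Every summary decision is in the transcript, and every transcript
    action item is in the summary. Assumes items are pre-extracted,
    normalized strings."""
    no_invented_decisions = set(summary_decisions) <= set(transcript_decisions)
    no_missing_actions = set(transcript_actions) <= set(summary_actions)
    return no_invented_decisions and no_missing_actions
```

The underspecified versions have no equivalent: there is nothing to write in the body of a function called `is_professional` until someone decides what the word means.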

The scorable versions are wordier, and the extra words remove the judgment calls that made verdicts diverge. There are formal statistics for rater agreement, but at your stage it is enough to count the disagreements and treat each one as a fix to make to the spec. A spec two readers apply identically is also one a grader can enforce and a new teammate can inherit.
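Counting disagreements needs nothing more than a comparison of two verdict sheets. The sketch below assumes a simplified, hypothetical record format: each reader maps an output id to the id of the statement that output failed, or `None` for a pass, with at most one failing statement per output. The function names and ids are illustrative.

```python
from collections import Counter


def disagreements(reader_a, reader_b):
    """Return (output_id, verdict_a, verdict_b) for every output the two
    readers scored differently. A verdict is a failing statement id, or
    None for a pass."""
    diffs = []
    for output_id in sorted(reader_a.keys() & reader_b.keys()):
        a, b = reader_a[output_id], reader_b[output_id]
        if a != b:
            diffs.append((output_id, a, b))
    return diffs


def statements_to_fix(diffs):
    """Count how often each statement was involved in a disagreement."""
    counts = Counter()
    for _, a, b in diffs:
        for statement_id in {a, b} - {None}:
            counts[statement_id] += 1
    return counts


# Hypothetical verdict sheets from two readers over three outputs.
reader_a = {"out-1": None, "out-2": "professional-tone", "out-3": None}
reader_b = {"out-1": None, "out-2": None, "out-3": "cites-passage"}

diffs = disagreements(reader_a, reader_b)
to_fix = statements_to_fix(diffs)
```

Here the readers split on two of three outputs, and `to_fix` names the two statements whose wording needs another pass. The list is a to-do list for the spec, not a score for the readers.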

Public benchmarks are not your bar

When a new model tops a leaderboard, someone will ask why you keep a private bar at all. Benchmark suites like HELM score general capability across many scenarios at once. A public benchmark measures the model's general job; your eval measures your product's specific one. No public suite contains your refund policy, your voice, or the off-script inputs your users send, so a model can climb a leaderboard and still fail your must-nevers on day one. Use benchmarks to shortlist models worth trying and your own bar to decide whether any of them ships.

Try it now

This drill takes about twenty minutes and a favor from a colleague, and it produces the page this part builds on.

  • Write the job in one sentence. Finish "A user hires this product to..." for the AI feature you have in front of users. If the sentence needs two "and"s, you are writing two bars, so start with the riskier flow.
  • Name the load-bearing dimension. Pick one of correct, complete, safe, on-voice, or fast, and add a line on what its failure costs you.
  • Write five statements and tag each one. Use must-never, must-always, or target, in words a new hire could apply without asking you anything.
  • Run the two-reader test. Pull three real outputs from your logs, or generate three from real inputs. Score them against your spec, then hand the outputs and the spec to someone who did not write it and collect verdicts: pass or fail per output, with the failing statement named.
  • Fix the words, not the people. Every disagreement marks an underspecified statement. Rewrite it and rerun until the verdicts match.

Chapter Summary

  • The quality bar is your written definition of "good" for this product. It is a product decision, not a technical one, and it fits on a single page.
  • "Good" always depends on the job your product was hired to do, so start the bar by writing that job in one sentence.
  • Break "good" into separate qualities (correct, complete, safe, on-voice, and fast) and judge each one on its own.
  • Those qualities trade off against each other, so name the one that matters most: the quality whose failure would end the user's relationship with the product.
  • Write the bar as short statements at three levels: must-never (one slip blocks the release), must-always (every output needs it), and target (the level you raise over time).
  • Reserve must-never for harms you would pull the feature over. Inflate it and a red result stops meaning "stop."
  • Write each statement so two people grade the same output the same way. When two readers disagree, fix the words, not the people.
  • Public benchmarks measure a model's general ability; only your bar measures your product's specific job.
  • Next up is "Cases: build the set that samples reality," which builds the input side.

Sources

  • Liang et al., Holistic Evaluation of Language Models (HELM), Stanford Center for Research on Foundation Models (2022), on benchmarks scoring general capability across many scenarios.
  • Cohen, A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement (1960), the formal statistic for inter-rater agreement.
  • Carmines and Zeller, Reliability and Validity Assessment, Sage Publications (1979), on validity and reliability as separate requirements.
  • Christensen, Hall, Dillon, and Duncan, Know Your Customers' "Jobs to Be Done", Harvard Business Review (2016), on defining products by the job users hire them for.