AI-Native PM

Graders: deterministic, judges, and humans

You changed one line of the prompt, the case set you built in "Cases: build the set that samples reality" is loaded, and the only thing between you and a verdict is grading the run. You read the first output carefully; around the thirtieth the summaries blur; by the fiftieth you are skimming for keywords. When you check this run against last week's, you find the same unchanged output marked pass then and fail now. Nothing in the product changed between those two verdicts. What changed was the grader, because the grader was a tired person whose standard drifts as the hours wear on.

Grading is measurement, so it needs instruments steadier than your attention at the end of a long day.

Code, a second model, and your own team can each hold that instrument, and they stack into a ladder ordered by cost and by the work each rung needs before you can trust its verdicts.

Let code grade everything code can grade

The bottom rung is the deterministic check, a few lines of code that inspect an output and return the same verdict every time. It costs nothing per run, holds the four hundredth output to the same standard as the first, and never gets bored. Builders skip this rung because it sounds too simple to matter, yet it routinely catches more real failures than anything above it.

  • Schema-valid. The output parses as the structure your code expects, with every required field present. An output your parser rejects fails before quality is even a question.
  • Contains and never-contains. The refund reply names the actual policy window; the support answer never contains a competitor's name, a payout promise, or dosage advice. Banned content lives here as a list of patterns that mean automatic failure.
  • Length bounds. The notification fits the lock screen it ships to, and the summary is shorter than the document it summarizes.
  • Exact match where exactness is the job. Classification, extraction, and routing have one right answer per case, so the grader is a string comparison and nothing more.

Walk through the bar you wrote in "The quality bar: decide what good means" and translate every mechanically checkable "must" into one of these checks. What survives that pass (tone, completeness, faithfulness to a source) is genuinely a matter of judgment and belongs to the next rung.
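
The four kinds of check above fit in one short script. The sketch below is a minimal illustration, assuming the product returns JSON; the field names, patterns, and length limit are hypothetical placeholders, and the real values come from your own quality bar.

```python
import json
import re

# Hypothetical values for illustration; take the real ones from your quality bar.
REQUIRED_FIELDS = {"reply", "policy_window_days"}
MUST_CONTAIN = [re.compile(r"\b30[- ]day\b", re.IGNORECASE)]
NEVER_CONTAIN = [
    re.compile(r"\bguaranteed? (refund|payout)\b", re.IGNORECASE),
    re.compile(r"\bdosage\b", re.IGNORECASE),
]
MAX_REPLY_CHARS = 600


def check_output(raw: str) -> list[str]:
    """Return the list of failed checks; an empty list means pass."""
    # Schema-valid: the output must parse and carry every required field.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["schema: not valid JSON"]
    if not isinstance(data, dict):
        return ["schema: top level is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return [f"schema: missing {sorted(missing)}"]

    failures = []
    reply = str(data["reply"])
    # Contains: every required pattern must appear.
    for pattern in MUST_CONTAIN:
        if not pattern.search(reply):
            failures.append(f"contains: missing {pattern.pattern}")
    # Never-contains: any banned pattern is an automatic failure.
    for pattern in NEVER_CONTAIN:
        if pattern.search(reply):
            failures.append(f"never-contains: found {pattern.pattern}")
    # Length bounds.
    if len(reply) > MAX_REPLY_CHARS:
        failures.append(f"length: {len(reply)} > {MAX_REPLY_CHARS}")
    return failures


def exact_match(predicted: str, expected: str) -> bool:
    """Grader for classification, extraction, and routing cases."""
    return predicted.strip().lower() == expected.strip().lower()
```

Returning the list of failed checks instead of a bare boolean keeps the reason next to the verdict, which helps when a check itself turns out to be the bug.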

A model judge needs a rubric to be worth anything

For the qualities code cannot reach, a second model can sit in the grader's seat and score thousands of outputs in minutes. Asked to rate answers from one to ten with no other instruction, it returns confident noise, numbers with no stable meaning, so the rule on this rung is that the judge always has a rubric to work from. The real grader is the rubric plus the worked examples, and the model is only the engine that applies them. Hand it the exact rubric from your quality bar along with a graded pass, fail, and borderline case with the reasoning written out, then require a verdict format your code can parse.
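
One way to hold the judge to a rubric is to build its prompt from the rubric and worked examples and to reject any reply that does not parse. The sketch below shows only the prompt assembly and the verdict parsing; the rubric text, the examples, and the JSON verdict format are hypothetical, and the call to the judge model itself is left out because it depends on the provider you use.

```python
import json

# Hypothetical rubric and worked examples; copy the real ones from your quality bar.
RUBRIC = "Pass only if every claim in the summary is supported by the source document."
EXAMPLES = [
    {"output": "Summary repeats the source's three findings.",
     "verdict": "pass",
     "reasoning": "Every claim traces to a sentence in the source."},
    {"output": "Summary adds a launch date the source never mentions.",
     "verdict": "fail",
     "reasoning": "The launch date is unsupported."},
    {"output": "Summary rounds 48.7% to 'about half'.",
     "verdict": "pass",
     "reasoning": "Borderline: rounding keeps the meaning, so it passes."},
]


def build_judge_prompt(rubric: str, examples: list[dict], output: str) -> str:
    """Assemble the rubric, the worked examples, and the output to grade."""
    lines = ["Grade the output against this rubric.", "", "Rubric:", rubric,
             "", "Worked examples:"]
    for ex in examples:
        lines.append(f"- Output: {ex['output']}")
        lines.append(f"  Verdict: {ex['verdict']}")
        lines.append(f"  Reasoning: {ex['reasoning']}")
    lines += ["", "Output to grade:", output, "",
              'Reply with JSON only: {"verdict": "pass" or "fail", "reason": "<one line>"}']
    return "\n".join(lines)


def parse_verdict(raw: str) -> dict:
    """Parse the judge's reply; anything off-format is an error, never a pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "error", "reason": "judge reply was not JSON"}
    if not isinstance(data, dict) or data.get("verdict") not in ("pass", "fail"):
        return {"verdict": "error", "reason": "judge reply had no valid verdict"}
    return {"verdict": data["verdict"], "reason": str(data.get("reason", ""))}
```

Treating an unparseable reply as its own "error" verdict keeps a malformed judge response from silently counting as a pass or a fail.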

Once it is running, treat the judge as an instrument with known error, because studies of judge models report a few consistent biases.

  • Position. In side-by-side comparisons, verdicts tilt toward an answer because of where it appears, most often the first slot.
  • Verbosity. Longer answers outscore shorter ones of equal quality.
  • Self-enhancement. Text resembling the judge model's own writing scores higher, which matters when the same model family produces and grades.

The same research supplies the mitigations. Run pairwise comparisons in both orders and keep only verdicts that survive the swap, since a verdict that flips with position is bias, not signal. Anchor every scale to worked examples so the judge grades against your definition of good rather than generic polish. And read a small sample of its verdicts against the transcripts every run, logging where you disagree.
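
The order swap can be sketched as a small wrapper around any pairwise judge. Here `judge` is a hypothetical callable that takes the two answers in display order and returns "first" or "second"; the two stub judges exist only to show the behavior and stand in for a real model call.

```python
from typing import Callable, Optional

# Hypothetical interface: judge(first, second) returns "first" or "second".
Judge = Callable[[str, str], str]


def swap_consistent_winner(judge: Judge, a: str, b: str) -> Optional[str]:
    """Return "a" or "b" if the verdict survives the swap, else None."""
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    # A verdict that flips with position is discarded as bias.
    return winner_forward if winner_forward == winner_backward else None


def always_first(first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def content_judge(first: str, second: str) -> str:
    """Stub judge that looks only at content."""
    return "first" if "30-day" in first else "second"
```

With answers `a = "Returns are accepted within the 30-day window."` and `b = "Returns are fine whenever."`, the position-biased stub yields no verdict at all, while the content stub picks `a` in both orders.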

Keep human grading small and on a schedule

The top rung is you and whoever builds the product with you, grading by hand. It is the only rung that can notice the bar itself is wrong, which makes it the layer that keeps the other two honest, and it is the scarcest instrument you own. The wrong way to use it is a heroic weekend of grading hundreds of transcripts after something breaks. The right way is small and scheduled.

  • A fixed slot. A recurring hour, weekly or close to it, that survives busy weeks.
  • A fixed sample. A few dozen recent outputs, graded blind before anyone looks at what the lower rungs said about the same items.
  • A fixed comparison. Disagreements between your verdicts and the machine rungs become the work list; each one is either a grader bug or a rubric gap.

When two trained people grade the same output and disagree, the instinct is to argue about the output. Resist it, because two calibrated raters disagreeing means the rubric is underspecified. Repair the instrument instead: write the missing clause, add the disputed output as a worked example, and regrade. Formal statistics exist for scoring agreement between raters, but at product-team scale it is enough to treat every disagreement as a defect in the rubric, not in either rater.
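
For teams that do want a number, Cohen's kappa (listed in the sources) is one such agreement statistic: observed agreement corrected for the agreement two raters would reach by chance. The function below is a minimal sketch for two raters with categorical verdicts; the function name and the choice to return 1.0 when both raters use a single identical label are assumptions of this example.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where the two verdicts match.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumption: both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

As a worked case, raters who give `["pass", "pass", "fail", "fail"]` and `["pass", "fail", "fail", "fail"]` agree on three of four items (0.75), would agree by chance half the time (0.5), and so score a kappa of 0.5.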

Reliable is not the same as valid

A grader has to be reliable and it has to be valid, and neither property implies the other.

  • Reliable. The grader returns the same verdict on the same output every time.
  • Valid. The verdict tracks the quality you actually care about.

Deterministic checks are perfectly reliable and only as valid as the property they inspect, so a length check never wavers while saying nothing about whether the answer is true. A judge can be reasonably reliable after anchoring and order swaps and still be invalid if its rubric rewards the wrong thing. A reliable but invalid grader is the more dangerous of the two, because a noisy grader at least looks noisy, while a steady one gives you confidence in a precise measurement of the wrong thing. So every rung earns trust by being checked against the rung above it, not by quietly agreeing with its own past verdicts. When your graders eventually run across an agent fleet, "Verification: make the fleet check its own work" in The Frontier carries this ladder into orchestration.
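
The two properties can be measured separately, which makes the gap between them visible. The sketch below is one rough way to do it, assuming a grader is any function from an output to a pass/fail boolean: reliability is scored as self-agreement across repeated runs, and validity as agreement with human reference verdicts from the rung above. The function names, the sample outputs, and the 40-character limit are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical interface: a grader maps one output to pass (True) or fail (False).
Grader = Callable[[str], bool]


def reliability(grader: Grader, outputs: Sequence[str], repeats: int = 3) -> float:
    """Share of outputs that receive the same verdict on every repeated run."""
    stable = 0
    for output in outputs:
        verdicts = {grader(output) for _ in range(repeats)}
        stable += len(verdicts) == 1
    return stable / len(outputs)


def validity(grader: Grader, outputs: Sequence[str], reference: Sequence[bool]) -> float:
    """Share of outputs where the grader matches the human reference verdict."""
    matches = sum(grader(o) == r for o, r in zip(outputs, reference))
    return matches / len(outputs)


def length_check(text: str) -> bool:
    """A deterministic check: steady, but blind to whether the answer is true."""
    return len(text) <= 40


outputs = ["Refunds take 5 days.", "Refunds are instant.", "We never refund."]
human_reference = [True, False, False]  # illustrative labels: only the first is correct
```

On this illustrative data the length check scores 1.0 on reliability and one in three on validity, which is the steady measurement of the wrong thing that the paragraph above warns about.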

Try it now

Pull ten cases from your eval set with the outputs your product produced for them, and grade the same ten three ways in about half an hour.

Write the deterministic check. Take the most mechanical "must" in your quality bar (a required field, a banned phrase, or a length bound) and have Claude Code write a script that runs it across all ten outputs.

Run a rubric-anchored judge. Give Claude Code the rubric and worked examples, then have it grade the same ten outputs with a verdict and a one-line reason for each.

Grade them yourself, blind. Before reading either instrument's verdicts, record your own pass or fail and the reason for each output.

Reconcile the three columns. Where all three agree, move on. Where they disagree, decide which instrument was wrong rather than which verdict feels right, then fix that instrument: tighten the check, promote the disputed output into the rubric as a worked example, or accept that your own read drifted and update the bar. Every repair stays in the eval.
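
The reconciliation step can be as simple as filtering the rows where the three columns differ. The sketch below assumes each case is a dictionary with hypothetical keys `id`, `code`, `judge`, and `human`, each verdict being "pass" or "fail"; deciding which instrument was wrong stays a human call.

```python
def reconcile(rows: list[dict]) -> list[dict]:
    """Return the cases where the three instruments do not all agree."""
    work_list = []
    for row in rows:
        verdicts = {"code": row["code"], "judge": row["judge"], "human": row["human"]}
        # More than one distinct verdict means at least one instrument is wrong.
        if len(set(verdicts.values())) > 1:
            work_list.append({"id": row["id"], **verdicts})
    return work_list


# Illustrative rows, not real results.
rows = [
    {"id": "case-01", "code": "pass", "judge": "pass", "human": "pass"},
    {"id": "case-02", "code": "pass", "judge": "fail", "human": "fail"},
    {"id": "case-03", "code": "fail", "judge": "fail", "human": "fail"},
]
```

Here only `case-02` lands on the work list: the check passed an output that both the judge and the human failed, so the first question is whether the check inspects the right property.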

Chapter Summary

  • Grading is measurement, so a verdict is only as good as the instrument behind it, and a tired person is the least steady instrument you have.
  • The graders form a ladder by cost: cheap deterministic code at the bottom, a model judge in the middle, and your own team at the top.
  • Give every mechanical check to code, since it returns the same verdict every time, costs nothing per run, and never gets bored.
  • A model judge handles the qualities code cannot reach, but only when it grades against a written rubric with worked examples, never a bare one-to-ten score.
  • Treat the judge as an instrument with known biases: run comparisons in both orders and keep only verdicts that survive the swap, and read a few of its verdicts against the transcripts every run.
  • Keep human grading small and on a schedule: a fixed weekly slot, a fixed sample, graded blind, with disagreements becoming the work list.
  • When two careful graders disagree, the rubric is underspecified, so fix the words and add the disputed output as an example instead of arguing over who was right.
  • A grader has to be both reliable and valid, and a reliable but invalid grader is the worst kind, because it measures the wrong thing with steady confidence.
  • Next up: once your verdicts come from instruments instead of moods, you put them in front of every change you ship in "The regression gate: no change ships blind".

Sources

  • Zheng, L., et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1).