Agent evals: judge the trajectory, not just the answer · The Builder's Stack

The promotion review for your booking agent opens on the eval dashboard, and the news is good: the outcome grader confirms each booking exists, matches the request, and sits inside policy, and the pass rate has held at the bar for a month.

Then someone opens one recorded run. The booking is there, but the path behind it ran the same availability search four times, pulled the full customer profile to read one preference field, and called the payment tool before the confirmation step, where a validation error bounced it. The outcome grader saw only the booking; the trace holds the rest. The run passed by luck: only a validation check stopped the premature payment call, and the day that check moves, the same path becomes a real failure.

Grade every run twice: the result and the path

Everything the Evals part built carries over: "The quality bar: decide what good means" still defines a pass, graders still return repeatable verdicts, and the gate still decides what ships. What changes is the thing you grade. A product that answers produces one output you grade on its own, while a product that acts produces a trajectory.

A trajectory is the recorded sequence of tool calls, arguments, and results the agent took from the goal to the final state, and it gets two separate verdicts: one for the result and one for the path.

The outcome eval asks whether the job got done, so it checks that the booking exists and the reply matches what the user asked for. The process eval asks whether the path was sane: the right tools, no wasted calls, no loops, no policy line crossed. The two verdicts can disagree, and the booking run above is the proof.
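One way to picture the trajectory and its two verdicts is as plain records. This is a minimal sketch, not a prescribed schema: the class names, field names, and tool names (`search_availability`, `charge_payment`, `confirm_booking`) are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ToolCall:
    """One step in a trajectory: the tool, its arguments, and what came back."""
    tool: str
    args: dict[str, Any]
    result: Any = None


@dataclass
class Trajectory:
    """The recorded path from the goal to the final state."""
    goal: str
    calls: list[ToolCall] = field(default_factory=list)
    final_state: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class RunVerdict:
    """Two verdicts, stored separately so they are free to disagree."""
    outcome_pass: bool
    path_pass: bool
    path_findings: tuple[str, ...] = ()

    @property
    def lucky_pass(self) -> bool:
        # Right result, wrong path: the case the outcome grader cannot see.
        return self.outcome_pass and not self.path_pass


# Hypothetical version of the booking run described earlier.
booking_run = Trajectory(
    goal="Book a room for 2025-03-01",
    calls=[
        ToolCall("search_availability", {"date": "2025-03-01"}, ["room-12"]),
        ToolCall("charge_payment", {"amount": 120}, {"error": "validation"}),
        ToolCall("confirm_booking", {"room": "room-12"}, {"ok": True}),
    ],
    final_state={"booking_confirmed": True},
)
verdict = RunVerdict(
    outcome_pass=True,
    path_pass=False,
    path_findings=("charge_payment called before confirm_booking",),
)
```

Keeping `outcome_pass` and `path_pass` as separate fields, rather than one combined score, is what lets a dashboard surface the lucky passes instead of averaging them away.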

Write the rubric that scores the path

The process rubric applies the quality bar's discipline to the agent's path, written so a trace can be scored against it.

Right tools. The run used the tools this job needs and no others. A scheduling job that touched the refund tool is a finding no matter how the run ended.
No wasted calls. Decide how many calls a competent run needs, add some margin, and treat anything past that as a finding, whether it is the same search repeated or a whole table fetched to read one row. The waste costs money, latency, and exposure.
No loops. A run fails when it repeats the same call with the same arguments back to back, cycles without making progress, or stops only because the step budget ran out. A timeout is not a completion.
No lines crossed. The run respects the hard rules this part already set: no message sent to an address that was not on the thread, no write before the read that verifies it, no retry of an action you cannot undo. Crossing one fails the run, whatever the outcome grader says.
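The first three rubric lines can be sketched as checks over a structured trace. This is an illustrative sketch under stated assumptions: the trace is assumed to be a list of dicts with `tool` and `args` keys, and the tool names, the allowlist, and the budget of 6 are hypothetical values, not recommendations.

```python
import json
from collections import Counter

# Hypothetical allowlist and budget for a booking job.
ALLOWED_TOOLS = {"search_availability", "get_preference",
                 "charge_payment", "confirm_booking"}
CALL_BUDGET = 6  # what a competent run needs, plus margin (assumed)


def _key(call):
    # Same tool with same arguments means the same call, whatever the key order.
    return (call["tool"], json.dumps(call["args"], sort_keys=True))


def check_path(calls, allowed_tools, call_budget, hit_step_limit=False):
    """Return a list of findings; an empty list means the path passes."""
    findings = []

    # Right tools: anything outside the allowlist is a finding.
    for tool in sorted({c["tool"] for c in calls} - allowed_tools):
        findings.append(f"wrong tool: {tool}")

    # No wasted calls: total count against the budget, plus exact repeats.
    if len(calls) > call_budget:
        findings.append(
            f"over budget: {len(calls)} calls against a budget of {call_budget}")
    keys = [_key(c) for c in calls]
    for (tool, _), count in Counter(keys).items():
        if count > 1:
            findings.append(
                f"wasted: {tool} called {count} times with identical arguments")

    # No loops: the same call back to back, or a stop forced by the step limit.
    if any(prev == cur for prev, cur in zip(keys, keys[1:])):
        findings.append("loop: identical call repeated back to back")
    if hit_step_limit:
        findings.append("stopped on the step budget, not on completion")

    return findings


# Hypothetical trace modeled on the booking run from the opening.
run = (
    [{"tool": "search_availability", "args": {"date": "2025-03-01"}}] * 4
    + [
        {"tool": "get_customer_profile", "args": {"customer_id": "c-1"}},
        {"tool": "charge_payment", "args": {"amount": 120}},
        {"tool": "confirm_booking", "args": {"room": "room-12"}},
    ]
)
findings = check_path(run, ALLOWED_TOOLS, CALL_BUDGET)
```

Returning findings rather than a bare boolean keeps the verdict explainable: the path passes when the list is empty, and every entry names the rubric line that was broken.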

Most of these checks run deterministically as counts, allowlists, and ordering checks over a structured trace, cheap enough to run on every change. The judge from "Graders: deterministic, judges, and humans" stays on for the rest, the plan-quality questions that counting cannot settle, calibrated the same way as before.
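An ordering check is the least obvious of the three deterministic kinds, so here is a minimal sketch of one. It assumes the trace has been reduced to an ordered list of tool names, and the rule pair (`confirm_booking` must come before `charge_payment`) is a hypothetical encoding of the booking example.

```python
def check_order(tool_sequence, must_precede):
    """Flag each rule whose `later` tool runs before any `earlier` tool has.

    must_precede is a list of (earlier, later) pairs of tool names.
    """
    findings = []
    for earlier, later in must_precede:
        seen_earlier = False
        for step, tool in enumerate(tool_sequence):
            if tool == earlier:
                seen_earlier = True
            elif tool == later and not seen_earlier:
                findings.append(f"step {step}: {later} called before {earlier}")
                break  # one finding per rule is enough to fail the run
    return findings


# Hypothetical rule: payment only after the confirmation step.
ORDER_RULES = [("confirm_booking", "charge_payment")]

bad_run = ["search_availability", "charge_payment",
           "confirm_booking", "charge_payment"]
good_run = ["search_availability", "confirm_booking", "charge_payment"]

bad_findings = check_order(bad_run, ORDER_RULES)
good_findings = check_order(good_run, ORDER_RULES)
```

The same shape covers "no write before the read that verifies it": the read is the `earlier` tool and the write is the `later` one.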

Turn recorded runs into test cases

Where cases come from has not changed.

Real runs first. The ledger and traces from "Receipts and recovery: design for the failed run" are where you start. Each incident, near miss, and odd trace becomes a case, with the starting state and goal as the input and the path you expected as the standard, such as staying within the call budget and never touching the payment tool.
Scripted runs for the gaps. Write the trajectories your traffic has not produced yet, like the ambiguous goal, the mid-run reversal, and the upstream timeout, and tag them to retire once the real thing shows up.
Adversarial always. Add the run where a hostile page in the fetch path tries to steer the agent, since the attack from "Injection: the input is the attack surface" is the failure where the result looks fine while the path crossed a line.
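A trajectory case from any of these three sources can share one record shape. The sketch below is an assumption about what such a record might hold, with invented field names: the starting state and goal as the input, the expected path as the standard, and a flag that marks scripted cases for retirement.

```python
from dataclasses import dataclass
from typing import Any, Literal


@dataclass
class TrajectoryCase:
    """One eval case: input is state plus goal, standard is the expected path."""
    case_id: str
    source: Literal["real", "scripted", "adversarial"]
    starting_state: dict[str, Any]
    goal: str
    call_budget: int                           # standard: stay within this
    forbidden_tools: frozenset[str] = frozenset()  # standard: never touch these
    retire_when_real: bool = False             # scripted stand-in for a future real run


def path_meets_standard(case, tools_called):
    """Check a run's ordered tool names against the case's path standard."""
    within_budget = len(tools_called) <= case.call_budget
    clean = not (set(tools_called) & case.forbidden_tools)
    return within_budget and clean


# Hypothetical case filed from a near miss on a reschedule job.
reschedule_case = TrajectoryCase(
    case_id="incident-reschedule-01",
    source="real",
    starting_state={"booking": "room-12", "date": "2025-03-01"},
    goal="Move the booking to 2025-03-02",
    call_budget=5,
    forbidden_tools=frozenset({"charge_payment"}),
)

ok = path_meets_standard(
    reschedule_case, ["search_availability", "update_booking"])
crossed = path_meets_standard(
    reschedule_case, ["search_availability", "charge_payment", "update_booking"])
```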

Rerun the whole set on every tool change

"The regression gate: no change ships blind" already counted tool descriptions among the changes that shift behavior, and an agent lengthens that list, because behavior now includes which tools get called and in what order.

A tool description edit. The model picks tools from the words in the description, so a reworded sentence redistributes calls across every job that can reach the tool.
A new tool. Adding one changes the choice for every existing tool, and removing one does the same in reverse.
A model upgrade. Outcomes can hold steady while paths shift to different tools, longer exploration, and different retries, so an upgrade reruns both verdicts.

The floor is yesterday's pass rate on outcomes and paths together, and nothing below it merges.
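The floor can be sketched as a single comparison. This sketch reads "outcomes and paths together" as a joint rate, where a case counts as passed only when both verdicts pass; that reading, and the function names, are assumptions rather than the one definition of the gate.

```python
def joint_pass_rate(results):
    """results: list of (outcome_pass, path_pass) pairs, one per case."""
    if not results:
        raise ValueError("an empty eval set cannot establish a pass rate")
    passed = sum(1 for outcome_ok, path_ok in results if outcome_ok and path_ok)
    return passed / len(results)


def may_merge(results, yesterday_rate):
    """Nothing below yesterday's joint pass rate merges."""
    return joint_pass_rate(results) >= yesterday_rate


# Hypothetical run of four cases: every outcome passes, one path does not.
todays_results = [(True, True), (True, False), (True, True), (True, True)]
rate = joint_pass_rate(todays_results)
```

With these numbers an outcome-only gate would report a perfect score, while the joint rate drops to 0.75 and blocks the merge if yesterday's floor was higher.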

Run evals against sandboxed tools, never production

An eval that grades answers only reads text, so it can run anywhere. An agent eval calls tools by design, so pointed at production it books real rooms, fires real webhooks, and mails real customers, and an eval that sends real email is not a test but an incident. So you run the set against a sandboxed copy of the tools instead: reads come from fixtures or staging copies, writes land in mocks that record the attempted call without performing it, and graders check the attempt, its recipient, amount, and order.
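A sandboxed tool layer can be very small. The sketch below assumes a hypothetical `Sandbox` class, fixture keys, and tool names; the point it illustrates is that reads come from fixtures, writes are recorded in order without being performed, and the grader inspects the attempts.

```python
class Sandbox:
    """Stand-in tool layer for evals: fixture reads, recorded writes."""

    def __init__(self, fixtures):
        self.fixtures = fixtures
        self.attempted_writes = []  # ordered list of (tool, args)

    def read(self, key):
        # Reads never leave the sandbox; they come from fixture data.
        return self.fixtures[key]

    def write(self, tool, **args):
        # Record the attempt and do nothing else: no booking, no email.
        self.attempted_writes.append((tool, args))
        return {"status": "recorded"}


# Hypothetical run against the sandbox.
sandbox = Sandbox(fixtures={"availability:2025-03-01": ["room-12"]})
rooms = sandbox.read("availability:2025-03-01")
sandbox.write("confirm_booking", room=rooms[0])
sandbox.write("charge_payment", amount=120)
sandbox.write("send_email", to="guest@example.com")

# Grader side: check each attempt's recipient, amount, and order.
thread_addresses = {"guest@example.com"}
emails = [args for tool, args in sandbox.attempted_writes if tool == "send_email"]
recipients_ok = all(e["to"] in thread_addresses for e in emails)
write_order = [tool for tool, _ in sandbox.attempted_writes]
charged = [args["amount"] for tool, args in sandbox.attempted_writes
           if tool == "charge_payment"]
```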

Sandboxing is also what lets you replay a trajectory case, since a run replays faithfully only from a starting state you can restore, inside the environment separation from "Blast radius: bound what one turn can touch".

Two public benchmarks worth borrowing from: SWE-bench and tau-bench

These two sit at opposite ends and show both kinds of grading. SWE-bench scores the outcome: agents work real issues from public software repositories, graded only on whether the change resolves the issue, with the path left open. tau-bench scores the policy: multi-turn jobs with a simulated user, graded on finishing while following the domain's rules, so an agent that satisfies the user by breaking policy still fails. Most shipped agents sit closer to the policy end, and both give you patterns to copy: the simulated user for multi-turn cases, and the resolved-issue check for outcome graders.

Pass rates are the evidence for moving an action up the ladder

"The autonomy ladder: place every action deliberately" said that promoting an action takes evidence, meaning passing eval results plus a clean stretch in production, and the trajectory eval gives you both. The eval side is the action's cases passing at the bar, with outcomes done and paths clean, held in place by the gate so later edits cannot rot the record. The production side is that same rubric read over live receipts at the current rung, so the evidence the ladder asked for now has something real to point to.

Try it now

The drill takes about twenty minutes and runs on your own agent feature, real or planned.

Record three real runs. Pull three complete traces, goal to final state, every tool call with arguments and results, from your run ledger or trace store. If the feature is still on paper, script three planned runs: the goal, then the calls a competent run would make.

Write the process rubric they should have followed. Keep it to a page: the tools this job may use, the call budget plus margin, and the lines never crossed, drawn from your action inventory and rung placements. A rule you cannot check from a trace gets rewritten until you can.

Score each run one call at a time. Mark each call as needed or wasted, right tool or wrong, inside the lines or over, and then give each run two verdicts, one for the outcome and one for the path, recorded separately. Claude Code is a quick harness here: paste a trace and the rubric, ask for a verdict table with one row per call, then spot-check its verdicts before you trust them.

File every finding as a trajectory case. Each failed call becomes a case, with the starting state and goal as the input and the rubric line it broke as the standard. Those rows start your agent set, and they run at the gate from now on.
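The last two steps of the drill can be sketched as a verdict table with one row per call, rolled up into a path verdict and then into cases. The record shape and names below are hypothetical; the per-call marks themselves still come from you or from a spot-checked harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallMark:
    """One row of the verdict table: the marks for a single call."""
    step: int
    tool: str
    needed: bool
    right_tool: bool
    inside_lines: bool
    rubric_line: str = ""  # the rubric line this call broke, if any

    @property
    def clean(self):
        return self.needed and self.right_tool and self.inside_lines


def path_verdict(marks):
    """The path passes only when every call is clean."""
    return all(m.clean for m in marks)


def findings_to_cases(run_id, starting_state, goal, marks):
    """Each failed call becomes a case: state and goal in, rubric line as standard."""
    return [
        {
            "case_id": f"{run_id}-step{m.step}",
            "input": {"starting_state": starting_state, "goal": goal},
            "standard": m.rubric_line,
        }
        for m in marks
        if not m.clean
    ]


# Hypothetical marks for a three-call run.
marks = [
    CallMark(0, "search_availability", True, True, True),
    CallMark(1, "search_availability", False, True, True, "No wasted calls"),
    CallMark(2, "confirm_booking", True, True, True),
]
cases = findings_to_cases("run-7", {"date": "2025-03-01"}, "Book a room", marks)
```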

Chapter Summary

A product that acts produces a trajectory, the recorded sequence of calls it made to reach the goal, so you grade every run twice, once on the outcome and once on the path.
A run can reach the right result along a path you would never approve, and that lucky pass is a failure waiting to land when conditions shift.
Write a process rubric that scores the path: the right tools and no others, no wasted calls, no loops, and no hard line crossed.
Most of the rubric runs as cheap deterministic checks, with a calibrated judge kept only for the plan-quality questions counting cannot answer.
Build cases from real runs first, add scripted runs for the gaps your traffic has not produced, and always include an adversarial run.
Rerun the whole set on every behavior-shifting change, including a reworded tool description, an added or removed tool, and a model upgrade, and never merge below yesterday's pass rate.
Run evals against sandboxed tools, never production, so writes land in recording mocks and the worst bug in your harness only burns a fixture.
A pass rate held at the bar, plus a clean stretch in production, is the evidence that moves an action up the autonomy ladder.
Next up is Write the Agent Charter and ship with authority you chose, which gathers every decision from this part into the one document your agent ships under.

Sources

Jimenez, C. E., et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Sierra (2024). tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.
Published agent-building guidance from major AI labs on tool scoping, least privilege, and sandboxed execution (2024).
