How We Caught AI Mistakes With a Second Agent

When you run many workers in parallel, each one hands back something that looks finished. That is the trap. Finished-looking and correct are different properties, and a confident wrong answer reads exactly like a right one at the moment it lands. So across this build, the work a maker produced was never trusted on its own word. A second worker, with a single job, tried to take it apart first.

The pattern was the same whatever the maker had made. A drafted chapter went to a reviewer whose task was to check it against the voice rules, line by line, and report violations. A set of claims went to a reviewer whose task was to verify each one against reality and flag anything that did not hold. Links went to a check that visited each route and reported the ones that led nowhere. The reviewers were not asked "is this good." They were asked to find what was wrong with it, and they found real things: a banned word that had slipped past the maker, a sentence built as an aphorism, a statement about a real legal case whose status had moved on since the draft was written, links that pointed at pages that no longer existed.

The reason to split the two roles is not process for its own sake. A maker is built, by the framing of its task, to produce a good version of the thing and to believe it has. Ask the same worker to then judge its own work and you get a lenient judge, because the judgment is shaped by the same instruction that made the thing. A separate reviewer, told to assume the work is flawed and to go looking, has no stake in the answer being yes.

A maker and a skeptic are different jobs, and giving both to the same worker is how confident mistakes ship.

The framing of the skeptic matters as much as its separation. A reviewer told to "check this" tends to confirm it. A reviewer told to refute the work, to start from the assumption that something is wrong and report what, actually looks, and the difference in what comes back is large. The point is to spend the effort before the reader does, not after.

None of this is particular to building a website. It is the discipline the Evals part of The Builder's Stack teaches for any AI product: you sample the failure before production samples it for you, and a regression check that runs every time is how a fixed mistake stays fixed. Pairing a maker with a skeptic is that same idea moved one step earlier, into the act of making itself.

Where this goes next

The full version of this discipline, building the test set, grading against it, and gating every change on it, is the Evals part of The Builder's Stack. For where verification sits in the ownership of a shipped product, see the Track section of The AI-Native PM's Ownership Map. And for the parallel making that creates the need for this check in the first place, read How We Used Agent Fleets to Build This Site.

Sources

The verification discipline this applies is taught in the Evals part of The Builder's Stack.
Where it sits in owning a live product is the Track section of The AI-Native PM's Ownership Map.