A grieving traveler asked the chatbot on Air Canada's website how bereavement fares worked. It answered that he could book the full-fare flight now and claim the discount within 90 days of the ticket being issued. The answer was specific, confident, and wrong, because no such policy existed, and the page describing the real one sat elsewhere on the same site. When the airline refused the refund, a British Columbia tribunal ordered it to pay, rejecting the argument that the chatbot was a separate entity responsible for its own answers.
The traveler did not lack intelligence or diligence; he lacked a way to check. The answer showed no source, read exactly like the chatbot's correct answers, and described a policy he had no reason to know.
Every AI product gives wrong answers at some rate, and the real design question is whether your user can tell when it does.
Our essay The Human Factors puts the recommendation in one line: help users tell when you're wrong.
Judging an answer takes knowledge the asker may not have
Metacognition is the monitoring you run on your own thinking: the sense that you do or do not understand, and the judgment that decides when to trust an answer and when to check it. Doing the task is one kind of work, and watching whether the task is going well is another. Writing the refund email is the task itself, and stopping to ask whether the policy it cites is even real is the monitoring on top of it.
AI products hand the task to the model but leave the monitoring with the person, so judging the answer becomes the user's whole job. That judgment is exactly what breaks for the users who need the product most. The sensemaking paradox explains why: checking an answer takes knowledge of the subject, and people turn to AI in the first place when their own knowledge of the subject runs out. An expert reads the output, tests it against what they already know, and comes away sharper. A novice has nothing to test it against, because the knowledge they are missing is the very thing they asked for.
The research behind this part found the same failure years before language models. Software returned wrong results whenever it leaned on what the user vaguely remembered, such as a label whose meaning had drifted over time, rather than on details the user could see and confirm on the spot. The recommendation that came out of that work became the principle this chapter is built on: give people feedback they can confirm from what is right in front of them, not feedback they would have to remember or cannot inspect.
Recent research on AI-assisted decisions adds a warning and a tool:
- Explanations can backfire. In controlled studies, adding a paragraph of reasoning under an answer made people accept it more, whether the answer was right or wrong. The explanation is just more polished text from the same system, so it gives the user one more thing to believe instead of a way to check.
- A brief forced check helps. Asking people to commit to their own answer first, or holding the AI's answer back for a moment, measurably cuts how often they over-trust it. People dislike the pause, though, which is why everyday checking has to be cheap rather than forced on them.
Ship the check with the answer
Treat checking as a task you design, not something you leave to the user's unaided judgment. Whatever the system produces, deliver the means of checking it in the same view: the sources it drew on, the actions it took, an artifact someone can compare or run. People decide what to trust from the cues in front of them, and when an interface shows no sources, the only cue left is how fluent the writing sounds, which a language model produces at full strength whether it is right or wrong. A real check has to cost only seconds, because if it costs much more than that, most users will skip it.
What a cheap check looks like in shipped products
Perplexity attaches the check to every claim. Each answer carries numbered citations attached sentence by sentence, with the source list displayed above the answer, so confirming any specific statement is one click rather than a research project. The citations do not make the answers correct, and studies of AI search have caught cited answers that were still wrong. What they do is make the check affordable, which is the part the product controls.
NotebookLM narrows the world to what you can inspect. It answers only from the documents you upload, not the open web, and every answer cites back into those documents, so the check is a click into your own file. The user always knows where a claim could have come from, because the product keeps that set small and visible.
The Air Canada chatbot was the same idea with the check removed: a fluent answer, no source for its claim, and the real policy sitting on a different page of the same site. When someone proposes shipping answers without a verification path, the tribunal's finding is the counterargument, because it ruled that expecting customers to cross-check the bot was unreasonable.
How to build the check into your product
Show what the answer came from. Whatever the system consulted to produce an output (documents, records, pages, tool calls), put it one click away. A claim with no source can only be believed, while a claim with its source attached can actually be checked.
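One way to make this concrete is to carry the sources inside the answer object itself, so the interface cannot render a claim without either a link or a visible warning. The sketch below is a minimal illustration in Python. The `Source`, `Claim`, and `render` names, the wording of the warning, and the example URL are all hypothetical and not taken from any product discussed here.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    title: str
    url: str


@dataclass
class Claim:
    text: str
    # Whatever the system consulted to produce this claim (may be empty).
    sources: list[Source] = field(default_factory=list)


def render(claims: list[Claim]) -> str:
    """Render claims with numbered source markers; flag claims with no source."""
    lines: list[str] = []
    footnotes: list[str] = []
    for claim in claims:
        if not claim.sources:
            # An unsourced claim is shown, but never allowed to look like a sourced one.
            lines.append(f"{claim.text} [no source: verify before acting]")
            continue
        markers = []
        for source in claim.sources:
            footnotes.append(f"[{len(footnotes) + 1}] {source.title} <{source.url}>")
            markers.append(f"[{len(footnotes)}]")
        lines.append(f"{claim.text} {''.join(markers)}")
    return "\n".join(lines + footnotes)


if __name__ == "__main__":
    # Hypothetical example data, for illustration only.
    print(render([
        Claim("Requests must be filed before travel.",
              [Source("Fare policy", "https://example.com/policy")]),
        Claim("The discount can be claimed after the trip."),
    ]))
```

The design choice worth noting is that the warning lives in the renderer rather than in the model's text, so a fluent answer cannot talk its way past it.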
Deliver high-stakes outputs in a form the user can inspect. Show a diff rather than a description of a change, a preview rather than a promise, a sample of the affected records rather than a count. If users cannot inspect the output for themselves, the only thing you have left them with is faith.
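As a small illustration of "a diff rather than a description", the sketch below uses Python's standard `difflib` module to turn a proposed change into something the user can read line by line. The `change_preview` function name and the sample record are assumptions made for the example.

```python
import difflib


def change_preview(before: str, after: str, name: str = "record") -> str:
    """Return a unified diff of the proposed change, or a note if nothing changes."""
    diff = difflib.unified_diff(
        before.splitlines(),
        after.splitlines(),
        fromfile=f"{name} (current)",
        tofile=f"{name} (proposed)",
        lineterm="",  # inputs have no trailing newlines, so emit none either
    )
    text = "\n".join(diff)
    return text or "No changes proposed."


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    print(change_preview("plan: basic\nseats: 5", "plan: basic\nseats: 50"))
```

A summary such as "updated the seat count" would hide the difference between 5 and 50. The diff puts both numbers in front of the user before they accept.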
Spend friction at the single riskiest acceptance point. A forced check works and users grumble at it, so place one (a plan to approve, a commit-first prompt, a short pause) at the moment where blind acceptance does the most damage, and nowhere else.
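A sketch of what "one gate and nowhere else" might look like in code follows. It assumes the team has already named a single riskiest action. The `GATED_ACTION` value, the `execute` function, and the callback shapes are hypothetical.

```python
from typing import Callable

# Assumption: exactly one action is designated as the riskiest acceptance point.
GATED_ACTION = "send_refund"


def execute(
    action: str,
    plan: str,
    run: Callable[[], str],
    approve: Callable[[str], bool],
) -> str:
    """Run an action, asking the user to approve the plan only at the gated action."""
    if action == GATED_ACTION and not approve(plan):
        return "cancelled: plan not approved"
    # Every other action runs without a prompt, so the one gate keeps its weight.
    return run()
```

Here `approve` stands in for whatever shows the plan and waits for the user's decision. Keeping the gate to a single named action is what stops the prompt from turning into something users click through.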
Mark the answers most likely to be wrong, and name the check that settles them. Honest per-answer confidence is mostly out of reach today, so flag the structural risk cases instead: recency-sensitive claims, version-specific details, legal, medical, and money questions. Gemini ships a version of this with its double-check button, which tests an answer's statements against search and highlights which ones found corroboration and which found contradictions. Wherever you put the mark, it belongs where the eye lands first, which is the work of making the warning impossible to miss.
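Flagging structural risk does not require a confidence score. The sketch below pairs a few crude keyword patterns with the check that would settle each one. The rule list, the patterns, and the wording of the checks are illustrative assumptions, and a real product would need far better detection than regular expressions.

```python
import re

# Hypothetical rules: (label, pattern, check that settles it).
RISK_RULES = [
    ("recency", re.compile(r"\b(latest|current|currently|today|this year|as of)\b", re.I),
     "Confirm the date on the linked source."),
    ("version", re.compile(r"\bv?\d+\.\d+(\.\d+)?\b"),
     "Compare against the release notes for that version."),
    ("legal", re.compile(r"\b(contract|liability|lawsuit|statute)\b", re.I),
     "Read the cited clause or ask counsel."),
    ("medical", re.compile(r"\b(dose|dosage|symptom|diagnosis)\b", re.I),
     "Check with a clinician or the product label."),
    ("money", re.compile(r"\b(refund|fee|price|tax)\b|[$€£]\s?\d", re.I),
     "Open the policy or invoice the figure came from."),
]


def risk_flags(answer: str) -> list[tuple[str, str]]:
    """Return (label, check) pairs for every structural risk pattern found in the answer."""
    return [(label, check) for label, pattern, check in RISK_RULES if pattern.search(answer)]


if __name__ == "__main__":
    for label, check in risk_flags("The latest release, 2.4.1, changed the refund window."):
        print(f"{label}: {check}")
```

The point of returning the check alongside the label is that a bare warning tells the user to worry, while a named check tells them what to do about it.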
Try this today: time the check on your riskiest outputs
Set aside fifteen minutes. List the three outputs of your product that users act on with the highest stakes, the ones they send, file, deploy, or pay against. For each one, generate a real example and time how long it takes to confirm it is right using only what the product shows. A check that takes more than a minute, or that requires knowledge your users do not have, is a design gap rather than a user problem. For each gap, write the artifact that would close it: a source link, a diff, a sample of affected records, a one-line test. Keep the list, because it feeds directly into running the human factors audit.
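The exercise can be recorded in a few lines so the result survives the fifteen minutes. This sketch applies the two criteria above: a check that takes longer than a minute, or one that needs knowledge the user lacks. The `CheckTiming` and `design_gaps` names and the sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckTiming:
    output: str                    # the high-stakes output being audited
    seconds_to_confirm: float      # time to confirm it using only what the product shows
    needs_outside_knowledge: bool  # True if the check needs knowledge users lack


def design_gaps(timings: list[CheckTiming], limit_seconds: float = 60.0) -> list[str]:
    """Return the outputs whose check is too slow or needs knowledge users lack."""
    return [
        t.output
        for t in timings
        if t.seconds_to_confirm > limit_seconds or t.needs_outside_knowledge
    ]


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    audit = [
        CheckTiming("refund email", 20.0, False),
        CheckTiming("bulk record update", 240.0, False),
        CheckTiming("contract summary", 45.0, True),
    ]
    print(design_gaps(audit))
```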
Chapter Summary
- Your product will give wrong answers at some rate no matter what you ship, so the design question is whether the user can tell when it does.
- The people who rely on an answer most are usually the least able to judge it, because the knowledge they would need to check it is the knowledge they came to the product to get.
- Adding an explanation under an answer does not help users catch errors. It makes people accept the answer more whether it is right or wrong.
- Design the check as part of the product instead of leaving it to the user, and keep the check cheap enough that it costs only seconds.
- Put the sources behind an answer one click away, so a claim can be checked instead of only believed.
- Deliver high-stakes outputs in a form the user can inspect: a diff, a preview, a sample of the affected records.
- Spend friction at the single riskiest moment of acceptance, like a plan to approve or a commit-first prompt, and nowhere else.
- Mark the answers most likely to be wrong, such as recency, legal, medical, and money questions, and name the check that settles them.
- All of this assumes a person is still looking at the answer. When the system starts acting on its own, the job changes to keeping a human in charge of the agent.
Sources
- Flavell, J. H. (1979). Metacognition and Cognitive Monitoring. American Psychologist, 34(10).
- Garner, R., & Alexander, P. A. (1989). Metacognition: Answered and Unanswered Questions. Educational Psychologist, 24(2).
- Pirolli, P., & Russell, D. M. (2011). Introduction to this Special Issue on Sensemaking. Human-Computer Interaction, 26(1-2).
- Pirolli, P. (1999). Human-Information Interaction: Technology and Theory (Information Foraging).
- Bansal, G., et al. (2021). Does the Whole Exceed its Parts? The Effect of AI Explanations on Complementary Team Performance. CHI 2021.
- Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI. PACM HCI, 5(CSCW1).
- Moffatt v. Air Canada, 2024 BCCRT 149 (British Columbia Civil Resolution Tribunal).
- Tow Center for Digital Journalism (2025). AI Search Has a Citation Problem. Columbia Journalism Review.