Prompt injection: the input is the attack surface · The Builder's Stack

Limit the damageRecover from failures

Your assistant triages the support inbox: read new mail, pull customer history, draft replies, escalate what matters. Today one message looks routine, a shipping question with a polite sign-off. Below the signature, in grey text styled to vanish into the footer, sits a paragraph addressed to the assistant rather than to your company, telling it to search the mailbox for the latest pricing thread and forward it to the address that follows. The assistant holds search and send among its tools, the email sits in its context, and nothing in the model separates the customer's question from the stranger's instruction. Nobody at your company clicked a link, opened an attachment, or approved anything; the input itself was the attack.

Injection is not jailbreaking

It pays to be precise about which attack this is, because the better-known one trains the wrong instincts. A jailbreak is a user attacking their own session: someone types adversarial text at a chatbot to talk it past its vendor's rules, and the consequences land on the vendor's policy. Attacker and affected user are the same person, which is why jailbreak defenses lean on refusals.

Injection is the opposite, content attacking the user's agent. The attacker never logs in and never touches your product directly. They write something your agent will eventually read, an email, a page behind a fetch, a document in an upload queue, a calendar invite, and when your agent reads it, the embedded instructions run with your user's authority. Your user is the victim and your product is the delivery mechanism. Security researchers call this class indirect prompt injection, the indirection being that the payload arrives through content rather than from the person at the keyboard.

Once an agent holds tools, reading content can trigger actions

A chat product that read a hostile page could produce a wrong or rude answer, embarrassing but bounded. An agent raises the stakes because of what this part has been building: The action surface: every tool is delegated authority had you grant capabilities action by action, each carrying your user's authority, and injection is that same authority borrowed by whoever can get text into the window.

The model receives one stream of tokens in which your system prompt, the user's request, and the attacker's email all arrive as plain text, and it cannot reliably tell the text that is supposed to give it orders apart from the text it is only supposed to read. A system-prompt line telling it to ignore instructions found in documents is just more text in the same stream, which is why that line helps a little and can never be the real defense.

The three things an attacker needs to steal data

Naming what the attack requires turns dread into a checklist, and the checklist has a name worth memorizing, the lethal trifecta.

An agent that holds three things at once, private data, untrusted content, and a way to send data out, can be steered into exfiltration: moving what it was trusted with out to a destination the attacker picked.

Private data is whatever the session can reach, the mailbox, customer records, retrieved files. Untrusted content is any text an outsider can write. The way data gets out is the leg teams forget to count: not just a send tool, but a web fetch whose URL the model writes, a webhook, and a rendered image link, because the data leaves hidden inside the URL itself.

When all three legs are present, a crafted input can route private data out no matter how good the model is, and taking away or gating any one leg drops the attack to a nuisance. That is what makes the trifecta useful: it turns a detection problem nobody has solved into a design problem you can close.

EchoLeak: all three legs in a shipped product

This stopped being a whiteboard concern in June 2025, when researchers disclosed EchoLeak, a zero-click prompt-injection vulnerability in Microsoft 365 Copilot. A crafted email could cause the assistant to pull internal data into an output path that reached the attacker, and the email only had to be processed for it to fire. All three legs were present: private data within the assistant's reach, inbound email letting strangers author the context, and an output channel that carried data back out. Microsoft patched it, and the lesson is not about one vendor. The three legs came together inside one of the most heavily engineered AI products in the industry because each one arrived as a reasonable feature on its own. Audit your own product the same way, one leg at a time rather than one feature at a time.

Design as if injection succeeds

OWASP's Top 10 for LLM applications has ranked prompt injection the number-one risk since the list first shipped, and the current revision still treats it as unsolved: mitigations reduce it, none remove it. Filters and classifiers do catch real attacks, but attackers keep tweaking their payloads offline until one slips past. So we build from the same position blast radius taught: assume an attack eventually gets through, and limit how much damage it can do.

Treat content you fetched as data to read, never as instructions to follow. Split the reader from the actor: the step that reads untrusted content runs with no tools except handing back its summary, and the step that holds the send and write tools acts only on the user's request plus that walled-off summary. The stranger's text then arrives where there is no authority for it to borrow.
Pause outbound actions while untrusted content is in the context. The autonomy ladder: place every action deliberately set each action's level assuming the context was clean, so add one condition: an external send that normally runs on its own falls back to asking first the moment the window holds text an outsider wrote. That approval costs one click, spent exactly when the context is least trustworthy.
Allowlist who can receive data and where it can go. Limit sends to addresses already on the thread, fetches to approved hosts, and webhooks to endpoints you registered. Each list closes the way data gets out without anyone having to spot the attack first.
Keep the blast-radius caps on. The scoped credentials, caps, and separated environments from Blast radius: bound what one turn can touch were built for accidents, and they pull double duty here, because a hijacked turn can only reach what any normal turn could already reach.

Try it now

The drill takes about fifteen minutes and runs on your own agent feature, real or planned.

Pick one untrusted path. Choose a single input an outsider can author that your product reads: an inbound email, a fetched web page, an uploaded document, a ticket body, a calendar invite. If several qualify, take the one your product processes most often.

Trace it to the worst reachable action. Follow that text from ingestion into context, list every tool live in the same session, and pick the most dangerous action the content could plausibly steer. Claude Code makes a fast adversary: paste the path description and your tool list and ask it to write the email, page, or document an attacker would craft to reach that action.

Name the legs. For that session, write down where each trifecta leg stands: what private data is reachable, where untrusted content enters, and every outbound channel, counting composed URLs and rendered links rather than only the send tools.

Gate one leg and add the gate. Pick the leg that costs least to close: an allowlist on destinations, an approval on outbound actions whenever the context holds outsider text, or a reader step stripped of tools. Add it to the product, or to the spec if the feature is still on paper, and record the decision next to your action inventory.

Chapter Summary

A jailbreak is a user talking their own session past the rules; injection is outsider content attacking your user through the agent you shipped, so the two need different defenses.
The attacker never logs in, but writes text your agent later reads, and that hidden instruction runs with your user's authority.
A system-prompt line telling the model to ignore instructions in documents is just more text in the same stream, so it can never be the real defense.
An attack can steal data only when three things are present at once: private data the session can reach, untrusted content an outsider wrote, and a way to send data out. Gate any one and the attack drops to a nuisance.
Count the ways data can leave carefully, because that is the leg teams miss: a send tool, a fetch whose URL the model writes, a webhook, and even a rendered image link all carry data out.
EchoLeak, found in Microsoft 365 Copilot in 2025, showed all three legs coming together in a heavily engineered product because each arrived as a reasonable feature.
OWASP still ranks injection the top risk and treats it as unsolved, so rely on structural gates, approvals, and allowlists, not on a filter that catches attacks only some of the time.
Some run will still go wrong, by attack or accident, and what saves you then is seeing what happened and putting it back, the other half covered in Receipts and recovery: design for the failed run.

Sources

OWASP Top 10 for Large Language Model Applications (2023; updated 2025).
Simon Willison, writing that named the lethal trifecta for AI agents (2025).
Security research disclosure and press reporting on EchoLeak, a zero-click prompt-injection vulnerability in Microsoft 365 Copilot (disclosed and patched, 2025).

Marks this chapter complete on your course map. Reaching the end does this for you.