Your AI feature has been live for three hours. The launch went smoothly, the dashboard is green, and then a user who is bored rather than malicious starts poking at your support assistant. They paste a few lines telling it to ignore its instructions and read its own setup back to them, and it does, printing the internal rules you wrote, including the discount it is allowed to offer before a human has to sign off. A screenshot goes up, and within the hour it has thousands of views. Your team is crowded around one laptop watching the screenshots spread, and the honest truth is that nobody has a next move, so everybody just watches. There is no playbook for this moment, because the first person to seriously attack the product was a stranger, in public, for free.
Attack your own product before a stranger does
Red-teaming is simple to define. You attack your own product first, hunting for the inputs that make it misbehave, before someone outside hunts for them. The name comes from security practice, where a red team plays the adversary against your own defenses. For an AI feature the adversary is anyone who can type: a curious customer, a competitor, a teenager with an evening to spare, a script someone pointed at your endpoint. None of them need your source code, because the way in is the same text box your real users type into, which means the cost of trying an attack is one sentence. What a break looks like depends on the product: a support assistant pushed into honoring a refund policy that does not exist, a coding assistant steered into printing an API key sitting in its context, or a research summarizer that follows an instruction buried in the document it was asked to read.
The mindset trips people up more than the technique does. A red-team session is meant to break things, and a session that breaks nothing is not a clean bill of health.
An attack run that finds nothing is rarely proof the product is safe: if your red team never wins, the red team is broken, not the product.
When you attack honestly and come up empty, the usual reason is that you pulled your punches, tried the inputs you already knew were handled, and stayed away from the ugly ones. Play the attacker who wants you to fail, not the builder who wants the demo to go well.
Turn every break into a test that runs forever
A red-team session that ends in a shared document is one you will run again from scratch next quarter, because a written finding gets read once, fixed once, and then forgotten. The work only compounds when each finding leaves the document and becomes part of your eval. We run that handoff as a loop, laid out like this.
You attack the product, you log what broke, you write the break as an eval case that pairs the input with the behavior it should never produce, and that case joins Cases: build the set that samples reality so it runs at The regression gate: no change ships blind. Then you go again.
A break fixed only in a document gets patched once and quietly comes back; a red-team finding is only finished when it becomes an eval case that runs on every release.
This is the whole difference between red-teaming as an event and red-teaming as a habit. The event produces a frightening slide for a meeting, while the habit produces a test suite that gains a case every time someone breaks the product, so the same break can never ship twice.
This is normal practice, not paranoia
If attacking your own product still feels paranoid, the rest of the field has already settled the question. At DEF CON 31 in August 2023, more than 2,200 people lined up to attack AI models from the major vendors at once, in a public exercise that the White House openly supported. Over two and a half days they traded more than 160,000 messages with the systems while probing for bias, leaked data, and unsafe output. It was not a stunt. It was the largest public version of something the labs already do in private, where standing red teams and open networks of outside experts attack each new model before it ships. Adversarial testing is a normal stage of building, the same way code review is, and the move for a product team is to schedule it rather than admire it.
Let a model generate the attacks
The next objection is time, since one person can only dream up so many attacks in an afternoon. The answer is that the attacks can be generated for you. In February 2024 Microsoft open-sourced PyRIT, a toolkit that writes adversarial prompts and adapts each attempt to how the system answered the last one, so it keeps producing fresh attacks instead of repeating itself. NVIDIA's open-source garak scanner probes models for jailbreaks, leaked data, and injection, and Lakera's Gandalf turned the same idea into a public game where players try to get a model to print a secret password across seven rising levels of defense. The point is volume rather than brilliance, because a model can produce attack after attack without tiring, which is exactly what a thorough red team needs.
You can run a small version with the tool already in your hands. Paste your quality bar into Claude Code and ask it, for each statement, to write the cheapest input that satisfies the letter of the rule while breaking its intent, the same move drilled in The quality bar: decide what good means. Then take its sharpest suggestions and fire them at the real feature, in a copy you are safe to break.
Keep the session inside a sandbox
A short paragraph of ground rules keeps a red-team session from turning into the incident you were trying to avoid. Attack your own product and no one else's, run against a test tenant instead of production, and seed that tenant with fake data so that a successful theft of records leaks a made-up account rather than a real customer. Then write every finding somewhere the whole team can read it, not a private notebook, because a break only one person knows about is not a finding, it is a secret.
Try it now
Block one hour, point it at a safe copy of your own product, and finish with tests instead of screenshots.
- Stand up a test tenant. Spin up a non-production copy of your feature, fill it with fake data, and confirm the session cannot reach a single real account.
- Make your threat model the menu. Open the threat lines you wrote in Threat-model your AI feature. Each line names something that must never happen, so each line is an attack to attempt.
- Attack for forty minutes. Work down the menu and try to make each forbidden thing happen. Start with plain inputs, then have Claude Code, Codex, or Cursor generate variations on anything that nearly worked, so you are not limited to the attacks you can invent by hand.
- Write the eval case the same hour. For every break, do not stop at a screenshot. Capture the input, the behavior it produced, and the rule it violated as an eval case, and add it to the set that runs at your regression gate before you move on to the next attack.
- Name next week's target. Pick the scariest thing you could not break today, put it at the top of next week's session, and the exercise becomes a weekly habit instead of a one-time scare.
Chapter Summary
- Red-teaming means attacking your own product first to find the inputs that make it misbehave, before a stranger finds them for free.
- On launch day the first real attacker is often a bored user, and watching the screenshots roll in is not a plan.
- If an honest attack session breaks nothing, suspect the session before the product, because you have probably pulled your punches.
- A finding written in a document gets fixed once and comes back; a finding turned into an eval case runs on every release and cannot ship broken twice.
- The loop is short: attack, log the break, write the eval case, add it to the gate, and repeat.
- Public exercises like the DEF CON red team show that adversarial testing is a normal stage of building, not paranoia.
- Tools can generate attacks for you, and you can point a model at your own quality bar to produce the cheapest exploit of every rule.
- Keep every session in a sandbox: your product, a test tenant, fake data, and findings the whole team can read.
- Next, fold all of this into Write your Security Posture and ship defended.
Sources
- AI Village, Humane Intelligence, and SeedAI, the Generative Red Team Challenge at DEF CON 31, with participation figures reported by Business Wire and the White House announcement of support (May and August 2023).
- Microsoft Security Blog, the open-source release of PyRIT, the Python Risk Identification Tool for red-teaming generative AI systems (February 2024).
- NVIDIA, garak, the open-source LLM vulnerability scanner and red-teaming kit (2024).
- Lakera, Gandalf, the public prompt-injection game built from an internal red-versus-blue hackathon (2023).
- OpenAI, the Red Teaming Network for external adversarial testing of new models (September 2023).