AI Red Teaming: Break Your LLM Before Attackers Do
A structured red team should test four things in order: threat model, adversarial prompts, tool-abuse paths, and output validation gaps. This post shows a repeatable methodology for finding the failure modes attackers are most likely to exploit in 2026.
Why LLM Red Teaming Fails When You Start With Prompts
Apache Struts CVE-2017-5638 was the kind of bug that made a mockery of “we’ll catch it in testing.” A malformed request header, OGNL injection, and Equifax lost 147 million records. The lesson was not “test harder.” It was: if you don’t understand the attack path first, your testing will happily miss the thing that matters. LLMs are doing the same trick to security teams right now, just with better PR.
If you want to red team an LLM system in 2026, do not start by throwing jailbreak prompts at it like you’re auditioning for a conference demo. Start with the attack surface you actually built: what the model can see, what it can call, what it can emit, and what your downstream systems trust. That order matters. Most failures I see are not “the model answered badly.” They are “the model was allowed to do something stupid, and the rest of the stack believed it.”
Start With the Threat Model You Wish You Had Written
A real threat model for an LLM app is not “prompt injection” scribbled in a box. It is a map of trust boundaries: user input, system prompt, retrieval layer, tool layer, memory, output channels, and any human-in-the-loop review. If your chatbot can query Jira, Slack, Salesforce, or GitHub, then the model is not just generating text. It is a broker for access decisions, and brokers fail in interesting ways.
The useful question is: which attacker are you actually defending against? A bored user trying to make the bot swear is not the same as a competitor exfiltrating internal docs through retrieval, and neither is the same as a fraudster using your agent to trigger account actions. The threat model should name the abuse cases, the target assets, and the blast radius if the model is wrong. If you cannot state what happens when the model is tricked into calling a tool with attacker-controlled arguments, you are not red teaming. You are decorating.
Probe Prompts Like an Attacker, Not Like a Demo User
Prompt attacks are still real, but the useful ones are rarely the cute “ignore previous instructions” variants people paste into slides. The better attacks exploit instruction hierarchy, hidden context leakage, and retrieval contamination. In practice, that means testing whether a malicious document in Confluence, Notion, or SharePoint can override system intent once it is retrieved into context. It also means checking whether the model leaks chain-of-thought, policy text, or tool schemas when nudged by indirect prompts.
You should test for cross-turn persistence too. A lot of systems fail because they treat a single safe response as proof that the session is safe. It isn’t. An attacker can often plant state early, then exploit that state later when the model has already “accepted” a bogus premise. That is the LLM version of getting a foothold and waiting for the operator to make the mistake for you. Very efficient. Annoying, too.
One contrarian point: do not overinvest in “prompt sanitization.” If your defense is mostly regexes stripping keywords, you are defending a modern attack surface with a lint rule. Better to test whether the model can be induced to ignore untrusted instructions altogether, and whether the surrounding application enforces a hard policy before any tool call or retrieval happens.
Break the Tool Layer Before the Model Breaks You
The tool layer is where LLM systems become security incidents. Once the model can call APIs, send email, create tickets, query databases, or execute code, you have moved from text generation to delegated action. That changes the game. The model does not need to be “hacked” in the traditional sense; it only needs to be manipulated into making a legitimate request on the attacker’s behalf.
This is where you test for tool-abuse paths: parameter smuggling, overbroad scopes, missing allowlists, and confused-deputy behavior. If the agent can access Google Drive, Microsoft Graph, or AWS APIs, check whether it can be induced to enumerate resources it should not reveal, or to combine benign actions into a harmful workflow. Also test whether tool outputs are fed back into the model without validation. That loop is how you get self-reinforcing nonsense with an API key attached.
Twilio’s 0ktapus campaign is a useful reminder that attackers do not need elegant exploits when they can abuse the workflow you already trust. SMS phishing hit more than 130 companies because the control plane was human trust plus weak MFA handling, not some cinematic zero-day. LLM agents create a similar problem: if the model can be socially engineered into taking action, your “AI feature” becomes an internal phishing relay with perfect recall.
Validate Outputs Before They Touch Anything Important
Output validation is the part teams skip because it feels unglamorous. That is usually where the compromise lands. If the model produces JSON, SQL, code, email content, or ticket updates, validate structure, content, and intent before downstream systems consume it. Do not assume the model will respect format just because you asked nicely. It is a language model, not a contract.
You want deterministic checks outside the model: schema validation, policy enforcement, allowlisted actions, and human review for high-risk operations. If the model proposes a password reset, a funds transfer, a privilege change, or a production config edit, the system should require explicit confirmation from a trusted control path. Not from the same model that just generated the request. That is not a control. That is a loop.
This is also where you test for data leakage. A model may be perfectly obedient and still leak secrets in its output because the retrieval layer handed it sensitive material. If your red team only checks whether the answer “looks right,” you will miss the quiet exfiltration path. Attackers love quiet paths. They are less noisy than the dashboard.
Build a Repeatable Red Team, Not a Bag of Tricks
A useful methodology is boring in the best way. First, enumerate assets, tools, and trust boundaries. Second, test adversarial prompts against the model and retrieval layer. Third, attempt tool abuse with malicious but plausible inputs. Fourth, validate every output path that can reach a human, API, or database. Then repeat after every prompt change, tool addition, or retrieval source update.
Use real artifacts in testing: poisoned docs, malformed tool arguments, adversarial tickets, and staged secrets with canary values. Measure whether the system blocks, logs, or escalates each attempt. If you cannot reproduce the failure, you cannot fix it, and if you cannot fix it, you are just collecting anecdotes.
The LastPass breach is a decent reminder that the weakest link is often not the shiny surface. A DevOps engineer’s compromised home Plex server became part of the path to encrypted vault access. That is how these things go: one overlooked trust edge, then a long unpleasant week. LLM systems have more trust edges than most teams are prepared to admit.
The Bottom Line
Red team the LLM system in this order: threat model, prompt attacks, tool abuse, output validation. If you start with jailbreaks, you will miss the real failure modes. Treat every tool call as a privileged action and every output as untrusted until validated outside the model.
If you want to find 2026-era attack paths, test retrieval poisoning, cross-turn persistence, and delegated-action abuse with real artifacts. Then wire the results into controls that do not depend on the model behaving itself. That would be a novelty.
References
- CISA Known Exploited Vulnerabilities Catalog: https://www.cisa.gov/known-exploited-vulnerabilities-catalog
- Apache Struts CVE-2017-5638 (NVD): https://nvd.nist.gov/vuln/detail/CVE-2017-5638
- Mandiant on 0ktapus / Twilio phishing campaign: https://www.mandiant.com/resources/blog/0ktapus-phishing-campaign
- LastPass security incident updates: https://blog.lastpass.com/2022/12/notice-of-recent-security-incident/
- OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
Related posts
Kratikal’s warning points to a tougher reality: AI-assisted attackers can now tailor lures, timing, and payloads fast enough to slip through static phishing defenses. The next defense question is whether organizations can combine human verification, adaptive detection, and identity checks before a convincing message turns into a breach.
As more copilots and agents plug into enterprise tools through MCP, the biggest risk is no longer just prompt injection—it’s which servers, scopes, and data sources the model can reach. Practitioners need to understand how MCP allowlists, server attestation, and per-tool permissions can stop a trusted connector from becoming a hidden exfiltration path.
Attackers are no longer just trying to jailbreak a model’s text—they’re targeting the JSON, XML, and function-call formats that modern AI systems trust downstream. Security teams need to understand how structured outputs can silently turn a harmless-looking response into unsafe automation or data leakage.