·6 min read

LLM Red Teaming Is Shifting Toward Multi-Turn Jailbreaks

Static prompt filters catch obvious attacks, but newer jailbreaks chain roleplay, context poisoning, and tool abuse across several turns to slip past them. Security teams now need red-team tests that measure how models behave over an entire conversation, not just one prompt.

CVE-2023-22515 was a boring-looking Atlassian Confluence access-control bug that let attackers create admin accounts. That is the kind of flaw people dismiss right up until it becomes a zero-day and hands over the keys to the castle. The real lesson was not “patch faster,” though you should. It was that a single request is often less dangerous than a sequence of small, valid-looking actions that only become malicious in aggregate.

LLM jailbreaks are following the same pattern. Static prompt filters catch the theatrical stuff: “ignore previous instructions,” obvious policy evasion, and the five-line nonsense people post on social media to feel clever. The real attacks are multi-turn. They start with harmless roleplay, drift into context poisoning, then abuse tools, memory, or session state until the model is doing exactly what the attacker wanted, just not in one obvious burst. That is not a prompt problem. That is an interaction problem.

Multi-turn jailbreaks are the real attack path

A useful way to think about recent LLM red-team findings is as an incident timeline, not a single exploit. Turn one sets context: the attacker asks the model to act as a helpdesk agent, a code reviewer, or a compliance assistant. Turn two introduces a “document” or “policy excerpt” that quietly contains malicious instructions, often in markdown, HTML comments, or quoted text. Turn three asks the model to summarize, transform, or execute something using a connected tool such as Slack, Gmail, GitHub, Jira, or a browser plugin. By turn four, the model has accepted poisoned context as legitimate state and is calling tools with attacker-shaped intent.

That pattern has shown up in public testing of systems built on OpenAI, Anthropic Claude, and Google Gemini, especially when the model has access to retrieval, memory, or external actions. Researchers have demonstrated prompt injection against browsing and agent workflows for years, but the newer twist is persistence across turns. A malicious instruction does not need to win immediately; it only needs to survive long enough to be reintroduced as “context.” That is the same trick supply-chain attackers used in SolarWinds SUNBURST: compromise the trusted pipeline, then let the downstream system do the rest. Different layer, same unpleasant logic.

A practical operator scenario looks like this. You connect an internal assistant to Confluence, Slack, and a ticketing system. The attacker uploads a normal-looking meeting note to Confluence that includes hidden instructions telling the model to prioritize later messages from “security leadership.” Then they open a chat, ask the assistant to summarize the page, and nudge it toward pulling related tickets. The model now has a poisoned memory of what matters, and if your tool permissions are sloppy, it may expose ticket contents or generate actions the user never explicitly requested. The attack is not the first prompt. It is the conversation.

Why single-turn defenses keep failing

The defensive gap is simple: most LLM controls are still single-turn controls. You can filter a prompt, but you rarely score the entire dialogue for intent drift, injected authority, or tool abuse. That is a bad fit for systems where the attacker can spend five turns building trust and only then trigger the payload. Static classifiers are useful, but they are not a substitute for stateful policy enforcement. A lock on the front door does not help if the attacker gets issued a badge inside.

The deeper flaw is that many AI integrations treat identity as an afterthought. That is a mistake you already know from breach work: the real attack surface is credentials, tokens, sessions, and delegated permissions. If a model can read a mailbox, query a database, or create a Jira ticket, then the model is effectively a privileged service principal. If you would not hand that principal broad OAuth scopes and no audit trail, do not hand them to a chatbot because it sounds friendly. The model is not trustworthy; it is merely well-spoken.

There is also a supply-chain angle most teams miss. The model does not need to be compromised in the abstract. It only needs untrusted content from your own documents, web pages, support tickets, or code review comments. If your threat model does not include your own content pipeline, it is not a threat model; it is a wish. NotPetya taught the world that trusted software distribution can become a weapon. LLMs extend that lesson into content pipelines, where a poisoned paragraph can be as useful as a poisoned package.

What to do about it

Red-team the conversation, not the prompt. Your tests should measure whether a model resists a sequence of benign-looking turns that accumulate into an unsafe action. Include roleplay escalation, context poisoning, indirect prompt injection from retrieved documents, and tool-abuse chains. If your test harness only fires one prompt and checks one response, you are evaluating a postcard, not a conversation. Use repeatable scenarios with known payloads, then score the model on state retention, instruction hierarchy, and whether it refuses unsafe tool calls after prior turns have shifted context.

Harden the plumbing around the model. Put least privilege on every tool and API key the assistant can reach. Use per-action authorization for high-risk operations like sending email, exporting data, or modifying tickets. Segment network access so the model cannot freely browse internal systems just because a user asked nicely. Log every tool invocation, prompt, retrieval result, and session identifier, because audit logs are still one of the few boring controls that actually work. If you cannot reconstruct what the model saw and did, you are not defending it; you are narrating after the fact.

Treat memory and retrieval as untrusted input. Strip instructions from retrieved documents before they reach the system prompt, isolate user content from policy content, and mark provenance so the model knows what came from a user, a document, or a trusted policy source. Use allowlists for tool actions instead of hoping the model will “do the right thing.” And yes, review your vendor posture too. If your AI stack depends on a third-party connector, plugin, or model gateway, you have a supply-chain problem whether the sales deck says so or not.

Bottom line

Multi-turn jailbreaks are not a cleverer prompt; they are a better attack path. They work because the model, the tools, and the surrounding session are treated as if trust were a binary switch. It is not. It is state, and state can be poisoned one turn at a time.

If you are serious about AI security, stop asking whether a single prompt is blocked. Ask whether the assistant can be manipulated over five turns into leaking data, abusing a tool, or violating policy while never tripping a simple filter. Then test that path with real tools, real retrieval, and real permissions. If the model can be talked into doing something stupid, assume someone will eventually do the talking.

Related posts

AI Vulnerability Management Needs an Exposure Map, Not Another Scanner

The latest AI security warnings suggest the real problem isn’t finding one more model flaw—it’s tracking how model endpoints, plugins, vectors, and agent permissions compound into a breach path. Security teams that can map and prioritize that exposure may be the only ones ready when the next AI bug becomes an incident.

Prompt Injection Defenses Are Shifting to Context-Aware AI Gateways

Security teams are realizing that static filters fail when attackers hide instructions inside files, emails, and retrieved documents. The emerging approach is to inspect model inputs, tool calls, and retrieved context together so an agent can refuse malicious instructions before they trigger action.

AI Security GRC Is Getting Automated Through Policy-as-Code

Security teams are starting to encode AI-use rules, model approval gates, and logging requirements directly into infrastructure and workflow controls instead of relying on PDF policies. The practical question is whether policy-as-code can keep shadow AI, misconfigured agents, and risky model rollouts from slipping through review.

← All posts