Why Prompt Injection Defense Needs Runtime AI Policy Enforcement
Prompt filtering alone is failing against indirect injection and tool-abuse attacks in agentic systems. Learn how runtime policy enforcement can block risky actions without breaking legitimate LLM workflows.
Why Prompt Filtering Fails When Your LLM Can Actually Do Things
Citrix Bleed (CVE-2023-4966) was not subtle: one memory-disclosure flaw in NetScaler ADC and Gateway, and attackers could lift session tokens at scale. LockBit and friends did not need a clever social campaign or a novel exploit chain; they needed a reusable foothold, and Citrix handed them one. That is the part people keep missing when they talk about prompt injection like it is a content-moderation problem. If your LLM can call tools, query internal systems, or trigger workflows, the attack surface is no longer the prompt. It is the action.
Prompt filters are still useful, just not in the way most teams hope. They can catch obvious jailbreaks, profanity, and a few tired patterns from red-team slide decks. They do not reliably stop indirect prompt injection, where malicious instructions are hidden in a webpage, ticket, email, PDF, or CRM note that the model ingests as data. Once the model is allowed to interpret that content and act on it, the attacker does not need to “hack the prompt.” They just need to steer the agent into doing something stupid with legitimate permissions. LLMs are very good at being confidently helpful. That is not a compliment.
Indirect Injection Beats Static Filters Because It Hides in Trusted Data
The reason indirect prompt injection keeps working is simple: the malicious instruction is often embedded in content your pipeline already trusts. A support article, a GitHub issue, a Jira ticket, or a pasted email thread can carry instructions that never look like an attack to a keyword filter. Anthropic, OpenAI, and Google have all documented variants of this problem in agentic workflows, and the pattern is consistent: the model treats untrusted text as if it were operational guidance. That is not a “prompt” problem. It is a trust-boundary problem.
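One way to make that trust boundary explicit is to label every piece of context with where it came from, so later controls can tell whether a session has ingested untrusted text. The following is a minimal sketch, not a reference design: the names Trust, ContextItem, and Session are hypothetical, and deciding which sources count as trusted is an assumption your own pipeline has to make.

```python
from dataclasses import dataclass, field
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"      # e.g. system prompt, the authenticated user's own request
    UNTRUSTED = "untrusted"  # e.g. web pages, tickets, emails, PDFs, CRM notes


@dataclass(frozen=True)
class ContextItem:
    text: str
    source: str   # hypothetical label such as "user" or "jira:4821"
    trust: Trust


@dataclass
class Session:
    items: list = field(default_factory=list)

    def add(self, text: str, source: str, trust: Trust) -> None:
        """Record content together with its provenance before the model sees it."""
        self.items.append(ContextItem(text, source, trust))

    @property
    def tainted(self) -> bool:
        """True once any untrusted content has entered the session."""
        return any(item.trust is Trust.UNTRUSTED for item in self.items)

    def untrusted_sources(self) -> list:
        """List the sources that crossed the trust boundary, for audit or policy use."""
        return [item.source for item in self.items if item.trust is Trust.UNTRUSTED]
```

The label does not stop the model from reading the ticket. It only gives downstream policy a fact to act on: this session now contains text an attacker could have written.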
You have seen this movie before. T-Mobile’s repeated breaches from 2021 through 2023 were not a single elegant exploit; they were a series of incidents involving exposed interfaces, credential attacks, and API weaknesses that kept giving attackers an opening. Security failures at scale tend to look boring in hindsight. The same applies here. If your LLM can read a ticket and then open a browser, send mail, or query a database, an attacker only needs one contaminated input to turn a passive assistant into an active accomplice. The model does not need to be malicious. It just needs to be gullible.
Tool Abuse Is the Real Payload, Not the Prompt Text
The first serious mistake is assuming the model’s output is the risk. It is not. The risk is what happens after the model decides to call send_email, create_ticket, run_sql, download_file, or approve_request. Once those tools exist, the model becomes a de facto policy decision point whether you wanted one or not. And if you do not define that policy outside the model, you are asking the attacker to respect the very system they are trying to manipulate. That has not worked out well in any other part of security.
Uber’s 2022 intrusion is a useful reminder that social engineering beats elegant controls more often than people admit. The attacker used MFA fatigue and Slack access to move into internal systems and source code. The lesson is not “train users harder.” The lesson is that once an attacker gets a trusted execution path, they will use it. Agentic LLMs create exactly that kind of path, except now the “user” is a model making decisions at machine speed. If you let it act without runtime checks, you have built a very polite insider threat.
Runtime Policy Enforcement Is the Control You Actually Need
Runtime AI policy enforcement means evaluating each proposed action against explicit rules before the action executes. Not after. Not in logs. Before. That policy should inspect the model’s intent, the tool being invoked, the data involved, the user context, and the current session state. If the model wants to send an external email containing customer data, the policy should know whether that is allowed for this user, in this workflow, at this time, and with this destination. If the model wants to pull secrets from a vault or query a production database, the policy should enforce least privilege even when the model sounds very sure of itself.
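As a rough illustration of "before, not after," the sketch below evaluates a proposed tool call against explicit rules and returns a decision before anything executes. Everything in it is an assumption made for the example: the ProposedAction fields, the example.com internal domain, the "dba" role, and the contains_customer_data flag, which is assumed to come from an upstream classifier rather than from the model itself.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_APPROVAL = "require_approval"


@dataclass(frozen=True)
class ProposedAction:
    tool: str              # tool the model wants to call
    args: dict             # arguments the model supplied
    user_role: str         # role of the human the agent is acting for
    workflow: str          # workflow this session belongs to
    session_tainted: bool  # has the session ingested untrusted content?


INTERNAL_EMAIL_DOMAIN = "example.com"  # hypothetical internal mail domain


def evaluate(action: ProposedAction) -> Decision:
    """Decide on a proposed tool call before it runs. Unknown tools are denied."""
    if action.tool == "send_email":
        recipient = str(action.args.get("to", "")).lower()
        external = not recipient.endswith("@" + INTERNAL_EMAIL_DOMAIN)
        # Assumed flag set by a DLP-style classifier, never by the model.
        if external and action.args.get("contains_customer_data", False):
            return Decision.DENY
        if external and action.session_tainted:
            return Decision.REQUIRE_APPROVAL
        return Decision.ALLOW

    if action.tool == "run_sql":
        if action.args.get("database") == "production" and action.user_role != "dba":
            return Decision.DENY
        statement = str(action.args.get("statement", "")).strip().lower()
        # Crude read-only check for illustration; a real system would parse the SQL.
        if not statement.startswith("select"):
            return Decision.REQUIRE_APPROVAL
        return Decision.ALLOW

    return Decision.DENY  # default deny: no rule, no execution
```

The design choice that matters here is the last line. A tool the policy has never heard of is denied, so adding a new capability forces someone to write the rule for it.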
This is not theoretical. The same design principle already exists in mature security stacks: EDR blocks suspicious process behavior at runtime, CASB and DLP enforce data controls in motion, and identity systems do conditional access based on context. You would not let every process on a host open raw sockets because it “usually” needs network access. Yet people happily let an LLM with a browser and a few API keys wander around internal systems because the demo worked. That is how you end up with an expensive autocomplete that can exfiltrate data.
What Good Enforcement Looks Like in Practice
A useful policy layer is not a giant list of banned words. It is a decision engine with narrow, testable rules. For example: deny any action that transfers sensitive data to a domain not on an allowlist; require human approval for privilege escalation, external sharing, or destructive operations; block tool calls when the model’s confidence is low and the requested action is high impact; and redact or tokenize sensitive fields before they ever reach the model. If the model cannot see the crown jewels, it has a harder time dropping them on the floor.
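Two of those rules are small enough to sketch directly: a destination allowlist check and tokenization of sensitive fields before text reaches the model. This is an illustrative sketch under stated assumptions. The allowlisted domains are placeholders, TokenVault is a hypothetical helper, and the email regex is a deliberately simple stand-in for real sensitive-data detection.

```python
import re
import secrets
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "partner.example.net"}  # hypothetical allowlist


def destination_allowed(url: str) -> bool:
    """Allow only hosts that equal an allowlisted domain or are a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


# Simplified pattern for illustration; it will miss some valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TokenVault:
    """Swaps email addresses for opaque tokens so the model never sees the raw value."""

    def __init__(self) -> None:
        self._values = {}  # token -> original value, kept outside the model

    def tokenize(self, text: str) -> str:
        def replace(match):
            token = f"<EMAIL_{secrets.token_hex(4)}>"
            self._values[token] = match.group(0)
            return token

        return EMAIL_RE.sub(replace, text)

    def detokenize(self, text: str) -> str:
        """Restore original values, e.g. in an approved action's final arguments."""
        for token, value in self._values.items():
            text = text.replace(token, value)
        return text
```

Matching on the parsed hostname rather than on a substring is the point of the first function: a URL such as https://example.com.evil.io/upload contains an allowlisted name but does not resolve to an allowlisted host, and the check rejects it.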
You also need per-tool scoping. A support assistant does not need the same permissions as a code assistant or a finance workflow agent. That sounds obvious until you inspect real deployments and find one shared service account with enough access to make an incident responder sigh into their coffee. Separate credentials, separate policies, separate audit trails. If the model can impersonate a human, you have already lost the plot.
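Per-tool scoping can be expressed as plain data that is easy to review and test. The sketch below assumes three hypothetical agents and made-up credential and audit-stream identifiers; the tool names are borrowed from the examples earlier in this post, and the split between agents is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    credential_id: str        # reference to a dedicated service credential, not the secret
    allowed_tools: frozenset  # the only tools this agent may call
    audit_stream: str         # where this agent's decisions are logged


# Hypothetical scopes: each agent gets its own credential, tool set, and audit trail.
SCOPES = {
    "support_assistant": AgentScope(
        "svc-support", frozenset({"create_ticket", "send_email"}), "audit.support"
    ),
    "code_assistant": AgentScope(
        "svc-code", frozenset({"download_file"}), "audit.code"
    ),
    "finance_agent": AgentScope(
        "svc-finance", frozenset({"approve_request", "run_sql"}), "audit.finance"
    ),
}


def tool_in_scope(agent: str, tool: str) -> bool:
    """An agent may call a tool only if it is explicitly listed in that agent's scope."""
    scope = SCOPES.get(agent)
    return scope is not None and tool in scope.allowed_tools


def shared_credentials() -> set:
    """Return credential ids used by more than one agent; the goal is an empty set."""
    seen, shared = set(), set()
    for scope in SCOPES.values():
        if scope.credential_id in seen:
            shared.add(scope.credential_id)
        seen.add(scope.credential_id)
    return shared
```

A check like shared_credentials is cheap to run in CI, which makes the "one shared service account" problem something a test fails on rather than something an incident responder discovers.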
Why “Just Filter the Prompt” Is the Wrong Hill to Die On
Here is the contrarian bit: prompt filtering is not your primary defense, and treating it as such can make you less secure. Over-filtering legitimate user input often pushes teams toward brittle exceptions and shadow workflows, where people paste data into unmanaged tools because the sanctioned assistant keeps rejecting normal work. That is how security theater metastasizes into actual exposure. A control that annoys everyone and stops nothing is just a tax with a dashboard.
The better approach is layered: sanitize inputs, isolate untrusted content, constrain tools, and enforce policy at runtime. If you want the model to browse, let it browse inside a sandbox with no direct path to secrets. If you want it to draft actions, require a policy engine to approve the action before execution. If you want it to summarize sensitive material, keep the sensitive material out of the model’s full context unless you have a very good reason and a very tight boundary. “Trust but verify” is a cute slogan. In security, verify first.
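The "approve before execution" step can be enforced structurally by making the policy check the only route to a tool. Here is a minimal sketch, assuming a hypothetical GuardedTools wrapper and a policy expressed as a simple callable; a real deployment would plug in a richer policy engine and a proper audit sink.

```python
from typing import Any, Callable

# A policy is any callable that takes (tool name, arguments) and returns True to allow.
Policy = Callable[[str, dict], bool]


class PolicyViolation(Exception):
    """Raised when a proposed tool call is unknown or rejected by policy."""


class GuardedTools:
    """Registry in which every tool call passes a policy check before executing."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy
        self._tools = {}
        self.audit_log = []  # (tool name, outcome) pairs, recorded before execution

    def register(self, name: str, func: Callable[..., Any]) -> None:
        self._tools[name] = func

    def call(self, name: str, /, **args: Any) -> Any:
        if name not in self._tools:
            self.audit_log.append((name, "unknown_tool"))
            raise PolicyViolation(f"unknown tool: {name}")
        if not self._policy(name, args):
            self.audit_log.append((name, "denied"))
            raise PolicyViolation(f"policy denied: {name}")
        self.audit_log.append((name, "allowed"))
        return self._tools[name](**args)  # runs only after the policy said yes


def read_only_policy(tool: str, args: dict) -> bool:
    """Example policy: permit a hypothetical read-only tool and nothing else."""
    return tool in {"search_docs"}
```

With this shape, the agent loop is handed a GuardedTools instance and never the underlying functions, so a denied send_email raises PolicyViolation and the email function is never invoked.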
The Bottom Line
If your LLM can take actions, prompt filtering alone is not a defense; it is a speed bump. Put a runtime policy layer in front of every tool call, and make it enforce least privilege, destination allowlists, and human approval for high-impact actions.
Start by separating read-only assistants from action-capable agents, then scope credentials per workflow and redact sensitive data before model ingestion. If you cannot explain why a given tool call is allowed without hand-waving, the policy is not done.
References
- https://www.cisa.gov/news-events/cybersecurity-advisories/aa23-320a
- https://nvd.nist.gov/vuln/detail/CVE-2023-4966
- https://www.cisa.gov/news-events/cybersecurity-advisories/aa22-074a
- https://www.anthropic.com/news/indirect-prompt-injection
- https://openai.com/index/preventing-prompt-injection/