
Guardrailing RAG in 2026: Why Prompt Firewalls Aren’t Enough

Attackers are moving past simple prompt injection and exploiting retrieval, tool calls, and memory to steer LLM apps. This post shows why AI security teams now need retrieval-level controls, policy checks, and continuous red-teaming to keep RAG systems safe.

CVE-2023-34362 Wasn’t About File Transfer — It Was About Trusting the Wrong Layer

MOVEit Transfer CVE-2023-34362 was a SQL injection bug, but the real lesson was uglier: once Cl0p got in, the file-transfer server became a data-exfiltration machine at scale. Public tallies eventually put the victim count above 2,500 organizations, and the blast radius had nothing to do with “bad passwords” or user mistakes. It was a reminder that when a system is built to accept and route high-trust inputs, one weak parsing layer can turn into a breach factory. RAG systems are heading down the same road, just with nicer demos.

Most teams still think “prompt injection” is the problem. That’s the toy version. The real issue is that your LLM app is now making decisions across multiple trust boundaries: user input, retrieved content, tool calls, memory, and whatever glue code you wrote at 2 a.m. because the vendor SDK was “almost enough.” If an attacker can poison retrieval, they do not need to win the chat. They only need to steer what gets retrieved, what gets ranked, and what the model treats as authoritative. That’s a very different attack surface.

Retrieval Is the New User Input, and You’re Probably Not Filtering It

If your RAG pipeline ingests SharePoint, Confluence, Slack exports, Google Drive, or ticketing data, you already know the problem: the model is not reading “documents,” it is reading potentially attacker-controlled text with your branding on it. Since 2023, researchers have repeatedly shown that indirect prompt injection works because models do not reliably distinguish instructions from content when the content is embedded in retrieved context. That is not a bug in one model. It is a structural weakness in the way you’re assembling context windows.

A lot of teams respond by stripping obvious phrases like “ignore previous instructions.” Cute. Attackers do not need those words. They can hide instructions in HTML comments, PDF metadata, markdown tables, or innocuous-looking support articles that rank well because your retriever likes lexical overlap. If your retrieval layer does not score for provenance, recency, and document trust level, you are basically letting the loudest document in the room run the meeting.
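One way to picture retrieval-time scoring is a reranker that weighs similarity against provenance and recency. The sketch below is illustrative only: the `SOURCE_TRUST` table, the `Chunk` fields, the 180-day half-life, and the `min_trust` cutoff are all assumptions, not values from any particular retriever or vector store.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical trust levels per ingestion source; real values would come
# from your own data governance, not from this table.
SOURCE_TRUST = {"policy_wiki": 1.0, "ticket_comment": 0.4, "public_upload": 0.1}


@dataclass
class Chunk:
    text: str
    source: str           # provenance: where the chunk was ingested from
    updated_at: datetime  # last modification time of the source document
    similarity: float     # retriever score, assumed to be in [0, 1]


def trust_adjusted_score(chunk: Chunk, now: datetime, half_life_days: float = 180.0) -> float:
    """Scale similarity by source trust and an exponential recency decay."""
    trust = SOURCE_TRUST.get(chunk.source, 0.0)  # unknown provenance gets zero trust
    age_days = max((now - chunk.updated_at).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    # Recency only modulates the score; trust can zero it out entirely.
    return chunk.similarity * trust * (0.5 + 0.5 * recency)


def rerank(chunks: list, now: datetime, min_trust: float = 0.3) -> list:
    """Drop chunks below the trust floor, then sort by the adjusted score."""
    eligible = [c for c in chunks if SOURCE_TRUST.get(c.source, 0.0) >= min_trust]
    return sorted(eligible, key=lambda c: trust_adjusted_score(c, now), reverse=True)
```

The point of the hard `min_trust` floor is that a low-trust document cannot buy its way into the context window with lexical overlap alone, no matter how high its similarity score.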

Prompt Firewalls Help, Then Get Walked Around

Prompt firewalls are useful in the same way a door lock is useful: better than nothing, useless if the attacker is already inside through the side window. They can catch obvious jailbreak patterns, but they do very little against indirect injection buried in retrieved content or tool outputs. OpenAI, Anthropic, and Microsoft all have guardrail guidance now, which is fine, except guidance is not enforcement and a policy doc is not a control.

The contrarian point: stop over-investing in “refusing bad prompts” as your primary defense. The model is not the only place where policy needs to live. You need retrieval-time controls, tool-level authorization, and output checks that understand the action being proposed. If a model is about to summarize a contract, fine. If it is about to send an email, create a Jira ticket, or query a production database, you need an independent policy check in the loop, one that runs outside the model. Put plainly: the model should not be allowed to freestyle its way into your admin plane.

Tool Calls Turn LLM Apps Into Thinly Disguised Automation

The moment you wire an LLM to tools, you’ve created an execution path. That means attackers can aim for the action layer instead of the text layer. Microsoft Copilot integrations, Slack bots, and internal agents that can call APIs are especially exposed because the model often has more reach than the user who triggered it. This is where “helpful” becomes “expensive.”

You should be validating tool intent, not just tool syntax. A request to “look up the latest customer issue” is not the same as a request to export every ticket from Zendesk, even if both are technically valid API calls. Enforce least privilege at the tool level, scope tokens to narrow actions, and require policy checks before any call that crosses a boundary: external email, file access, secrets lookup, or database write. If you let the model decide whether a tool call is “reasonable,” you’ve outsourced authorization to a stochastic parrot with a JSON encoder.
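A policy gate that sits between the model and the tool runtime might look something like the sketch below. Everything here is hypothetical: the tool names, the scope strings, the `POLICY` table, and the `human_approved` flag are placeholders for whatever your own authorization system provides.

```python
from dataclasses import dataclass, field

# Hypothetical policy table: required scope, whether the call crosses a
# sensitive boundary, and an optional cap on how much data it may return.
POLICY = {
    "tickets.search": {"scope": "tickets:read", "sensitive": False, "max_results": 20},
    "tickets.export": {"scope": "tickets:export", "sensitive": True, "max_results": None},
    "email.send": {"scope": "email:send", "sensitive": True, "max_results": None},
}


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


def authorize(call: ToolCall, user_scopes: set, human_approved: bool = False):
    """Decide outside the model whether a proposed tool call may run."""
    rule = POLICY.get(call.tool)
    if rule is None:
        return False, "tool not on allowlist"
    # Authorize against the triggering user's scopes, not the agent's.
    if rule["scope"] not in user_scopes:
        return False, "user lacks scope " + rule["scope"]
    limit = rule["max_results"]
    if limit is not None and call.args.get("limit", limit) > limit:
        return False, "requested limit exceeds cap"
    if rule["sensitive"] and not human_approved:
        return False, "sensitive action requires approval"
    return True, "ok"
```

Under these assumptions, a narrow lookup passes while a bulk export by the same user is denied, even though both are syntactically valid calls:

```python
authorize(ToolCall("tickets.search", {"limit": 5}), {"tickets:read"})  # (True, "ok")
authorize(ToolCall("tickets.export"), {"tickets:read"})  # (False, "user lacks scope tickets:export")
```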

Memory Persists the Poison Long After the Chat Ends

Persistent memory is where a lot of teams quietly create the worst possible failure mode: durable influence. A single malicious interaction can seed preferences, facts, or workflow assumptions that survive into later sessions. That turns a one-shot injection into a long-tail compromise. The problem gets worse when memory is shared across users or used to personalize retrieval, because now you have contamination at the system level, not just the conversation level.

Treat memory like a write path, not a note-taking feature. You need explicit provenance, expiry, and a way to revoke bad state. If a memory item cannot be traced back to a source and a user action, it should not be trusted. And no, “the model decided it was important” is not provenance. That’s how you end up debugging a ghost that lives in your vector store.
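Treating memory as a write path could be sketched roughly as follows. The `MemoryStore` class, its field names, and the 30-day default TTL are assumptions made for illustration; a real system would back this with its own database or vector store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MemoryItem:
    content: str
    source_id: str        # message or document the item was derived from
    user_id: str          # user whose explicit action created it
    expires_at: datetime
    revoked: bool = False


class MemoryStore:
    """In-memory stand-in for a persistent store (hypothetical)."""

    def __init__(self):
        self._items = []

    def write(self, content, source_id, user_id, now, ttl=timedelta(days=30)):
        # Refuse writes that cannot be traced to a source and a user action.
        if not source_id or not user_id:
            raise ValueError("memory writes require a source and a user action")
        item = MemoryItem(content, source_id, user_id, now + ttl)
        self._items.append(item)
        return item

    def read(self, user_id, now):
        # Memory is scoped per user; expired or revoked items never surface.
        return [
            i for i in self._items
            if i.user_id == user_id and not i.revoked and i.expires_at > now
        ]

    def revoke_source(self, source_id):
        # Invalidate everything derived from a source later found to be malicious.
        count = 0
        for i in self._items:
            if i.source_id == source_id and not i.revoked:
                i.revoked = True
                count += 1
        return count
```

Keying revocation on `source_id` is the design choice that matters here: once a document or conversation is flagged as poisoned, every memory item derived from it can be pulled in one operation.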

Red-Team the Pipeline, Not Just the Prompt

If you are only testing jailbreak prompts, you are red-teaming the wrong layer. Modern LLM app testing needs adversarial retrieval poisoning, tool abuse, memory corruption, and cross-document instruction conflicts. That means building test cases where the malicious payload is not in the user prompt at all, but in a retrieved PDF, a wiki page, a ticket comment, or a tool response. Attackers have already figured out that the shortest path to control is often the least glamorous one.

The useful metric is not “did the model refuse?” It is “did the system preserve policy under adversarial context?” That includes whether the retriever surfaced the malicious document, whether the ranker elevated it, whether the policy engine blocked the tool call, and whether the audit trail tells you what happened after the fact. If you cannot reconstruct that chain, you do not have security telemetry. You have vibes, and vibes do not survive incident review.
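The per-stage questions above can be turned into a simple scoring function over a recorded pipeline trace. This is a minimal sketch: the `Trace` shape, the field names, and the choice of "top 3" as the definition of elevation are assumptions, and your own pipeline would need to emit equivalent telemetry for it to be useful.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    retrieved_ids: list        # documents the retriever surfaced
    ranked_ids: list           # documents placed in the context, best first
    tool_calls_blocked: list   # tool names the policy engine denied
    tool_calls_executed: list  # tool names that actually ran
    audit_events: int          # number of audit records written for the run


def evaluate(trace: Trace, poisoned_id: str, forbidden_tool: str, top_k: int = 3) -> dict:
    """Report how each pipeline stage handled one poisoned-document test case."""
    return {
        "retriever_surfaced": poisoned_id in trace.retrieved_ids,
        "ranker_elevated": poisoned_id in trace.ranked_ids[:top_k],
        "tool_call_blocked": forbidden_tool in trace.tool_calls_blocked,
        "policy_preserved": forbidden_tool not in trace.tool_calls_executed,
        "audit_trail_present": trace.audit_events > 0,
    }
```

A run where the poisoned document is retrieved and ranked highly can still pass, as long as `policy_preserved` and `audit_trail_present` hold. Tracking the stages separately shows which layer actually stopped the attack and which ones merely got lucky.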

The Bottom Line

Put controls at retrieval, tool execution, and memory persistence, not just at the chat boundary. Require provenance and trust scoring for retrieved content, and block any tool call that crosses a sensitive boundary without policy enforcement outside the model.

Then red-team the full RAG pipeline continuously with poisoned documents, malicious tool outputs, and memory abuse cases. If your test plan still starts and ends with “ignore previous instructions,” you are testing a door while the window is wide open.

References

  • CISA Alert on MOVEit Transfer Vulnerability CVE-2023-34362: https://www.cisa.gov/news-events/alerts/2023/06/01/critical-sql-injection-vulnerability-progress-moveit-transfer
  • Progress Software MOVEit Transfer Security Advisory: https://www.progress.com/security/moveit-transfer-cve-2023-34362
  • OWASP Top 10 for LLM Applications: https://owasp.org/www-project-top-10-for-large-language-model-applications/
  • Microsoft Copilot security and data protection guidance: https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-security
  • Anthropic prompt injection and tool use guidance: https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
