
Securing RAG Pipelines: Poisoned Vectors, Prompt Injection, Exfiltration

A single malicious document in your vector store can steer answers, leak hidden instructions, or even exfiltrate sensitive data through a carefully crafted query. This post breaks down where RAG breaks first—and the concrete controls that stop poisoned retrieval, indirect prompt injection, and unauthorized data leakage.

A Single Poisoned Document Can Own Your RAG Answers

When Bing Chat’s hidden “Sydney” system prompt was extracted through prompt injection in early 2023, the lesson was not “LLMs are magic.” It was that language-model systems happily ingest text they do not trust, then treat it as if it were policy. RAG makes that mistake at scale: one malicious PDF in a vector store can steer answers, smuggle instructions past your system prompt, and turn a “search” query into a data-exfiltration primitive if you let the model see more than it should.

The failure mode is boringly old-school. You index content from SharePoint, Confluence, Google Drive, Slack exports, Zendesk tickets, or a blob bucket full of customer uploads. Someone drops in a document with a title that looks harmless — “Quarterly Benefits Update,” “API migration notes,” “Invoice_2024_Q3.pdf” — but the body contains instructions for the model, not the human. If your retriever ranks that chunk highly, the model does what models do: it follows the most recent, most explicit instruction it can see, especially when your orchestration layer has no notion of trust tiers.

Poisoned Retrieval Starts Before the Embedding Model

Most teams obsess over embeddings as if the vector itself were the attack surface. It isn’t. The attack starts when you decide that every chunk from every source gets the same ingestion path, the same chunking rules, and the same index. That is how a poisoned document in a low-trust source gets the same retrieval privileges as an HR policy or an internal runbook.

The practical problem is that semantic similarity is not a trust signal. A malicious chunk only needs to be “about” the user’s query to get retrieved. If your corpora include support tickets, customer-generated files, or any source that can be influenced by an outsider, assume the attacker can plant terms that anchor on common queries: “reset,” “invoice,” “SSO,” “API key,” “refund,” “export,” “admin.” This is not theoretical; prompt injection against retrieval-augmented systems has been demonstrated repeatedly in research and public writeups since 2023, and the mechanics are embarrassingly simple.

The control that actually matters is source segregation. Separate indexes by trust level, then enforce retrieval policy before ranking: internal-only corpora, user-uploaded corpora, and internet-scraped corpora should not compete in the same candidate set. If you insist on a single index, tag every chunk with provenance and deny retrieval of anything that is not explicitly allowed for the requesting principal. “We’ll just add a system prompt” is not a control; it is a hope with a budget.
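As a minimal sketch of what a deny-by-default, provenance-gated candidate filter might look like (the `TrustTier` names and `allowed_tiers` parameter are illustrative, not from any particular framework):

```python
from dataclasses import dataclass
from enum import IntEnum

# Illustrative trust tiers; a real deployment defines its own hierarchy.
class TrustTier(IntEnum):
    INTERNET = 0       # scraped or otherwise attacker-influenceable
    USER_UPLOAD = 1    # customer- or tenant-provided content
    INTERNAL = 2       # vetted internal corpora (runbooks, policies)

@dataclass(frozen=True)
class Chunk:
    text: str
    source: str
    tier: TrustTier

def filter_candidates(candidates: list[Chunk],
                      allowed_tiers: set[TrustTier]) -> list[Chunk]:
    """Deny by default: drop any candidate whose tier is not explicitly
    allowed for the requesting principal, before ranking ever sees it."""
    return [c for c in candidates if c.tier in allowed_tiers]

# Example: an internal HR query should never compete with user uploads.
pool = [
    Chunk("Benefits policy...", "confluence", TrustTier.INTERNAL),
    Chunk("Ignore previous instructions...", "uploads", TrustTier.USER_UPLOAD),
]
safe = filter_candidates(pool, allowed_tiers={TrustTier.INTERNAL})
```

The point of filtering before ranking is that a poisoned chunk never gets the chance to win on similarity alone.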

Indirect Prompt Injection Works Because Your Model Reads Untrusted Text as Instructions

Indirect prompt injection is not a jailbreak in the movie sense. It is a document that tells the model to ignore prior instructions, summarize hidden content, reveal chain-of-thought, or call tools with attacker-chosen parameters. Researchers have shown this against browsing agents, email assistants, and RAG chatbots because the model has no native boundary between “data” and “instructions” once both are pasted into context.

The ugly part is that the payload does not need to look like a payload. A malicious chunk can be split across paragraphs, hidden in tables, or embedded in markdown links and HTML comments. It can instruct the model to “helpfully” quote a nearby secret, or to ask a downstream tool for “the latest customer export” and then include the result in the answer. If your tool layer lets the model choose arbitrary URLs, file paths, SQL, or ticket IDs, you have built a confused deputy and given it a chat interface.

The fix is not “better prompting.” It is hard separation of roles. Retrieval content should be treated as untrusted input, never as instruction. Strip or neutralize markup, block executable link schemes, and run a classifier or ruleset that flags instruction-like text in retrieved chunks before they reach the model. More importantly, constrain tools so the model cannot turn a retrieval hit into a read-anything oracle. If the assistant can fetch files, it should only fetch from a pre-approved allowlist and only after policy checks outside the model.
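A rough sketch of that neutralization step, assuming a simple regex ruleset (a production system would pair rules like these with a trained classifier; the patterns below are examples, not a complete list):

```python
import re

# Illustrative patterns for instruction-like text in retrieved chunks.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"system prompt|chain[- ]of[- ]thought", re.I),
    re.compile(r"\byou are\b.{0,40}\b(assistant|model)\b", re.I),
]
BLOCKED_SCHEMES = re.compile(r"\b(javascript|data|file):", re.I)

def neutralize(chunk_text: str) -> str:
    """Strip markup that can hide payloads: HTML comments and
    executable link schemes."""
    text = re.sub(r"<!--.*?-->", "", chunk_text, flags=re.S)
    return BLOCKED_SCHEMES.sub("blocked:", text)

def looks_like_instructions(chunk_text: str) -> bool:
    """Flag chunks whose text addresses the model rather than the reader."""
    return any(p.search(chunk_text) for p in INSTRUCTION_PATTERNS)
```

Flagged chunks can be dropped, quarantined for review, or passed through with their instruction-bearing spans redacted; what matters is that the decision happens outside the model.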

Exfiltration Usually Happens Through the Tool Layer, Not the Token Stream

A lot of teams worry about the model “leaking secrets” in its answer. That can happen, but the more reliable leak is the tool call. If a retrieved chunk can convince the model to query a database, call an internal API, or fetch a document with broader permissions than the user has, the exfiltration happens one layer down and looks like normal application behavior.

This is where least privilege gets ignored in the most expensive way possible. The RAG service account often has read access to everything because “it needs context.” That includes HR folders, incident tickets, customer PII, and code repositories that nobody wanted in the same blast radius. If the model can retrieve a chunk that says “summarize the last 500 rows from the payroll export,” and the backend dutifully obeys, you have just built a cross-domain data bridge with a cheerful UI.
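One way to break that bridge is to evaluate every tool call against the end user's permissions rather than the service account's. A minimal sketch, where `user_can_read` is a hypothetical stand-in for your real IAM or ACL backend:

```python
def user_can_read(user_id: str, resource: str) -> bool:
    """Hypothetical grants table; in practice, call your authorization service."""
    grants = {
        "alice": {"kb/runbooks"},
        "bob": {"kb/runbooks", "hr/payroll"},
    }
    return resource in grants.get(user_id, set())

def execute_tool_call(user_id: str, tool: str, resource: str) -> str:
    """Deny the call unless the *end user* holds the permission the tool
    needs. The model never gets to borrow the service account's broad
    read access, no matter what a retrieved chunk told it to do."""
    if not user_can_read(user_id, resource):
        raise PermissionError(f"{user_id} may not read {resource} via {tool}")
    return f"ok: {tool}({resource})"
```

The design choice here is that the policy check lives in the tool layer, so a poisoned chunk can at most generate a denied request, not a cross-domain read.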

A contrarian point: retrieval filters alone are not enough. People love to say “just do metadata ACLs.” Fine, but if your chunking process strips the metadata, or your retriever re-ranks on semantic similarity before applying authorization, the horse is already out. Authorization has to happen before candidate generation or at least before context assembly, and it has to be evaluated per user, per document, per chunk. Anything less is theater.

The Controls That Actually Hold Up Under Adversarial Documents

Start with provenance. Every chunk should carry immutable source metadata: system of origin, owner, ingestion time, trust tier, and access policy. If you cannot answer where a chunk came from and who is allowed to see it, it should not be in the index. For user uploads, isolate them by tenant and session; do not commingle customer-provided content with internal knowledge unless you enjoy incident response.

Then add retrieval hardening. Use hybrid retrieval with lexical and vector signals, but cap the number of low-trust chunks that can enter context. Require a minimum provenance score for anything that can influence tool use. Apply deduplication and near-duplicate detection so one poisoned document cannot dominate the top-k with copy-pasted bait. Microsoft, Google, and OpenAI have all spent years warning that model outputs are only as safe as the surrounding application logic; RAG is no exception, and the application logic is where the bodies are buried.
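The low-trust cap and near-duplicate suppression can be sketched as a context-assembly pass over the ranked candidates (thresholds, tier labels, and the `difflib` similarity measure here are illustrative choices, not recommendations):

```python
from difflib import SequenceMatcher

def assemble_context(ranked: list[tuple[str, str]], k: int = 8,
                     max_low_trust: int = 2,
                     dup_threshold: float = 0.9) -> list[tuple[str, str]]:
    """Cap low-trust chunks and drop near-duplicates so one poisoned
    document cannot flood the top-k with copy-pasted bait.
    `ranked` is a list of (text, trust_tier) pairs, best first."""
    picked: list[tuple[str, str]] = []
    low_trust_used = 0
    for text, tier in ranked:
        if len(picked) == k:
            break
        if any(SequenceMatcher(None, text, p).ratio() > dup_threshold
               for p, _ in picked):
            continue  # near-duplicate of an already-selected chunk
        if tier != "internal":
            if low_trust_used == max_low_trust:
                continue  # low-trust budget exhausted
            low_trust_used += 1
        picked.append((text, tier))
    return picked
```

Note that the cap applies after deduplication, so an attacker cannot spend the low-trust budget on five near-identical copies of the same bait.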

Finally, instrument the pipeline. Log which chunks were retrieved, why they ranked, what tools were called, and which policy allowed them. Alert on queries that cause unusual fan-out, repeated retrieval from the same low-trust source, or tool calls that ask for data outside the user’s normal scope. If you cannot reconstruct a bad answer from retrieval logs, you are not operating a system; you are operating a rumor mill.
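A minimal shape for that audit record, as one structured log line per answer (the field names are illustrative, not a standard schema):

```python
import json

def retrieval_record(query_id: str, user_id: str,
                     chunks: list[dict], tool_calls: list[dict]) -> str:
    """Build one structured audit record per answer: which chunks were
    retrieved, why they ranked, which policy admitted them, and what
    tools ran. Emit it via your logger of choice."""
    return json.dumps({
        "query_id": query_id,
        "user_id": user_id,
        "chunks": [
            {"source": c["source"], "tier": c["tier"],
             "score": c["score"], "policy": c["policy"]}
            for c in chunks
        ],
        "tool_calls": tool_calls,
    })

# e.g. logging.getLogger("rag.audit").info(retrieval_record(...))
```

With records like this, reconstructing a bad answer is a log query instead of a forensics project, and the anomaly alerts above (unusual fan-out, repeated low-trust hits, out-of-scope tool calls) have something concrete to fire on.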

The Bottom Line

Treat every retrieved chunk as hostile until provenance and authorization say otherwise. Split indexes by trust tier, enforce ACLs before context assembly, and block tool calls that are not explicitly allowlisted for the user and the source document.

If you are already shipping RAG, audit for three things this week: low-trust content in the same index as internal docs, model-driven tool calls that can read arbitrary records, and any prompt or retriever path that strips source metadata before ranking. Fix those before you start tuning embeddings like they are the problem.
