·5 min read

Prompt Injection Attacks: How They Work and How to Stop Them

Prompt injection isn’t just “bad input” — indirect attacks can hide inside webpages, emails, or documents and override an AI system’s instructions even when the prompt itself looks clean. This post breaks down why traditional sanitization fails and which defenses actually help today: sandboxing, output validation, and privilege separation.

Scattered Spider and the Old Lesson You Keep Relearning

Scattered Spider didn’t need a zero-day to break into MGM Resorts in 2023. They used help desk social engineering, got identity controls to do the heavy lifting, and turned a phone call into a ransomware event that shut down slot machines, hotel systems, and reservation workflows. That’s the useful comparison for prompt injection: the attacker doesn’t need to “hack the model” if they can get the model to obey the wrong authority.

Prompt injection is not just bad input. It’s an instruction-conflict problem. If your LLM reads an email, a web page, a PDF, or a ticket and treats embedded text as higher-priority than your system prompt, you’ve built a confused deputy with a very expensive autocomplete engine. The prompt may look clean. The payload is sitting in the content you asked it to process. That’s the part people keep missing.

Why Indirect Prompt Injection Works Better Than You Want

Direct prompt injection is the easy case: user types “ignore previous instructions” and your app either complies or doesn’t. Indirect prompt injection is nastier because the malicious instruction is hidden in data the model is supposed to summarize, classify, or act on. A webpage can contain invisible or low-salience text. A PDF can bury instructions in white-on-white text or in OCR-processed garbage. An email can include a line that only matters once the model is asked to “helpfully” extract action items.

This is not theoretical. Researchers have already shown that LLMs can be steered by instructions embedded in retrieved documents, browser content, and tool outputs. Once you connect the model to Gmail, Slack, SharePoint, or a browser plugin, you’ve given untrusted content a path to policy influence. That’s not “AI risk” in the abstract. That’s an access-control failure with better branding.

The uncomfortable bit: your model is not parsing intent the way you wish it would. It is pattern-matching across all tokens in context. If the malicious instruction is linguistically stronger, better placed, or reinforced by retrieval, it can win. The model does not care that the attacker’s text was hidden in a footer, comment, or alt attribute. It just sees tokens. Very democratic, very stupid.

Why Sanitization Alone Is a Bad Joke

The usual advice is to sanitize input. Good luck with that when the attack surface is a web page, a DOCX, or a customer email that needs to preserve meaning. You can strip HTML tags, but the payload can live in visible prose. You can remove “ignore previous instructions,” but the attacker can phrase it as a roleplay, a policy citation, or a fake system message. You can normalize whitespace, and the model will still happily read the sentence you left behind.

This is where people reach for “prompt hardening” like it’s a firewall. It isn’t. A stronger system prompt helps, but it does not create a security boundary. If your app lets untrusted content share context with instructions and tools, you’ve already lost the clean separation you thought you had. Apache Struts CVE-2017-5638 was an injection problem too: the parser trusted attacker-controlled input in the wrong place, and Equifax paid for that mistake at scale. Different stack, same human optimism.

The contrarian point: stop assuming the fix is better wording. Sometimes the fix is not giving the model access to the thing at all. If a workflow can be done with deterministic code, do that. Use the model for the part that actually needs language understanding, not as a universal middleware layer because someone wanted a demo.

The Defenses That Actually Help

Sandboxing is the first real control, and it’s boring for a reason. If the model can browse the web, read internal docs, and call tools, isolate those capabilities. Put retrieval in a constrained service. Put browser access in a locked-down container with no ambient credentials. If the model needs to summarize a page, feed it a rendered snapshot or extracted text, not a live session with cookies and admin tokens attached. The browser should be disposable; the credentials should not be.

Privilege separation matters just as much. The model should not have the same rights as the user initiating the request, and it definitely should not have a service account that can write to systems it also reads. Give it the minimum tool scope required for the task, and make every tool call explicit. If the model can draft an email, it should not be able to send it without a separate approval step. If it can read a ticket, it should not be able to close it, escalate it, and notify finance in one breath.

Output validation is the other control people underuse. Don’t trust the model’s answer because it sounds confident. Validate structure, schema, destination, and allowed actions before anything leaves the boundary. If the model is supposed to extract a Jira ticket ID, validate that it matches the expected format and source. If it’s supposed to summarize an invoice, compare the extracted fields against the document and reject anything that implies a transfer, password reset, or policy exception. Models are good at generating plausible nonsense. You already knew that from the meeting notes.

What to Test Before You Pretend This Is Fine

You need to red-team the workflow, not the model in isolation. Feed it webpages with hidden instructions, emails with conflicting directives, and documents that contain tool-use bait. Test retrieval poisoning by seeding your vector store with malicious content and seeing whether the model elevates it over your system prompt. If you use Microsoft Copilot, Google Gemini, or ChatGPT-style internal assistants, test the exact connectors and permissions you’ve enabled. The attack usually lives in the integration, not the model weights.

Also test for tool abuse. A prompt injection that only changes the summary is annoying. One that triggers a browser fetch, a file write, or an outbound message is an incident. If your agent can call APIs, assume an attacker will try to make it call the wrong one. That’s not paranoia. That’s just Tuesday with better autocomplete.

The Bottom Line

Treat prompt injection as an access-control problem, not a content-filtering problem. Put untrusted text in a sandboxed path, keep tool permissions narrow, and require validation before any action leaves the model boundary.

If your assistant can read, browse, and act, test those paths with malicious documents and indirect payloads before someone else does. Then remove any capability you cannot justify. Convenience is not a control.

References

  • https://www.microsoft.com/en-us/security/blog/2023/03/24/defending-against-prompt-injection-attacks/
  • https://arxiv.org/abs/2302.12173
  • https://simonwillison.net/2023/May/2/prompt-injection-examples/
  • https://www.nccgroup.com/us/research-blog/indirect-prompt-injection-attacks-in-llms/
  • https://owasp.org/www-project-top-10-for-large-language-model-applications/

Related posts

2026’s AI-Phishing Problem Is Moving Past Email Filters

Kratikal’s warning points to a tougher reality: AI-assisted attackers can now tailor lures, timing, and payloads fast enough to slip through static phishing defenses. The next defense question is whether organizations can combine human verification, adaptive detection, and identity checks before a convincing message turns into a breach.

Why AI Security Teams Are Embracing Model Context Protocol Guardrails

As more copilots and agents plug into enterprise tools through MCP, the biggest risk is no longer just prompt injection—it’s which servers, scopes, and data sources the model can reach. Practitioners need to understand how MCP allowlists, server attestation, and per-tool permissions can stop a trusted connector from becoming a hidden exfiltration path.

AI Red Teams Are Standardizing on Structured Output Attacks

Attackers are no longer just trying to jailbreak a model’s text—they’re targeting the JSON, XML, and function-call formats that modern AI systems trust downstream. Security teams need to understand how structured outputs can silently turn a harmless-looking response into unsafe automation or data leakage.

← All posts