·7 min read

AI Agents Are the New Attack Surface: What Security Teams Need to Know

Autonomous AI agents can browse the web, run code, and call APIs on your behalf. Attackers have noticed — and are already exploiting them.

AI Agents Are the New Attack Surface: What Security Teams Need to Know

CVE-2024-3094 was a tidy reminder that “trusted software” is often just software nobody has looked at closely enough. One suspiciously slow SSH login on a Debian system led Andres Freund to the XZ Utils backdoor, which had been sitting in widely distributed releases before anyone noticed the extra half-second of latency. Autonomous AI agents deserve the same level of suspicion: they are being handed browser sessions, shell access, and API tokens, then told to “just get it done.”

The problem is not that an LLM can hallucinate. We already knew that. The problem is that agent frameworks now turn model output into side effects: click this button, run that command, call this endpoint, approve this workflow. Once you connect an agent to Chrome, GitHub, Jira, Slack, AWS, or your internal ticketing system, you have created a machine that can be tricked into doing real work with real credentials. Attackers do not need the model to be smart. They only need it to be obedient.

Prompt Injection Is Not a Parlor Trick

The industry keeps treating prompt injection like a novelty demo, usually involving a malicious PDF or a webpage that tells the model to ignore previous instructions. That is the kindergarten version. The real issue is dataflow: an agent that reads untrusted content and then acts on it is functionally a confused deputy. If it can browse the web and copy text into a shell, then a hostile page can smuggle instructions through the very channel the agent is supposed to trust.

Researchers have already shown this in public. In 2024, security teams demonstrated indirect prompt injection against tools like Microsoft Copilot and Google Gemini by hiding instructions in documents, emails, and webpages the agent was allowed to process. The attack does not require a zero-day. It requires a model with tool access and a workflow that never separates “content” from “commands.” That is why “just add a system prompt telling it not to obey web pages” is not a control; it is a wish.

The more dangerous variant is when the agent has access to secrets in the same context window it uses to reason about user requests. If the agent can see a Jira ticket, a Slack thread, and a password reset link in one pass, then a malicious instruction buried in one of those sources can steer it toward exfiltration or unauthorized action. The model does not need to “understand” the attack. It only needs to follow the path of least resistance.

The Real Risk Is Tool Abuse, Not Model Theft

Everyone loves to argue about model poisoning, fine-tuning sabotage, and whether someone can “steal the weights.” For most defenders, that is a distraction. The immediate exposure is the tool layer: browser automation, code execution, API calls, and delegated credentials. OpenAI’s own Operator-style workflows, Anthropic’s tool use, and Microsoft Copilot integrations all point in the same direction — the model is not the asset, the permissions are.

If an agent can call your internal APIs, then it can often do more damage than a human phisher with a stolen password. A compromised agent session can create IAM keys, approve OAuth grants, open pull requests, or trigger CI/CD jobs. In cloud environments, that is especially ugly because agents are frequently given broad, short-lived tokens that are assumed to be “safe” because they expire quickly. Expiration does not help if the token is used to mint a more durable foothold.

This is where the standard advice gets lazy. “Least privilege” is correct, but insufficient. You also need least capability per step. An agent that can read a customer record should not be able to send email from that record, open a browser tab, and then paste the contents into a third-party SaaS form. That chain is how you turn a harmless lookup into a data leak.

Why SOAR Playbooks Are the Wrong Mental Model

A lot of teams are bolting agents onto existing automation and calling it governance. That is a mistake. Traditional SOAR playbooks are deterministic: if X, then Y, with a bounded set of actions. AI agents are probabilistic planners. They can decide to take a different route, retry failed actions, or synthesize new steps from the environment. That flexibility is the feature — and the attack surface.

Security teams already know what happens when automation is allowed to act on untrusted input. In 2017, the NotPetya outbreak abused legitimate software update paths to spread with machine speed. In 2020, SolarWinds showed how a trusted execution chain can be weaponized at scale. Agents are not the same attack, but they rhyme: a trusted system, fed adversarial input, executing with more authority than the input deserves. The difference is that this time the attacker may not need to compromise the vendor. They may just need to feed the agent a poisoned prompt, webpage, or document.

That is why “agent approvals” are not enough if the approval prompt itself is generated from untrusted content. If the model is summarizing a webpage and then asking for approval, the attacker controls the framing. You are not reviewing raw intent; you are reviewing the model’s version of reality.

Controls That Actually Reduce Exposure

Start by treating every external input as hostile, even if it came from your own tenant. Web pages, emails, PDFs, tickets, and chat messages should be parsed as data, never as instructions. If your agent must act on them, strip executable text out of the reasoning path and pass only the minimum structured fields needed for the task. That sounds obvious until you inspect a lot of “agent” implementations and find they are just browser macros with a chatbot attached.

Next, separate read and write privileges. An agent that can gather information should not automatically be able to act on it. Put high-risk actions — sending email, approving payments, changing IAM, pushing code, rotating secrets — behind a second control plane with explicit human confirmation and server-side policy checks. Do not trust the model to self-police. Models are excellent at sounding careful while doing exactly what the attacker nudged them to do.

Also log the entire tool chain, not just the final answer. You want the prompt, the retrieved documents, the tool calls, the arguments, and the resulting side effects. If an agent created a Jira ticket, downloaded a file, and then posted to Slack, that sequence matters. Without it, post-incident review becomes archaeology.

The Uncomfortable Part: Sometimes the Agent Should Be Dumb

The fashionable assumption is that agents should have broad autonomy because that is what makes them useful. In security, that is often backwards. The safest agent is usually the one with a narrow job, a constrained corpus, and no direct path to secrets or production systems. If you need a system to triage alerts, it does not need browser access to the open web. If you need it to summarize tickets, it does not need to execute code. If you need it to draft a response, it does not need a token that can send the response.

That is the contrarian bit: the best defense may be to make the agent less capable than the demo promised. Most organizations do not have a model problem. They have a permissions problem dressed up as innovation.

The Bottom Line

Inventory every AI agent with browser, shell, or API access and map each one to the exact credentials it can use, then remove anything that can write, approve, or exfiltrate by default. For any workflow that touches secrets, payments, IAM, or production code, require server-side policy enforcement and a human approval that is based on raw inputs, not the model’s summary. If you cannot explain the full tool chain in one sentence, you have already given the agent too much power.

References

← All posts