
Data Exfiltration via LLMs: Covert Channels, Webhooks, and Detection

An attacker can turn an LLM into an exfiltration relay by hiding secrets in generated text patterns or by forcing tool calls that send data out through webhooks. This post shows the attack patterns, the telemetry that exposes them, and the controls that block leakage before the model becomes a silent data hose.

A model can leak data without ever “sending” it in the usual sense

When researchers disclosed the 2025 “EchoLeak” zero-click prompt-injection attack against Microsoft 365 Copilot, the ugly part was not that the model got tricked into saying something dumb. It was that the model could be induced to route sensitive content into places defenders already trust: tool calls, URLs, and outbound requests that look like normal application behavior. That is the real problem with LLM exfiltration. You do not need a flashy malware implant when the chat system itself will happily stringify secrets and hand them to a webhook.

The mechanics are boring in the way most breaches are boring: an attacker gets a model to summarize, transform, or “format” sensitive input, then uses the output channel as a covert transport. If the model can call a browser, a ticketing API, Slack, Discord, Zapier, or some homegrown webhook, the attacker has a relay. If the model cannot call tools, they can still force leakage through structured text patterns — base64 chunks, steganographic punctuation, zero-width characters, or “helpful” JSON that encodes data in keys and ordering. None of this requires magic. It requires a model that is allowed to be too useful.

Exfiltration patterns defenders actually see in logs

The cleanest abuse pattern is tool-call exfiltration. A prompt injection buried in a document, email, or webpage tells the model to “verify” data by calling a URL with the secret in a query string or request body. In practice, that means the outbound telemetry looks like a normal fetch to https://hooks.slack.com/..., https://api.zapier.com/..., or an internal webhook endpoint. Security teams miss it because they are staring at the chat transcript and not the tool invocation layer.
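A check for this pattern belongs at the tool-invocation layer, before the request leaves the boundary. The sketch below is illustrative, not a product: the function name, the 64-character threshold, and the base64-ish character class are all assumptions you would tune against your own traffic.

```python
# Sketch: flag tool-call URLs whose query strings carry long encoded values.
# Names and thresholds are illustrative assumptions, not from any framework.
import re
from urllib.parse import urlparse, parse_qsl

B64ISH = re.compile(r"^[A-Za-z0-9+/_=-]{64,}$")  # long base64/URL-safe runs

def flag_exfil_url(url: str, max_value_len: int = 64) -> list[str]:
    """Return names of query parameters that look like encoded payloads."""
    suspicious = []
    for key, value in parse_qsl(urlparse(url).query):
        if len(value) >= max_value_len and B64ISH.match(value):
            suspicious.append(key)
    return suspicious
```

A normal paginated API call passes clean; a webhook URL with an 80-character blob stuffed into a parameter gets flagged before the fetch runs.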

The second pattern is encoded text leakage. Models are good at preserving arbitrary strings if the attacker constrains the format. A prompt can ask for “one character per line,” “hex only,” or “JSON with each field split across multiple objects.” That turns a single model response into a high-bandwidth exfil channel that slips past DLP tuned for English text. If your detection stack only flags “password=”, you are already behind the curve. Secrets can be exfiltrated as innocuous-looking code comments, markdown tables, or even repeated punctuation where each symbol maps to a bit.
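Encoded-text leakage is detectable with dumb, deterministic scans over completions. A minimal sketch, with run-length thresholds chosen as assumptions rather than measured baselines:

```python
# Sketch: count encoding indicators in model output. Thresholds (40 hex chars,
# 48 base64 chars) are assumptions; calibrate against your own completions.
import re

HEX_RUN = re.compile(r"\b[0-9a-fA-F]{40,}\b")
B64_RUN = re.compile(r"\b[A-Za-z0-9+/]{48,}={0,2}\b")
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")  # invisible chars

def encoded_leak_score(text: str) -> int:
    """Higher score = more encoding artifacts embedded in the completion."""
    return (len(HEX_RUN.findall(text))
            + len(B64_RUN.findall(text))
            + len(ZERO_WIDTH.findall(text)))
```

Run it on every completion and alert on nonzero scores from sessions that normally produce plain prose. It will not catch punctuation steganography, but it raises the cost of the easy variants.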

The third pattern is recursion abuse. Attackers feed the model a document that contains instructions to summarize another document, then another, then another, each time preserving a little more of the original sensitive material. This is especially effective in systems that chain retrieval-augmented generation with agentic tools. The model is not “stealing” anything in the cinematic sense; it is being used as a compression engine for data the user never intended to leave the boundary.

Telemetry that catches the leak before the pager does

If you are only collecting prompt and completion text, you are missing the interesting part. The useful signals live in the tool layer and the egress path. Log every tool invocation with the exact arguments, destination host, request size, and user/session identity. For webhook-driven leakage, the tell is usually not the domain alone — it is the shape of the request. A model that normally posts 200-byte JSON blobs to Jira does not suddenly need to ship 8 KB of base64 to a fresh endpoint at 02:13 UTC.
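The record you want per tool call can be small and boring. This is one possible shape, not a standard schema; every field name here is an assumption about your own trace layout:

```python
# Sketch: one structured log line per tool invocation, tying identity,
# destination, and payload size together. Field names are illustrative.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ToolCallRecord:
    session_id: str
    user_id: str
    tool: str
    dest_host: str
    arg_bytes: int   # serialized argument size, for per-action baselining
    ts: float

def log_tool_call(session_id, user_id, tool, dest_host, args: dict) -> str:
    rec = ToolCallRecord(session_id, user_id, tool, dest_host,
                         len(json.dumps(args).encode()), time.time())
    return json.dumps(asdict(rec))
```

With `arg_bytes` and `dest_host` on every record, the “8 KB of base64 to a fresh endpoint” case above becomes a one-line query instead of a forensic reconstruction.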

Watch for these specific anomalies:

  • A single chat session generating a burst of outbound requests to one-off domains or newly registered webhook URLs.
  • Repeated tool calls with high-entropy arguments, especially if the entropy is concentrated in one field.
  • Output tokens that contain long runs of hex, base64, or URL-encoded data where the surrounding text is otherwise natural language.
  • Requests to services like Slack, Discord, Telegram, Pastebin, or generic automation platforms from model workers that should not be talking to them at all.
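
The second bullet — entropy concentrated in one field — is cheap to compute. A minimal sketch using Shannon entropy; the 4.5 bits/char threshold and 32-character minimum are assumptions to tune, not recommendations:

```python
# Sketch: flag tool-call arguments whose values look like encoded secrets.
# Threshold and minimum length are assumptions; tune on real traffic.
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits per character; random base64 is near 6, English prose near 4."""
    if not s:
        return 0.0
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())

def high_entropy_fields(args: dict, threshold: float = 4.5,
                        min_len: int = 32) -> list[str]:
    """Return argument names whose string values exceed the entropy bar."""
    return [k for k, v in args.items()
            if isinstance(v, str) and len(v) >= min_len
            and shannon_entropy(v) >= threshold]
```

Short or repetitive values pass; a long argument drawing on the full base64 alphabet gets flagged.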

Falco, CrowdStrike, and Wiz-style runtime telemetry can help if you instrument the container or node where the model worker runs, but the real win is at the application layer. You want the prompt, the tool call, the network request, and the user action tied together in one trace. Without that chain, every incident review devolves into “the model did it,” which is a lovely way to say “we failed to log the thing that mattered.”

Why “just block secrets in prompts” is a weak control

The standard advice is to redact secrets before they hit the model. Fine, do that. But it is not enough, because the exfiltration often happens after the model has already seen the sensitive material in retrieved context, file uploads, or tool outputs. If the assistant can read a customer record, an API token, or a contract clause, the horse is already out of the barn; the only question is whether the model can be induced to relay it.

A more useful control is to constrain tool authority by default. Most agent frameworks grant far too much latitude: broad URL fetch, arbitrary webhook posting, and permissive function schemas that accept free-form strings. Narrow those down. Use allowlists for destinations, fixed schemas for arguments, and per-tool rate limits. If a model needs to send a ticket to ServiceNow, it does not need a general-purpose HTTP client. If it needs to look up a CRM record, it does not need access to the entire internet.
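Deny-by-default tool authority can be this small. The allowlist below pins each tool to its destinations; the tool names and hosts are made-up examples, and a real gateway would also validate argument schemas:

```python
# Sketch: deny-by-default tool gateway. Tool names and hosts are examples,
# not from any real deployment; extend with per-tool argument schemas.
from urllib.parse import urlparse

ALLOWED_HOSTS = {
    "servicenow_ticket": {"example.service-now.com"},
    "crm_lookup": {"crm.internal.example"},
}

def authorize_tool_call(tool: str, url: str) -> bool:
    """Allow only known tools posting to their pinned destinations."""
    host = urlparse(url).hostname or ""
    return host in ALLOWED_HOSTS.get(tool, set())
```

An injected instruction to “verify” data against `hooks.slack.com` fails here regardless of how persuasive the prompt was, which is the point: the control does not depend on detecting the injection.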

This is where a lot of teams get smug and wrong. They assume the model is the risk. The bigger risk is the orchestration layer that treats the model output as trusted input to a network-capable agent. Strip away the “AI” branding and you are left with an app that takes untrusted text and executes side effects. Security has had a name for that for decades.

Detection rules that work better than sentiment analysis on prompts

You are not going to catch this by asking whether the prompt “sounds malicious.” Use deterministic controls. Flag tool calls that contain high-entropy payloads, unusual character distributions, or data sizes that exceed the normal profile for that action. Alert when a session that usually produces short summaries suddenly emits structured blobs, long URLs, or repeated encoding patterns. If your model stack supports it, classify outbound content by destination sensitivity: a webhook to an internal audit system is not equivalent to a webhook to a random SaaS endpoint.
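The size-profile rule can be a few lines if you keep per-action history. A sketch under obvious assumptions: a 3-sigma cutoff, a minimum baseline of five observations, and a floor on sigma so constant histories still produce a usable threshold.

```python
# Sketch: flag payloads far above the per-action baseline. The k=3 cutoff,
# minimum history of 5, and sigma floor of 1.0 are all assumptions.
from statistics import mean, stdev

def size_anomaly(history: list[int], new_size: int, k: float = 3.0) -> bool:
    """True if new_size exceeds mean + k*stdev of the observed baseline."""
    if len(history) < 5:
        return False   # not enough baseline to judge; a policy choice
    mu, sigma = mean(history), stdev(history)
    return new_size > mu + k * max(sigma, 1.0)
```

A Jira action that has only ever posted ~200-byte blobs trips this the first time it tries to ship 8 KB, with no model-of-the-model required.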

Build detections around sequence, not just content. A prompt injection followed by a retrieval call, followed by a tool invocation to an external domain, followed by a response containing encoded text is a far better signal than any single event. This is the same reason defenders stopped relying on one-off malware hashes years ago. The chain matters more than the artifact.
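Sequence matching over a session's event stream is an in-order subsequence check, not a state machine. The event labels below are assumptions about your trace schema; the iterator trick allows arbitrary gaps between the steps:

```python
# Sketch: detect the injection -> retrieval -> external tool call -> encoded
# output chain within one session. Event labels are schema assumptions.
SUSPECT_CHAIN = ["prompt_injection_hint", "retrieval",
                 "external_tool_call", "encoded_output"]

def chain_detected(events: list[str]) -> bool:
    """True if the suspect sequence appears in order, gaps allowed."""
    it = iter(events)
    return all(step in it for step in SUSPECT_CHAIN)
```

Each `step in it` consumes the iterator up to the match, so the steps must occur in order but need not be adjacent, which matches how real attack chains interleave with benign events.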

Also, do not ignore the boring controls. Egress filtering, DNS logging, and proxy inspection still work. If your LLM worker pods cannot reach arbitrary internet destinations, a whole class of webhook exfiltration dies on the vine. If they can only reach a small set of approved APIs, the attacker has to work much harder to find a relay. Security teams love to call that “friction.” Attackers call it “annoying,” which is usually the same thing.
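Even without full proxy inspection, first-seen destination tracking over DNS or proxy logs pays for itself. A toy in-memory sketch — a real version would persist state and scope it per workload:

```python
# Sketch: first-seen destination tracking. A fresh webhook endpoint appearing
# mid-session is a strong signal. In-memory set is a simplifying assumption.
seen_hosts: set[str] = set()

def is_first_seen(host: str) -> bool:
    """True the first time a destination host appears in egress logs."""
    novel = host not in seen_hosts
    seen_hosts.add(host)
    return novel
```

Pair it with the egress allowlist: anything that is both off-list and first-seen should page someone, not just land in a dashboard.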

The Bottom Line

Treat LLM tool calls as outbound data movement, not just application behavior. Log every invocation, restrict destinations with allowlists, and alert on high-entropy arguments, unusual payload sizes, and fresh webhook endpoints. If a model can reach Slack, Discord, Zapier, or arbitrary HTTP without a hard business need, you have already given it an exfil path.
