LLM Observability: What to Log, Monitor, and Alert On

A production LLM stack should log prompts, responses, model/version metadata, latency, token usage, refusals, and safety events so teams can detect drift, prompt injection, and cost spikes before users do. This post compares where Langfuse, Helicone, and Arize fit in the pipeline—and which signals each one surfaces best for alerting and anomaly detection.

LLM Logs Are Not Optional When Your Model Can Be Tricked Into Talking

CVE-2024-3094 sat in XZ Utils long enough to get packaged, shipped, and trusted before Andres Freund noticed a few hundred milliseconds of SSH latency and started asking unpleasant questions. That is the right mental model for LLM observability: the thing that breaks production usually does not announce itself with a red banner. It shows up as a weird latency bump, a token spike, a refusal rate that drifts from 2% to 18%, or a prompt that suddenly contains instructions you never wrote.

If you are shipping an LLM feature behind a thin layer of product optimism, log the raw inputs and outputs, not just the happy-path API status code. A prompt that triggers a refusal from GPT-4o, Claude 3.5 Sonnet, or Gemini 1.5 Pro is not just a UX event; it is evidence that your system is being probed, jailbroken, or fed malformed context. The same goes for a response that suddenly gets 4x longer than usual, because token blowups are how you discover your “cheap” assistant is quietly becoming your largest cloud bill.

Log the Exact Prompt, Not the Sanitized Fairy Tale

The minimum useful record is the full prompt payload, the model name, the model version, the system prompt hash, the retrieval context, the tool calls, the final response, latency, input tokens, output tokens, refusal flags, and any safety classification returned by the model provider. If you are using OpenAI’s Responses API, Anthropic’s Messages API, or a hosted wrapper in front of either, capture the provider request ID and your own correlation ID so you can tie one bad answer back to one specific call chain.
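The record described above flattens naturally into one row per model call. A minimal sketch as a Python dataclass (field names are illustrative, not a standard schema):

```python
import hashlib
import time
from dataclasses import asdict, dataclass, field

@dataclass
class LLMCallRecord:
    """One row per model call. Field names are illustrative."""
    correlation_id: str           # your own ID, threaded through the call chain
    provider_request_id: str      # the request ID the provider returned
    model: str                    # exact identifier, e.g. a dated snapshot name
    system_prompt_hash: str       # hash, so template changes are diffable
    prompt: str                   # full payload, stored under access control
    retrieval_context: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    response: str = ""
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    refused: bool = False
    safety_labels: list[str] = field(default_factory=list)
    ts: float = field(default_factory=time.time)

def hash_system_prompt(template: str) -> str:
    """Stable short hash so two releases with different templates are distinguishable."""
    return hashlib.sha256(template.encode()).hexdigest()[:16]
```

Hashing the system prompt rather than storing only the rendered text means a template change shows up as a clean discontinuity in your logs instead of a fuzzy diff.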

Sanitizing prompts before storage sounds noble until you need to reconstruct a prompt injection that came in through a PDF, a support ticket, or a RAG chunk from Confluence. Log redacted views for analysts, sure, but keep the original somewhere with access control and retention rules that match the blast radius. If your app handles PHI, payment data, or source code, “we deleted the prompt” is not a security control; it is an evidence disposal policy.

One contrarian point: do not rely on model refusals as your primary safety signal. Refusals are noisy, provider-specific, and easy to game. A jailbreak that gets a “safe completion” is often more interesting than a hard refusal, because it means the model stayed compliant while still leaking policy-sensitive behavior.

The Signals That Actually Catch Drift, Injection, and Cost Spikes

Drift in LLM land is not just accuracy decay on a benchmark slide deck. It shows up when the same prompt class starts producing different answer lengths, different tool-use patterns, or different refusal behavior after a model upgrade from GPT-4.1 to GPT-4o mini, a prompt template change, or a retrieval index refresh. Track response length distributions, refusal rates by route, tool-call frequency, and top-k prompt clusters over time. If your “invoice assistant” suddenly starts issuing 9x more SQL queries to the billing database, you have either a prompt injection problem or a very expensive product idea.
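The refusal-rate drift described above is cheap to track per route with a rolling window against a known baseline. A minimal sketch, assuming you already emit a boolean refusal flag per call (window size and multiplier are illustrative and should be tuned per route):

```python
from collections import deque

class RefusalDriftMonitor:
    """Flag a route whose rolling refusal rate drifts well above its baseline."""

    def __init__(self, baseline_rate: float, window: int = 100, factor: float = 3.0):
        self.baseline = baseline_rate      # e.g. 0.02 for a route that refuses ~2%
        self.factor = factor               # alert when rate > baseline * factor
        self.events: deque[bool] = deque(maxlen=window)

    def observe(self, refused: bool) -> bool:
        """Record one call; return True once the window looks anomalous."""
        self.events.append(refused)
        if len(self.events) < self.events.maxlen:
            return False  # not enough data to judge yet
        rate = sum(self.events) / len(self.events)
        return rate > self.baseline * self.factor
```

The same shape works for any per-route rate: tool-call frequency, empty responses, or safety-label hits.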

Prompt injection is easiest to spot when you log the retrieved context alongside the user prompt. The attack often rides in through RAG: a poisoned web page, a malicious document, or a support article that tells the model to ignore prior instructions and exfiltrate secrets. Without the retrieved chunks, you are left guessing whether the model hallucinated or your pipeline fed it garbage. With them, you can compare the exact chunk that preceded a bad tool call against the ones that did not.
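Once the retrieved chunks are in the log record, you can run a cheap tripwire over them at triage time. A sketch with a few illustrative patterns; real injection is far more varied than any fixed list, so treat this as a first-pass filter for analysts, not a security control:

```python
import re

# Illustrative override phrases; extend from your own incident data.
INJECTION_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"ignore (all )?(prior|previous) instructions",
        r"disregard the system prompt",
        r"you are now",
    )
]

def flag_suspect_chunks(chunks: list[str]) -> list[tuple[int, str]]:
    """Return (index, chunk) pairs that match a known override phrase,
    so a bad tool call can be lined up against the chunk that preceded it."""
    return [
        (i, chunk)
        for i, chunk in enumerate(chunks)
        if any(p.search(chunk) for p in INJECTION_PATTERNS)
    ]
```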

Cost spikes are usually a boring operational problem until they are not. A single looping agent can burn through thousands of tokens in minutes if your stop conditions are sloppy. Alert on p95 and p99 token counts per route, not just aggregate spend. If a customer-facing workflow normally uses 800 input tokens and 200 output tokens, and it suddenly starts averaging 6,000 input tokens because someone pasted an entire contract into the chat box, you want that page before finance does.
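A per-route percentile alert on token counts needs nothing fancier than nearest-rank percentiles over a recent window. A minimal sketch (the 2x spike factor is an illustrative threshold, not a recommendation):

```python
import math

def percentile(values: list[int], pct: float) -> float:
    """Nearest-rank percentile; good enough for alerting, no interpolation."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return float(ordered[rank])

def token_spike(route_tokens: list[int], baseline_p95: float, factor: float = 2.0) -> bool:
    """True when the route's current p95 token count is far above its baseline."""
    return percentile(route_tokens, 95) > baseline_p95 * factor
```

Run this per route, not globally: a contract pasted into one chat box should page the owner of that route, not vanish into an aggregate spend graph.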

Where Langfuse, Helicone, and Arize Fit in the Stack

Langfuse is the one you reach for when you want application-level tracing and prompt/version discipline. It is strongest when you need to see a full trace from user input to retrieval to generation to tool call, with prompt templates, datasets, and evaluation hooks attached. If you are iterating on prompt versions and need to compare outputs across releases, Langfuse is useful because it treats prompts as versioned artifacts instead of disposable strings.

Helicone sits closer to the network edge and is useful when you want request-level observability with lightweight routing and proxy-style capture. That makes it practical for teams that want to instrument OpenAI or Anthropic traffic without rebuilding their app. It is especially handy for catching latency outliers, token consumption, and provider-level errors across many endpoints. If your main problem is “we need to see every request without turning the app inside out,” Helicone is the blunt instrument that works.
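The proxy-style capture typically amounts to pointing your existing OpenAI client at Helicone's gateway with an auth header, per Helicone's documented proxy integration (verify the base URL and header names against current docs before relying on this; the helper function and its parameters are illustrative):

```python
HELICONE_BASE_URL = "https://oai.helicone.ai/v1"  # assumption: Helicone's OpenAI proxy endpoint

def helicone_client_kwargs(helicone_api_key: str, user_id: str) -> dict:
    """Kwargs to pass to openai.OpenAI(...) so every request flows through
    Helicone and gets captured without app changes."""
    return {
        "base_url": HELICONE_BASE_URL,
        "default_headers": {
            "Helicone-Auth": f"Bearer {helicone_api_key}",
            # Optional tagging header so cost/latency can be sliced per user later.
            "Helicone-User-Id": user_id,
        },
    }
```

The appeal is exactly what the paragraph says: one config change at the client, and every request is visible, instead of instrumenting each call site.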

Arize is the strongest of the three when the question is model quality at scale, especially if you already care about evaluation, drift, and production ML monitoring. Arize Phoenix and the broader Arize stack are good at surfacing embedding drift, retrieval quality problems, and evaluation workflows that go beyond raw logs. If your LLM feature depends on retrieval, reranking, or classification, Arize is the better fit for finding where the pipeline degraded rather than just where the API call happened.

The practical split is simple: Langfuse for traceability and prompt/version management, Helicone for request capture and cost/latency visibility, Arize for evaluation and drift analysis. None of them replaces your SIEM, and none should be treated like a security control by itself. They are telemetry layers. If you are not exporting alerts on refusal spikes, prompt-length anomalies, and tool-call abuse into PagerDuty, Splunk, or Sentinel, you are just collecting expensive screenshots.

Alert on Behavior Changes, Not Just Errors

The alerts that matter are behavioral. Page on a sudden increase in refusals for one route, a jump in average output tokens, a change in the ratio of tool calls to user requests, or a spike in prompts containing jailbreak phrases like “ignore previous instructions” and “developer message.” Those are the early indicators that somebody found a weak spot, or that your own release turned a normal workflow into a token furnace.
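The tool-call-to-request ratio mentioned above is one of the cheaper behavioral signals to implement. A minimal sketch per route; baseline and tolerance are illustrative:

```python
class ToolCallRatioMonitor:
    """Flag a route whose tool-calls-per-request ratio shifts sharply
    away from its historical baseline."""

    def __init__(self, baseline_ratio: float, tolerance: float = 0.5):
        self.baseline = baseline_ratio   # e.g. 1.0 tool call per request historically
        self.tolerance = tolerance       # alert beyond +/- 50% of baseline
        self.requests = 0
        self.tool_calls = 0

    def observe(self, n_tool_calls: int) -> None:
        self.requests += 1
        self.tool_calls += n_tool_calls

    def anomalous(self, min_requests: int = 100) -> bool:
        if self.requests < min_requests:
            return False  # too little data to judge
        ratio = self.tool_calls / self.requests
        return abs(ratio - self.baseline) > self.tolerance * self.baseline
```

An "invoice assistant" that historically averages one SQL query per request and starts averaging nine trips this immediately, long before anyone reads a transcript.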

Also alert on model/version changes. A silent switch from one provider snapshot to another can change formatting, safety behavior, and tool-use reliability overnight. If you do not log the exact model identifier and provider version, you will spend hours arguing about whether the regression came from the prompt, the retriever, or the vendor’s invisible patch Tuesday. Spoiler: it is usually all three.

The Bottom Line

Log the full prompt/response pair, retrieved context, tool calls, model/version, latency, token counts, and refusal/safety metadata for every production request, then export those fields into your SIEM or alerting stack. Build alerts for token spikes, refusal-rate drift, and tool-call anomalies per route, not just for HTTP errors.

Use Langfuse if you need trace-level debugging and prompt/version comparisons, Helicone if you want request capture at the edge, and Arize if retrieval quality and drift are the real failure modes. If you cannot reconstruct one bad answer end-to-end from logs, you do not have observability; you have a bill.
