LLM Observability: What to Log, Monitor, and Alert On
A production LLM stack should log prompts, responses, model/version metadata, latency, token usage, refusals, and safety events so teams can detect drift, prompt injection, and cost spikes before users do. This post compares where Langfuse, Helicone, and Arize fit in the pipeline—and which signals each one surfaces best for alerting and anomaly detection.
HTTP/2’s 2023 record still matters more than your model demo
CVE-2023-44487, the HTTP/2 Rapid Reset flaw, powered what were then the largest DDoS attacks on record, with Cloudflare reporting a peak just above 201 million requests per second. That should already have cured anyone of the idea that “the protocol is fine, the app is the problem.” LLM stacks are heading for the same lesson: the failure mode is not the model alone, it’s the whole path around it.
If you’re shipping LLM features, you need logs that answer four questions fast: what came in, what the model did, what it cost, and whether the output crossed a line. That means prompts, responses, model and version metadata, latency, token usage, refusals, tool calls, and safety events. Not because observability is fashionable. Because when a prompt injection lands, you will want the chain of evidence before someone starts “debugging” by deleting the only useful traces.
What to log before the first user finds the edge case
At minimum, log the full prompt payload, the system prompt version, the model name, the provider, the temperature, max tokens, tool/function calls, response text, finish reason, latency, prompt tokens, completion tokens, and a request or trace ID that survives retries. If you use retrieval, log the document IDs and chunk hashes, not just “RAG enabled,” because that’s how you later prove the model saw a poisoned snippet from the wrong index.
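The field list above maps fairly directly onto one structured record per model call. What follows is a minimal sketch, not a schema any of the tools below require: `LLMCallRecord`, `new_trace_id`, and `chunk_hash` are hypothetical names, and the 16-character hash truncation is an arbitrary choice.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field, replace
from typing import Any, Dict, List


def new_trace_id() -> str:
    """Generate the ID once per user request and reuse it on every retry."""
    return uuid.uuid4().hex


def chunk_hash(text: str) -> str:
    """Stable content hash, so a retrieved chunk can be matched later."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass
class LLMCallRecord:
    # Hypothetical schema: one record per model call attempt.
    trace_id: str
    provider: str
    model: str
    system_prompt_version: str
    prompt: str
    response: str
    finish_reason: str
    temperature: float
    max_tokens: int
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)
    retrieved_doc_ids: List[str] = field(default_factory=list)
    retrieved_chunk_hashes: List[str] = field(default_factory=list)
    attempt: int = 1
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line for whatever log sink you use."""
        return json.dumps(asdict(self), sort_keys=True)


def retry_of(record: LLMCallRecord) -> LLMCallRecord:
    """Copy a record for a retry: same trace_id, next attempt number."""
    return replace(record, attempt=record.attempt + 1, timestamp=time.time())
```

Keeping `trace_id` constant while `attempt` increments is what lets you line up the retries of one request afterwards.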
The non-obvious bit: log refusals and safety filter outcomes as first-class events, not as an afterthought. A refusal rate that jumps from 2% to 18% after a prompt template change is often the first sign that you broke your own system prompt, changed a moderation threshold, or exposed the model to a new injection pattern. The model is not “being weird.” You changed something.
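Treating refusals as first-class events mostly means giving each call an explicit outcome and tagging it with the prompt version that produced it. A rough sketch, assuming each logged event is a dict with hypothetical `system_prompt_version` and `outcome` keys:

```python
from collections import defaultdict


def refusal_rate_by_version(events):
    """Return {system_prompt_version: refusal rate} from logged call events.

    Each event is assumed to be a dict with a "system_prompt_version" key and
    an "outcome" key whose value is "refusal" when the model declined.
    """
    counts = defaultdict(lambda: [0, 0])  # version -> [refusals, total]
    for event in events:
        bucket = counts[event["system_prompt_version"]]
        bucket[1] += 1
        if event["outcome"] == "refusal":
            bucket[0] += 1
    return {version: refusals / total
            for version, (refusals, total) in counts.items()}
```

Grouping by version is the point: a jump that lines up exactly with a template rollout tells you where to look first.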
Also log the user-facing latency and the backend latency separately. If you only keep end-to-end timing, you’ll miss the pattern where the model is fast but your retrieval layer is crawling, which is usually the real reason the product feels broken. Users do not care which box is at fault. They care that the spinner is still spinning.
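One way to keep those timings apart is to time each stage of a request under its own name. The `StageTimer` class below is a hypothetical helper, sketched with an injectable clock so it can be tested without sleeping:

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock milliseconds per named stage of one request."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.stages_ms = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Name of the stage with the most accumulated time."""
        return max(self.stages_ms, key=self.stages_ms.get)
```

Wrap retrieval in `with timer.stage("retrieval"):` and the model call in `with timer.stage("model"):`, then attach `timer.stages_ms` to the trace alongside the end-to-end number.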
The signals that catch drift, injection, and cost spikes
Prompt injection rarely looks dramatic in the logs. It looks like a normal prompt with a sudden instruction to ignore previous directions, reveal hidden prompts, or call a tool with an oddly specific argument. That’s why you want to alert on semantic shifts in prompts, meaning inputs that read unlike what a route normally receives, and not just on keywords. A crude regex for “ignore previous” is useful, but it is not a strategy. It is a bandage with a dashboard.
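To make the difference concrete, here is a sketch that pairs the crude keyword check with a novelty score against prompts a route normally sees. The patterns are illustrative, not a vetted list, and word-overlap (Jaccard) similarity is a deliberately cheap stand-in for the embedding distance you would more likely use in production.

```python
import re

# Illustrative patterns only; real injections are far more varied.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) (instructions|directions)", re.I),
    re.compile(r"(reveal|print|show) (your |the )?(system|hidden) prompt", re.I),
]


def keyword_flags(prompt):
    """Return the patterns that match. Useful as one signal, not as a verdict."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(prompt)]


def _tokens(text):
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def novelty(prompt, baseline_prompts):
    """Score from 0.0 (same word set as a baseline prompt) to 1.0 (no overlap)."""
    tokens = _tokens(prompt)
    best_similarity = 0.0
    for baseline in baseline_prompts:
        baseline_tokens = _tokens(baseline)
        union = tokens | baseline_tokens
        if union:
            similarity = len(tokens & baseline_tokens) / len(union)
            best_similarity = max(best_similarity, similarity)
    return 1.0 - best_similarity
```

A prompt that trips no pattern but scores high on novelty for its route is still worth a sampled look.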
For drift, watch response length, refusal rate, tool-call frequency, and top-level topic distribution over time. If your support bot starts producing longer answers and more citations after a model swap from GPT-4o to Claude 3.5 Sonnet, that may be a feature. If it starts hallucinating policy exceptions, that’s a problem. The difference is in the trend line, not the press release.
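For the topic-distribution signal, one simple comparison is the total variation distance between the label mix of two time windows. This sketch assumes you already attach a topic label to each response (how you classify topics is out of scope here), and the function names are hypothetical.

```python
from collections import Counter


def distribution(labels):
    """Turn a list of topic labels into {label: share of total}."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0.0 for identical mixes, 1.0 for disjoint ones."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)
```

Comparing last week to this week gives a single number you can trend; what counts as too large is something you have to calibrate on your own traffic.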
For cost, alert on token spikes per route, per tenant, and per prompt template. One noisy customer can burn through budget faster than a bad quarter. This is where people still make the mistake of treating cost as finance’s problem. It isn’t. It’s an abuse-detection problem with an invoice attached.
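A sketch of that per-route, per-tenant, per-template roll-up follows. The record keys, the 3x ratio, and the 10,000-token floor are all assumptions picked for illustration; tune them against your own traffic.

```python
from collections import defaultdict


def tokens_by_key(records, key_fields=("route", "tenant", "template")):
    """Sum prompt + completion tokens per (route, tenant, template)."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[f] for f in key_fields)
        totals[key] += record["prompt_tokens"] + record["completion_tokens"]
    return dict(totals)


def token_spikes(current, baseline, ratio=3.0, min_tokens=10_000):
    """Return (key, baseline_total, current_total) for keys that spiked.

    A key spikes when its current total is at least min_tokens and more than
    `ratio` times its baseline. Keys with no baseline are compared against 1.
    """
    spikes = []
    for key, total in current.items():
        base = baseline.get(key, 0)
        if total >= min_tokens and total > ratio * max(base, 1):
            spikes.append((key, base, total))
    return sorted(spikes, key=lambda spike: spike[2], reverse=True)
```

The floor keeps a tenant going from 10 tokens to 100 from paging anyone.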
Where Langfuse, Helicone, and Arize fit
Langfuse is the one you reach for when you want application-level tracing without pretending the model is a black box. It captures prompts, generations, scores, metadata, and evaluations, and it does a decent job of tying together multi-step traces across retrieval, tools, and model calls. If you need to reconstruct “what happened” after a bad answer or a prompt injection, this is the kind of trace you want.
Helicone is strongest when you care about proxy-style request logging, cost visibility, and simple operational controls across providers. It sits closer to the traffic path, which makes it useful for capturing request/response metadata consistently across OpenAI, Anthropic, and others. If your main pain is “we have no idea which prompt template is burning money,” Helicone tends to surface that faster than the fancier dashboards.
Arize, especially through its Phoenix stack, is better when you want evaluation, tracing, and anomaly detection tied to model quality and retrieval behavior. It is useful for spotting embedding drift, retrieval failures, and output quality regressions that do not show up as obvious errors. In other words, it helps when the system is technically healthy and still wrong, which is the most annoying kind of wrong.
The practical split is simple: Langfuse for traceability, Helicone for traffic and cost hygiene, Arize for evaluation and quality signals. You can use more than one. In fact, you probably should if you enjoy finding out about incidents before your users do.
What each tool surfaces best in production
Langfuse is good at showing the full path from prompt to output, including nested spans for tools and retrieval. That matters when a single user request fans out into multiple model calls, because the failure often happens in the middle, not at the first prompt. It also supports score tracking, which is useful if you have human review or automated checks feeding back into the trace.
Helicone is particularly useful for per-request observability at scale: latency, token counts, cost estimates, and provider-level metadata. If you are routing between OpenAI and Anthropic, or you are testing fallback behavior, Helicone makes it easier to see when one provider starts behaving differently. That is not glamorous, but neither is a surprise bill.
Arize is the one that helps when you need to compare model outputs against expected labels, run evaluations, and inspect retrieval quality. If your RAG pipeline is returning the wrong chunks and the model is confidently building nonsense on top, you need this layer. The old incident response truth applies here too: the thing you can measure is the thing you can fix.
The alerting rules worth keeping
Start with alerts on refusal spikes, token spikes, latency outliers, and tool-call anomalies. Add a separate alert for prompt-template changes that materially alter output length or safety-filter hits. If you have a moderation layer, alert when it goes quiet as well as when it goes loud. A dead detector is just a very expensive decoration.
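The quiet-detector case is easy to forget, so here is a rough sketch that checks both directions. It assumes you know an expected flag rate for your moderation layer; the function name and the factor defaults are hypothetical.

```python
def moderation_status(request_count, flagged_count, expected_rate,
                      quiet_factor=0.1, loud_factor=3.0, min_requests=500):
    """Classify a window of moderation results as quiet, loud, or ok.

    "quiet" means far fewer flags than expected, which can indicate a broken
    or bypassed detector. Windows with too little traffic return
    "insufficient_data" rather than a guess.
    """
    if request_count < min_requests:
        return "insufficient_data"
    observed_rate = flagged_count / request_count
    if observed_rate < expected_rate * quiet_factor:
        return "quiet"
    if observed_rate > expected_rate * loud_factor:
        return "loud"
    return "ok"
```

Page on both "quiet" and "loud"; only one of them usually has an alert today.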
Do not alert on every low-confidence response. That way lies alert fatigue and a dashboard nobody trusts. Alert on statistically meaningful shifts over a rolling window, then keep the raw traces so you can inspect the outliers. Security work has never been improved by more noise.
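One way to define a statistically meaningful shift is a two-proportion z-test between a baseline window and the most recent window. This is a sketch of that idea rather than the only reasonable test; the class name, window size, and threshold are assumptions, and it recomputes sums on every call, which is fine for a sketch but not for high-volume paths.

```python
import math
from collections import deque


def two_proportion_z(base_hits, base_n, cur_hits, cur_n):
    """z-score for the current rate versus the baseline rate (pooled variance)."""
    if base_n == 0 or cur_n == 0:
        return 0.0
    pooled = (base_hits + cur_hits) / (base_n + cur_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    if std_err == 0:
        return 0.0
    return (cur_hits / cur_n - base_hits / base_n) / std_err


class RollingShiftDetector:
    """Compares the latest `window` events with the `window` events before them."""

    def __init__(self, window=1000, z_threshold=3.0):
        self.baseline = deque(maxlen=window)
        self.current = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, hit):
        """Record one event (True = refusal, flag, etc.). Return True on a shift."""
        if len(self.current) == self.current.maxlen:
            # The oldest current event rolls over into the baseline window.
            self.baseline.append(self.current[0])
        self.current.append(bool(hit))
        if len(self.baseline) < self.baseline.maxlen:
            return False  # not enough history yet
        z = two_proportion_z(sum(self.baseline), len(self.baseline),
                             sum(self.current), len(self.current))
        return z >= self.z_threshold
```

This only fires on increases; use the absolute value of the z-score if a sudden drop matters just as much.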
And yes, you should sample aggressively. Keep everything security-relevant that you need for forensics and sample the routine traffic, but do not confuse “more data” with “better visibility.” If your traces are too noisy to use during an incident, you have built a museum, not a control plane.
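One reading of that advice, sketched below: always keep traces that carry a safety event or an error, and sample everything else deterministically by trace ID, so every span of a sampled trace is kept together. The function name and the 5% default are assumptions.

```python
import hashlib


def keep_trace(trace_id, has_safety_event, has_error, sample_rate=0.05):
    """Decide whether to retain a trace.

    Forensically interesting traces are always kept. The rest are sampled by
    hashing the trace ID, so the decision is stable across services and retries.
    """
    if has_safety_event or has_error:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Hashing instead of calling a random number generator means two services looking at the same trace ID reach the same decision without coordinating.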
The Bottom Line
Log prompts, responses, model metadata, token counts, refusals, tool calls, and safety events from day one. If you cannot reconstruct a bad answer end to end, you are not observing the system; you are just collecting anecdotes.
Use Langfuse for traceability, Helicone for request and cost visibility, and Arize for evaluation and drift detection. Base alert thresholds on real behavioral shifts, not vanity metrics, or you will train yourself to ignore the one signal that mattered.
References
- Cloudflare on the HTTP/2 Rapid Reset attacks: https://blog.cloudflare.com/technical-breakdown-http2-rapid-reset-ddos-attack/
- CVE-2023-44487 NVD entry: https://nvd.nist.gov/vuln/detail/CVE-2023-44487
- Langfuse docs: https://langfuse.com/docs
- Helicone docs: https://docs.helicone.ai/
- Arize Phoenix docs: https://docs.arize.com/phoenix