Observability for LLM apps: tracing, costs, latency, quality drift
LLM applications fail in unique ways that traditional observability misses. This article covers the patterns for tracing multi-step flows, tracking costs that vary by 100x or more per call, monitoring quality drift, and debugging hallucinations at production scale.
LLM applications break differently from traditional software. A regular bug is a stack trace; an LLM "bug" is a quality drift you only notice when users complain. A regular latency issue is a slow endpoint; an LLM latency issue is a 30-second reasoning trace where the user is staring at a spinner.
Traditional observability (Datadog, New Relic, Sentry) tells you the API call succeeded in 8.4 seconds and used 12,847 input tokens. It doesn't tell you whether the response was good, whether the model hallucinated, whether the wrong tool was called, or whether quality has drifted since last week.
Production LLM systems need a different observability stack. Or at least, additional layers on top of traditional ones. This article covers what's distinct, what to instrument, and the patterns that work.
What's different about LLM observability
A few characteristic features of LLM systems that traditional observability doesn't address:
Non-determinism at the unit level. The same input produces different outputs across calls. "Did this work right?" can't be answered by checking status codes.
Quality as a primary metric. Latency and cost matter, but quality matters most — and is the hardest to measure.
Multi-step traces. A user query might trigger 5-50 LLM calls (agent loops, RAG retrieval, structured extraction, reflection). Each call is part of a larger trace.
Token-level cost variability. A single call can range from €0.001 to €1.00 depending on prompt size, output length, model. Aggregate costs need per-call attribution.
Drift over time. Models update. Prompts evolve. Input distributions shift. Quality moves; you need to see that movement.
Sensitive payloads. The inputs and outputs are often the most valuable diagnostic data but also the most sensitive. Logging discipline matters.
Long async flows. Agent runs that take minutes. Background batch jobs. Streaming responses. Traditional request/response observability doesn't fit.
These aren't theoretical concerns. Every production team running LLM systems hits them.
The observability stack
A complete LLM observability stack has these layers:
1. Call-level instrumentation. Every LLM call is logged: input, output, latency, cost, model, status.
2. Trace-level instrumentation. Multi-call workflows are stitched into traces. You can see the full call chain for a user request.
3. Application-level metrics. Per-feature, per-user, per-tenant aggregations.
4. Quality monitoring. Sampled or full automated quality assessment.
5. User feedback capture. Explicit (thumbs up/down) and implicit (regenerate, abandon) signals.
6. Alerting. Real-time alerts on cost spikes, latency degradation, quality drops, error rate increases.
7. Debugging tools. When something breaks, you can find the trace, see the inputs and outputs, understand what happened.
We'll go through each.
Call-level instrumentation
Every LLM call should produce a log record with:
{
  "call_id": "uuid",
  "timestamp": "2026-05-15T14:23:45Z",
  "trace_id": "uuid",        // for grouping into traces
  "span_id": "uuid",         // for parent-child relations
  "feature": "summarize_document",
  "prompt_version": "v3.2",
  "model": "claude-4-sonnet",
  "provider": "anthropic",
  "input_messages": [...],
  "output_message": "...",
  "input_tokens": 1842,
  "output_tokens": 384,
  "total_tokens": 2226,
  "cost_usd": 0.0084,
  "latency_ms": 2340,
  "first_token_ms": 1240,    // streaming
  "status": "success",
  "error": null,
  "user_id": "user_123",
  "tenant_id": "tenant_45",
  "metadata": {
    "session_id": "...",
    "experiment_arm": "v3_test"
  }
}
This is the bare minimum. Capture everything.
Key implementation choices:
Where to log. Options:
- A dedicated observability tool (Helicone, LangSmith, Phoenix, Braintrust, Arize).
- A general observability platform with LLM extensions (Datadog LLM Observability, Sentry).
- Your own logs/database.
For most teams: pick a dedicated tool. They have purpose-built UIs for inspecting LLM calls. The learning curve is mild compared to building your own.
For larger teams: a mix. Use a dedicated tool for the LLM-specific UI, but also pipe data to your central observability platform for cross-system correlation.
How to instrument. Options:
- A proxy that sits between your app and the LLM provider (Helicone's model).
- A wrapping SDK in your application code.
- A wrapper class that you call manually for each LLM invocation.
Proxies are easiest but add latency. SDKs are clean but require integration. Manual wrapping is most flexible but easiest to forget.
A pragmatic approach: SDK wrapping at the boundary where your app calls the LLM. One place to instrument; everything else flows through.
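As a rough illustration of wrapping at the boundary, the sketch below times a call, prices it, and emits one record whether the call succeeds or fails. Everything named here is an assumption for illustration: `instrumented_call`, `compute_cost_usd`, the `PRICES_PER_MTOK` table and its prices, and the shape of the dict returned by `call_fn`. Adapt it to whatever your provider SDK actually returns.

```python
import time
import uuid
from typing import Any, Callable

# Hypothetical USD prices per million tokens; substitute your provider's real rates.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def compute_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Price a call from token counts using the table above."""
    price = PRICES_PER_MTOK[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def instrumented_call(
    call_fn: Callable[[str, list], dict],
    sink: Callable[[dict], None],
    *,
    model: str,
    feature: str,
    prompt_version: str,
    messages: list,
    trace_id: str,
) -> dict:
    """Run call_fn(model, messages) and emit one log record, even on failure.

    call_fn is assumed to return a dict with "text", "input_tokens" and
    "output_tokens"; this shape is an assumption, not a real SDK contract.
    """
    record: dict[str, Any] = {
        "call_id": str(uuid.uuid4()),
        "trace_id": trace_id,
        "feature": feature,
        "prompt_version": prompt_version,
        "model": model,
        "input_messages": messages,
        "status": "success",
        "error": None,
    }
    start = time.perf_counter()
    try:
        result = call_fn(model, messages)
        record.update(
            output_message=result["text"],
            input_tokens=result["input_tokens"],
            output_tokens=result["output_tokens"],
            cost_usd=compute_cost_usd(model, result["input_tokens"], result["output_tokens"]),
        )
        return result
    except Exception as exc:
        # Record only the exception type: messages can contain credentials or payloads.
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        # Runs on both paths, so failed calls are logged with their latency too.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000)
        sink(record)
```

The `sink` can be anything that accepts a dict: a list in tests, a queue, or a client for your observability tool. Keeping it as a parameter means the wrapper stays the single place to instrument.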
What to log. Some practical considerations:
- Truncate very long inputs/outputs (but log truncation).
- Hash or redact PII (with the original retrievable via secure lookup if needed).
- Don't log credentials, even in error paths.
- For streaming, log both first-token-latency and total-latency.
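The first point, truncating while recording that truncation happened, can be sketched in a few lines. The helper name `truncate_for_log` and the 2,000-character default are assumptions; pick a limit that fits your storage budget.

```python
def truncate_for_log(text: str, max_chars: int = 2000) -> dict:
    """Return the text to store plus enough metadata to know what was cut."""
    return {
        "text": text[:max_chars],
        "truncated": len(text) > max_chars,
        # Keeping the original length lets you spot oversized prompts later.
        "original_chars": len(text),
    }
```

Storing `original_chars` alongside the flag means a dashboard can still chart prompt size even when the payload itself was cut.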
Trace-level instrumentation
A single user action often involves many LLM calls. Without trace-level instrumentation, you have a thousand call logs and no way to know which calls were part of which user action.
Implementation:
Trace ID generation. Generate a unique trace ID at the start of a user request. Pass it through all subsequent calls.
Parent-child spans. Within a trace, each call has a span ID and (optionally) a parent span ID. This creates a tree showing the call hierarchy.
Operation naming. Each span is named ("summarize_document", "extract_entities", "tool_call:search"). The trace shows the full operation chain.
A trace view in your UI:
Trace abc-123 (12.3s total)
├─ classify_intent (450ms) [gpt-5-mini]
├─ retrieve_documents (1.2s) [embedding + search]
├─ generate_response (8.5s) [claude-4-sonnet]
│ ├─ tool_call: search_internal (320ms)
│ ├─ tool_call: lookup_customer (180ms)
│ └─ generate_final_text (7.5s)
└─ judge_response_quality (2.1s) [claude-4-haiku]
Now you can see what your system actually did for this user request. You can find slow calls, expensive calls, failed calls — in context.
Tools that handle this well: LangSmith, Phoenix (Arize), Helicone with custom integration. Plus general observability (Datadog, OpenTelemetry) for cross-system tracing.
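If you are wiring this up yourself rather than using one of those tools, trace IDs and parent-child spans can be propagated with Python's standard `contextvars` module. This is a minimal sketch, not how any particular tool implements it: the `span` helper and the in-memory `SPANS` list are assumptions for illustration.

```python
import contextvars
import time
import uuid
from contextlib import contextmanager

_trace_id = contextvars.ContextVar("trace_id", default=None)
_span_id = contextvars.ContextVar("span_id", default=None)

SPANS: list[dict] = []  # in-memory sink for illustration only


@contextmanager
def span(name: str):
    """Open a named span; the outermost span starts a new trace."""
    trace_id = _trace_id.get() or str(uuid.uuid4())
    parent_span_id = _span_id.get()  # None for the root span
    span_id = str(uuid.uuid4())
    trace_token = _trace_id.set(trace_id)
    span_token = _span_id.set(span_id)
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_span_id": parent_span_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })
        # Restore the enclosing span so siblings get the right parent.
        _span_id.reset(span_token)
        _trace_id.reset(trace_token)


# Usage: nested spans share one trace ID and form a tree.
with span("handle_request"):
    with span("classify_intent"):
        pass
    with span("generate_response"):
        with span("tool_call:search_internal"):
            pass
```

Context variables follow asyncio tasks automatically, but a new thread starts with an empty context, so thread pools need the IDs passed in explicitly.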
Application-level metrics
Beyond individual calls, aggregate metrics:
Per-feature.
- Call volume.
- Average latency.
- p50, p95, p99 latency.
- Average cost per request.
- Error rate.
- Quality score (if measured).
Per-user/per-tenant.
- Calls per user per day.
- Cost per user.
- Heavy users / abuse patterns.
Per-model.
- Volume by model.
- Cost share by model.
- Error rate by model.
- Quality (where measured) by model.
Per-feature × per-model.
- Which features use which models?
- Where could we route to cheaper models?
These dashboards drive operational decisions: which features are expensive, which are slow, which need optimization.
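The per-feature rollup is simple enough to sketch directly from call records. The code below assumes each record is a dict with `feature`, `latency_ms`, `cost_usd` and `status` keys, and uses the nearest-rank definition of a percentile; a real metrics backend may interpolate instead, so expect small differences.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def per_feature_metrics(records: list) -> dict:
    """Aggregate call records into the per-feature numbers a dashboard would show."""
    by_feature = defaultdict(list)
    for record in records:
        by_feature[record["feature"]].append(record)

    metrics = {}
    for feature, rows in by_feature.items():
        latencies = [r["latency_ms"] for r in rows]
        metrics[feature] = {
            "calls": len(rows),
            "avg_latency_ms": sum(latencies) / len(rows),
            "p50_latency_ms": percentile(latencies, 50),
            "p95_latency_ms": percentile(latencies, 95),
            "p99_latency_ms": percentile(latencies, 99),
            "avg_cost_usd": sum(r["cost_usd"] for r in rows) / len(rows),
            "error_rate": sum(r["status"] != "success" for r in rows) / len(rows),
        }
    return metrics
```

Swapping the grouping key from `feature` to `model`, `user_id`, or a `(feature, model)` tuple gives the other breakdowns listed above.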
Quality monitoring
The hardest layer: automated quality assessment.
Offline evals (which we've covered separately) run against a defined dataset. Online quality monitoring assesses production traffic instead.
Approaches:
LLM-as-judge on a sample. Sample, say, 1% of production traffic. For each, run a judge LLM that scores the response on relevant dimensions. Track scores over time. Alert on drops.
Implicit signals. Track regenerations, abandonments, error rates, time-to-completion, follow-up message frequency. These are weak signals but cheap. Use as leading indicators.
Explicit user feedback. Thumbs up/down, "did this help?" buttons, explicit reports. Highest signal but lowest volume.
Pattern detection. Specific bad patterns ("I cannot help with that", "I'm just an AI", repeated refusals) flagged automatically. Catches some regressions immediately.
A typical setup:
- Sample 1% of production calls.
- Run an LLM judge on each, scoring multi-dimensionally.
- Aggregate to per-feature daily scores.
- Alert if any feature drops >10% week-over-week.
The cost is real but bounded (1% of traffic × small judge model = manageable for most teams).
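Two pieces of that setup are mechanical enough to sketch: choosing the 1% sample and checking the week-over-week drop. The sketch assumes sampling by hashing the call ID, so the same call is always in or out of the sample regardless of which process decides. The judge itself is out of scope here, and the function names are hypothetical.

```python
import hashlib


def in_sample(call_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of calls by hashing the call ID."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over [0, 2**64)
    return bucket < int(rate * 2**64)


def weekly_drop(this_week: list, last_week: list) -> float:
    """Relative drop in mean judge score; positive means quality fell."""
    current = sum(this_week) / len(this_week)
    previous = sum(last_week) / len(last_week)
    return (previous - current) / previous


def features_to_alert(scores: dict, threshold: float = 0.10) -> list:
    """scores maps feature -> (this_week_daily_scores, last_week_daily_scores)."""
    return sorted(
        feature
        for feature, (this_week, last_week) in scores.items()
        if weekly_drop(this_week, last_week) > threshold
    )
```

Hash-based sampling also makes it easy to re-judge exactly the same calls after changing the judge prompt, which random sampling does not.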
Cost observability
LLM costs can spiral. A single bug — a runaway retry loop, a feature that uses 10x more tokens than expected — can produce a 10x bill before anyone notices.
Cost observability layers:
Per-call cost. Every call's cost computed at log time. Aggregations available immediately.
Budget alerts. Daily, weekly, monthly budgets per feature/tenant. Alert when crossing thresholds (50%, 75%, 90%, 100%).
Anomaly detection. Daily cost is 5x typical? Alert. Single-call cost is 100x typical? Alert.
Cost attribution. Per-feature, per-tenant, per-user costs. Find the heavy hitters.
Forecast. Based on current trajectory, what will the bill be at end-of-month?
A useful dashboard view: a single panel showing today's cost vs the rest of this week vs last week, broken down by feature.
Practical tip: set hard limits where possible. A feature that should cost €X/day gets an automatic shutoff at 10 times that amount. Cost runaway happens fast; the limit catches it.
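A minimal sketch of the budget thresholds, the hard limit, and the month-end forecast follows. The `FeatureBudget` class and `forecast_month_end` are hypothetical names, the forecast is a plain linear extrapolation, and the amounts are in whatever currency you bill in.

```python
from dataclasses import dataclass, field

ALERT_THRESHOLDS = (0.5, 0.75, 0.9, 1.0)  # fractions of the daily budget


@dataclass
class FeatureBudget:
    daily_budget: float
    hard_limit_multiple: float = 10.0  # shut off at 10x the expected daily cost
    spent: float = 0.0
    alerted: set = field(default_factory=set)

    def record(self, cost: float) -> list:
        """Add a call's cost and return any thresholds crossed for the first time."""
        self.spent += cost
        newly_crossed = [
            t for t in ALERT_THRESHOLDS
            if self.spent >= t * self.daily_budget and t not in self.alerted
        ]
        self.alerted.update(newly_crossed)
        return newly_crossed

    @property
    def shut_off(self) -> bool:
        """True once spend reaches the hard limit; callers should stop issuing calls."""
        return self.spent >= self.hard_limit_multiple * self.daily_budget


def forecast_month_end(spent_so_far: float, days_elapsed: int, days_in_month: int) -> float:
    """Linear extrapolation of month-to-date spend to the end of the month."""
    return spent_so_far / days_elapsed * days_in_month
```

Checking `shut_off` before each call, rather than in a nightly job, is what makes the limit useful against a runaway loop.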
Latency observability
LLM latency is more nuanced than typical APIs:
Total latency. From request to final response.
Time-to-first-token (TTFT). For streaming, when does the user see the first character? This dominates perceived latency for chat UX.
Time-to-last-token (TTLT). How long until the response is complete?
Tokens-per-second. Output rate. Some models stream slower than others.
Tool call latency. For agent flows, time spent in tool calls vs LLM calls.
Track all of these. Different optimization strategies target different metrics.
For user-facing chat: TTFT is what matters. A slow first token feels broken; a slow tokens-per-second feels gradual.
For batch: total latency matters; throughput matters more.
For agents: tool call latency often dominates; optimizing the LLM doesn't help if tools are slow.
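For streaming responses, these metrics can all be taken in one pass over the stream. The sketch below assumes the stream is any iterable of text chunks and counts tokens by whitespace splitting, which is only a rough proxy; use your provider's reported token counts if you have them. It measures tokens-per-second from the first token onward, which is one common convention but not the only one.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> dict:
    """Consume a stream of text chunks and report TTFT, TTLT and output rate."""
    start = clock()
    first_token_at = None
    tokens = 0
    parts = []
    for chunk in chunks:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # the moment the user first sees output
        tokens += count_tokens(chunk)
        parts.append(chunk)
    end = clock()

    generating = first_token_at is not None and end > first_token_at
    return {
        "text": "".join(parts),
        "output_tokens": tokens,
        "ttft_ms": None if first_token_at is None else (first_token_at - start) * 1000,
        "ttlt_ms": (end - start) * 1000,
        # Output rate excludes the wait before the first token.
        "tokens_per_second": tokens / (end - first_token_at) if generating else None,
    }
```

The `clock` parameter exists so the function can be tested with fixed timestamps; in production the default is fine.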
Error observability
LLM-specific errors:
API errors. Rate limits, auth failures, server errors. Same as any API.
Validation errors. Structured output didn't conform to schema. Track frequency by feature.
Content filter errors. Provider blocked the request. Track to detect prompt issues.
Tool errors. Specific tools failing. Track per tool.
Quality errors. Judge LLM scored output as bad. Track over time.
Hallucination signals. Detection of likely hallucinations (model claimed something not in the source). Hard to detect automatically but can be approximated.
Cost errors. Calls that cost much more than expected. Often indicate a bug.
Each gets its own dashboard. Each can trigger alerts.
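Validation errors are the easiest of these to count mechanically. As a sketch, assuming the structured output is expected to be a JSON object with a known set of required keys, a parser can tally failures by feature and failure kind. The names `parse_structured` and `validation_errors` are hypothetical, and a real system would more likely validate against a full schema.

```python
import json
from collections import Counter
from typing import Optional

# (feature, kind) -> count; in production this would be a metrics counter.
validation_errors: Counter = Counter()


def parse_structured(feature: str, raw: str, required_keys: set) -> Optional[dict]:
    """Parse model output as a JSON object, counting each kind of failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        validation_errors[(feature, "invalid_json")] += 1
        return None
    if not isinstance(data, dict):
        validation_errors[(feature, "not_an_object")] += 1
        return None
    if required_keys - set(data):
        validation_errors[(feature, "missing_keys")] += 1
        return None
    return data
```

Splitting the count by failure kind matters: a spike in `invalid_json` usually points at truncated output, while `missing_keys` more often points at a prompt change.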
Debugging tools
When something breaks, you need to find it and understand it. The debugging surface includes:
Trace search. Find a specific user's request by trace ID, user ID, or timestamp.
Call inspector. For any call, see the full request, response, parameters, latency, cost.
Trace timeline. For complex flows, see the chain of calls visually.
Replay capability. Given an old call, can you re-run it with a different prompt/model and see what would have happened? Critical for testing fixes.
Diff view. Compare two calls — different versions of the same prompt, different models — side by side.
Search by content. Search the corpus of past calls for specific patterns ("show me calls where the model said 'I cannot help'").
These capabilities are what dedicated LLM observability tools provide. Building them yourself is significant work; using a tool is usually cheaper.
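To give a sense of scale for the simplest two, content search and diff view can be approximated over stored call records with the standard library alone. This sketch assumes records are dicts with `call_id` and `output_message` keys held in memory; it says nothing about how the dedicated tools implement these features, and it will not scale to a large corpus without an index.

```python
import difflib
import re


def search_calls(records: list, pattern: str) -> list:
    """Return records whose output matches a case-insensitive regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [r for r in records if rx.search(r["output_message"])]


def diff_outputs(a: dict, b: dict) -> list:
    """Line-level unified diff of two calls' outputs, labelled by call ID."""
    return list(difflib.unified_diff(
        a["output_message"].splitlines(),
        b["output_message"].splitlines(),
        fromfile=a["call_id"],
        tofile=b["call_id"],
        lineterm="",
    ))
```

Replay is the harder capability: it needs the full original request stored, including parameters and tool definitions, which is another reason to log complete inputs.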
Privacy and PII
LLM observability logs are sensitive. Inputs may contain PII; outputs may quote PII. Sometimes you can't avoid logging them — they're needed for debugging.
Practices:
Tokenization/hashing. Replace identifiers with tokens. Original retrievable via separate secure lookup. The bulk of logs are non-PII.
Redaction at log time. Detect and redact PII before it lands in observability storage. Specific patterns (emails, phone numbers, SSNs) replaced with placeholders.
Tenant isolation. Multi-tenant observability data isolated per tenant. One tenant's data not visible to another.
Access controls. Restrict who can view raw inputs/outputs, and log every access.
Retention policies. Logs older than X days are deleted or moved to cold storage. PII retention has legal limits in many jurisdictions.
Right-to-be-forgotten. When a user requests deletion (GDPR), their logs must be findable and deletable.
These are not optional for any system handling PII. Get them right early; retrofitting is painful.
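The first two practices can be sketched with the standard library. The regular expressions below are deliberately simple assumptions (an email pattern, a US-style SSN pattern, and a loose phone pattern); they will miss cases and produce false positives, so treat them as a starting point rather than a complete PII detector. The keyed hash uses HMAC so that tokens cannot be recomputed by someone who lacks the key.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")        # US-style only
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")        # loose; expect false positives


def redact(text: str) -> str:
    """Replace common PII patterns with placeholders before the text is logged."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)  # before phones, which would also match SSNs
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def tokenize_identifier(identifier: str, secret_key: bytes) -> str:
    """Stable pseudonym for an identifier; the mapping back lives in a separate store."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]
```

Because the token is stable for a given key, logs can still be grouped per user, and a deletion request can be served by tokenizing the user's ID and deleting matching rows.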
Alerting
Thresholds and signals that warrant alerts:
Cost.
- Daily cost > 2x typical.
- Single call cost > €5.
- Hourly cost spike > 5x.
Latency.
- p95 latency > 2x baseline.
- TTFT > 5s for chat UX.
- Tool call timeouts increasing.
Error rate.
- Error rate > 1% (typical baseline is 0.1-0.5%).
- Specific error type spike (validation errors, rate limits).
Quality.
- Quality score dropped > 10% week-over-week on any feature.
- User feedback negative rate > baseline.
- Regeneration rate > baseline.
Pattern.
- Specific bad phrases appearing more often.
- Sudden change in input distribution.
Each alert should be specific and actionable. "Cost is high" is unhelpful; "feature X cost 10x its budget in the last 30 minutes — likely runaway in customer Y's session" is actionable.
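A sketch of turning a few of the thresholds above into specific messages: the function compares a feature's current metrics to its baseline and names the feature, the observed value, and the reference value in each alert. The metric key names and `evaluate_alerts` itself are assumptions for illustration; routing the messages to a pager or chat channel is left out.

```python
def evaluate_alerts(feature: str, current: dict, baseline: dict) -> list:
    """Return one specific message per breached threshold for a feature."""
    alerts = []

    if current["daily_cost"] > 2 * baseline["daily_cost"]:
        ratio = current["daily_cost"] / baseline["daily_cost"]
        alerts.append(
            f"{feature}: daily cost {current['daily_cost']:.2f} is "
            f"{ratio:.1f}x the typical {baseline['daily_cost']:.2f}"
        )

    if current["p95_latency_ms"] > 2 * baseline["p95_latency_ms"]:
        alerts.append(
            f"{feature}: p95 latency {current['p95_latency_ms']:.0f}ms exceeds "
            f"2x baseline ({baseline['p95_latency_ms']:.0f}ms)"
        )

    if current["error_rate"] > 0.01:
        alerts.append(f"{feature}: error rate {current['error_rate']:.1%} is above 1%")

    # Week-over-week quality drop, relative to last week's score.
    drop = (baseline["quality_score"] - current["quality_score"]) / baseline["quality_score"]
    if drop > 0.10:
        alerts.append(f"{feature}: quality score down {drop:.0%} week-over-week")

    return alerts
```

Adding the tenant or session with the largest share of the excess to each message is the next step toward the "likely runaway in customer Y's session" level of detail.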
Multi-tenant considerations
For B2B SaaS apps:
Per-tenant metrics. Each customer can see their own usage, costs, and quality.
Per-tenant alerts. Specific to their thresholds.
Per-tenant debugging. Support can see a customer's traces (with appropriate access controls).
Per-tenant configuration. Some customers may have different models, prompts, or policies. The observability layer reflects this.
This adds complexity but is essential for B2B at scale. Customer support can't help debug "the AI isn't working for me" without per-tenant trace access.
Tool ecosystem (2026)
A landscape view of LLM observability tools as of writing:
Dedicated LLM observability:
- Helicone. Proxy-based; simple to integrate; strong dashboards.
- LangSmith. Tied to LangChain ecosystem; deep tracing.
- Phoenix (Arize). Open-source-friendly; strong on quality monitoring.
- Braintrust. Strong on evals + observability combined.
- PromptLayer. Prompt-focused; strong on version tracking.
- Weights & Biases Weave. ML-team-friendly; integrates with W&B's broader suite.
General APM with LLM extensions:
- Datadog LLM Observability. Enterprise-grade; expensive.
- New Relic LLM Observability. Similar.
- OpenTelemetry + your APM of choice. OTel has GenAI semantic conventions; instrument once, view in many tools.
Build-your-own:
- A Postgres table with one row per call gets most teams a long way.
- Add a simple UI to search and display.
- Integrate with your existing logging infrastructure.
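To make the one-row-per-call idea concrete, here is a sketch that uses Python's built-in `sqlite3` as a stand-in for Postgres so it runs anywhere. The table name, column set, and sample rows are assumptions; in Postgres you would likely use `timestamptz` and `numeric` types and add columns for the full payloads.

```python
import sqlite3

SCHEMA = """
CREATE TABLE llm_calls (
    call_id       TEXT PRIMARY KEY,
    trace_id      TEXT NOT NULL,
    ts            TEXT NOT NULL,
    feature       TEXT NOT NULL,
    model         TEXT NOT NULL,
    input_tokens  INTEGER,
    output_tokens INTEGER,
    cost_usd      REAL,
    latency_ms    INTEGER,
    status        TEXT NOT NULL
);
CREATE INDEX idx_llm_calls_trace ON llm_calls (trace_id);
CREATE INDEX idx_llm_calls_feature_ts ON llm_calls (feature, ts);
"""

# Illustrative rows only; not real measurements.
ROWS = [
    ("c1", "t1", "2026-05-15T14:23:45Z", "summarize_document", "example-model", 1842, 384, 0.0084, 2340, "success"),
    ("c2", "t1", "2026-05-15T14:23:48Z", "summarize_document", "example-model", 2100, 410, 0.0100, 2610, "success"),
    ("c3", "t2", "2026-05-15T14:24:02Z", "classify_intent", "example-model", 120, 8, 0.0002, 450, "error"),
]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO llm_calls VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", ROWS)

# Per-feature volume, cost and error count: the core of a cost dashboard.
cost_by_feature = conn.execute(
    """
    SELECT feature,
           COUNT(*) AS calls,
           SUM(cost_usd) AS cost,
           SUM(CASE WHEN status <> 'success' THEN 1 ELSE 0 END) AS errors
    FROM llm_calls
    GROUP BY feature
    ORDER BY cost DESC
    """
).fetchall()
```

The two indexes cover the most common lookups: every call in a trace, and a feature's calls over a time range.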
The right choice depends on team size, scale, budget, and existing tooling. Most teams start with a dedicated tool and migrate to a more comprehensive setup as they scale.
A practical setup
For a typical mid-sized team starting from zero:
Week 1: Pick a tool (Helicone is a common easy start). Integrate with your main LLM call path. Verify calls are being logged.
Week 2: Set up basic dashboards. Per-feature cost, latency, error rate.
Week 3: Set up alerts on cost spikes and error rate increases.
Week 4: Add trace-level instrumentation. Connect spans across multi-call flows.
Month 2: Implement quality sampling. Pick a few features. Set up LLM-as-judge scoring on 1% of traffic.
Month 3: Add user feedback capture. Wire up thumbs up/down or similar.
Month 4: Multi-tenant support, fine-grained alerting, debugging tooling.
This is a working observability stack. Each layer adds capability. Each is worth building. Skipping any of them creates a blind spot.
What goes wrong without it
A short catalog of incidents we've seen at teams without proper LLM observability:
- A bug caused a feature to retry calls in a tight loop. Five-figure unexpected charges accumulated over a weekend. Caught only when the monthly bill arrived. (We've seen variants of this story enough times that the specific numbers matter less than the pattern.)
- A model update silently changed behavior. Quality dropped on a key feature. Users complained but engineering thought it was "occasional weirdness." It took two months to correlate the drop with the model change.
- A new prompt version was deployed. It accidentally regressed an important user flow. Without per-flow metrics, nobody noticed for weeks.
- An agent system started looping. Some user sessions had 200+ LLM calls. Took hours of investigation to find the loop, because nobody had trace-level instrumentation.
- A tool authentication failure cascaded into agent confusion. The agent kept "trying things" and racking up costs. Without alerts, this ran for hours.
- A prompt injection caused the AI to leak system instructions. PII was disclosed. Without observability, the team couldn't easily identify which users had been affected.
These aren't theoretical. They happen. Observability prevents most of them and catches the rest quickly.
The takeaway
LLM observability is its own discipline. Traditional APM is necessary but insufficient. The specific layers — call-level, trace-level, quality, cost, latency, error, debugging — are all needed.
Tools exist. Pick one. Integrate early. Invest in the dashboards, alerts, and processes that catch issues before users do.
The teams that do this:
- Catch bugs in hours rather than weeks.
- Manage cost predictably rather than getting surprise bills.
- Maintain quality over time rather than drifting.
- Debug systematically rather than guessing.
The teams that don't eventually have an incident that forces them to. Better to instrument before the incident than after.