If you are running large language models in production without structured telemetry, you are flying blind. Implementing LLMOps for telemetry and observability means instrumenting every inference call to capture inputs, outputs, latency, token usage, cost, and quality signals, then routing that data into pipelines where you can trace, evaluate, alert, and debug. The practical path is: define the spans and metrics you need, adopt a tracing standard like OpenTelemetry, wire in evaluation scoring, and close the loop with dashboards and alerts. This post walks through how to do that concretely.
Why LLM Observability Is Different
Traditional application observability assumes deterministic behavior. A function called with the same arguments returns the same result, and a 200 response means success. Large language models break both assumptions. The same prompt can produce different outputs across calls, and a syntactically valid response can still be wrong, toxic, or hallucinated.
That means you need to observe things that classic APM tools never tracked:
- Non-deterministic outputs. You cannot assert exact-match correctness, so you need quality scoring instead of pass/fail status codes.
- Token economics. Cost scales with input and output tokens, not request count. A single runaway prompt can cost more than a thousand normal ones.
- Multi-step chains. Retrieval, tool calls, and agent loops mean one user request can fan out into dozens of model and API calls. You need distributed tracing to see the whole path.
- Semantic failures. A response can be fast, cheap, and completely useless. Latency and HTTP status tell you nothing about whether the answer was good.
LLMOps is the discipline that brings these signals under control. It borrows from MLOps and DevOps but adds the evaluation and prompt-management concerns that are unique to generative systems.
The Telemetry Signals You Need to Capture
Before choosing tools, decide what to measure. I group LLM telemetry into four layers.
1. Operational signals
These are the familiar metrics, adapted for inference:
- Latency, broken down by time-to-first-token and total generation time. Streaming responses make these distinct and important.
- Throughput in requests and tokens per second.
- Error rates, including provider rate limits, timeouts, and content filter rejections.
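To make the time-to-first-token split concrete, here is a minimal sketch that drains a stream of text chunks and records both timings. The names `StreamTiming` and `consume_stream` are hypothetical, and the sketch assumes the stream is a plain iterable of strings. A real provider SDK yields richer chunk objects, and may emit an empty chunk first, so adapt the loop accordingly.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class StreamTiming:
    time_to_first_token_s: Optional[float]  # None when the stream was empty
    total_time_s: float
    chunk_count: int


def consume_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[str, StreamTiming]:
    """Drain a stream of text chunks and record when the first one arrived."""
    start = clock()
    first_at: Optional[float] = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = clock()  # the first chunk marks time-to-first-token
        parts.append(chunk)
    end = clock()
    ttft = None if first_at is None else first_at - start
    return "".join(parts), StreamTiming(ttft, end - start, len(parts))
```

The injectable `clock` is only there so the timing logic can be tested without sleeping. In production you would leave the default and write both durations to your metrics backend.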
2. Cost signals
- Prompt tokens, completion tokens, and total tokens per call.
- Cost per request, derived from token counts and the model's pricing.
- Cost per user, per feature, and per tenant so you can attribute spend.
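As a rough sketch of how cost per request and per-tenant attribution can be derived from token counts: the model names and prices below are invented placeholders, not real provider rates, and `Call`, `request_cost`, and `cost_by_tenant` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical prices in USD per 1M tokens. Substitute your provider's current rates.
PRICING = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


@dataclass
class Call:
    tenant: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(call: Call, pricing: Dict[str, Dict[str, float]] = PRICING) -> float:
    """Cost of one call: input and output tokens are priced separately."""
    rates = pricing[call.model]
    return (
        call.input_tokens * rates["input"] + call.output_tokens * rates["output"]
    ) / 1_000_000


def cost_by_tenant(calls: Iterable[Call]) -> Dict[str, float]:
    """Sum request costs per tenant so spend can be attributed."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[call.tenant] += request_cost(call)
    return dict(totals)
```

The same grouping works for users or features: swap the `tenant` field for whichever dimension you tag on the request.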
3. Quality signals
- Automated evaluation scores such as groundedness, relevance, and toxicity.
- Hallucination and refusal flags.
- User feedback like thumbs up/down or correction events.
4. Context signals
- Full prompt and response payloads (with PII handling, more on that below).
- Model name and version, temperature, and other parameters.
- Retrieval context for RAG systems: which documents were fetched and their relevance scores.
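For the retrieval context, one option is to flatten the fetched documents into primitive values and flat lists, which is the shape span attributes generally accept. This is a sketch only: the `retrieval.*` attribute names and the `RetrievedDoc` type are assumptions for illustration, not part of any published convention.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Union

AttributeValue = Union[str, int, float, bool, List[str], List[float]]


@dataclass
class RetrievedDoc:
    doc_id: str
    score: float
    text: str


def retrieval_attributes(docs: Sequence[RetrievedDoc], k: int) -> Dict[str, AttributeValue]:
    """Summarize a retrieval step as flat attributes (names are hypothetical)."""
    return {
        "retrieval.requested_k": k,
        "retrieval.returned_count": len(docs),
        "retrieval.document_ids": [d.doc_id for d in docs],
        "retrieval.scores": [round(d.score, 4) for d in docs],
    }
```

Note that the document text is deliberately left out. IDs and scores are usually enough to reconstruct what the model saw, and they avoid copying raw content into your trace store.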
Implementing LLMOps Telemetry with OpenTelemetry
You do not need a proprietary format. OpenTelemetry has emerging semantic conventions for generative AI (the gen_ai.* attributes), which let you emit traces that any compatible backend can read. They are still evolving, so check attribute names against the version you adopt. I recommend standardizing on these conventions so you are not locked into one vendor's SDK.
Here is a minimal Python example that wraps a model call in a span and records the relevant attributes.
from opentelemetry import trace
from opentelemetry.trace import SpanKind

tracer = trace.get_tracer("llm.app")


def call_model(client, prompt: str, model: str = "gpt-4o-mini", temperature: float = 0.2):
    # One CLIENT span per outbound model call.
    with tracer.start_as_current_span(
        "chat.completion", kind=SpanKind.CLIENT
    ) as span:
        # Request attributes: record what was asked for.
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)

        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )

        # Token usage drives cost, so record it on the span.
        usage = response.usage
        span.set_attribute("gen_ai.usage.input_tokens", usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.completion_tokens)
        # The GenAI conventions define finish reasons as an array, one per choice.
        span.set_attribute(
            "gen_ai.response.finish_reasons",
            [choice.finish_reason for choice in response.choices],
        )
        return response.choices[0].message.content
For multi-step chains, the key is span nesting. Each retrieval, tool call, and model invocation becomes a child span under a parent span that represents the user request, so the whole request forms a single trace. That gives you a waterfall view of exactly where time and cost were spent.
with tracer.start_as_current_span("rag.query") as parent:
    parent.set_attribute("user.id", user_id)
    with tracer.start_as_current_span("retrieval"):
        docs = retriever.search(question, k=5)
    context = "\n".join(d.text for d in docs)
    answer = call_model(client, build_prompt(question, context))
Because every span shares a trace ID, you can later answer questions like "for this slow request, was the retriever or the model the bottleneck?" without guessing.
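To illustrate the bottleneck question, here is a small sketch that works on exported span records rather than on the live tracer. The `SpanRecord` shape is an assumption about what your backend returns (trace ID, span ID, parent ID, name, duration); real exports carry more fields and the query would normally run in the backend itself.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SpanRecord:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    duration_ms: float


def slowest_child(spans: Iterable[SpanRecord], trace_id: str) -> Optional[SpanRecord]:
    """Return the direct child of the root span that took the longest."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    roots = [s for s in in_trace if s.parent_id is None]
    if not roots:
        return None
    root_id = roots[0].span_id
    children = [s for s in in_trace if s.parent_id == root_id]
    return max(children, key=lambda s: s.duration_ms, default=None)
```

For a trace where `retrieval` took 150 ms and `chat.completion` took 1000 ms, this returns the model span, which tells you where to look first.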
Adding Evaluation to the Loop
Operational telemetry tells you the system is running. Evaluation tells you whether it is doing the right thing. There are two modes, and you want both.
Offline evaluation runs against a curated dataset before you ship. You compare new prompts or models against a baseline using metrics like exact match, semantic similarity, or an LLM-as-judge score. This belongs in your CI pipeline so a prompt change that degrades quality fails the build.
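A CI gate for this can be quite small. The sketch below assumes you already have per-example scores for the baseline and the candidate, keyed by metric name. The function names and the 0.02 tolerance are illustrative choices, not recommendations.

```python
from statistics import mean
from typing import Dict, List

Scores = Dict[str, List[float]]  # metric name -> per-example scores


def find_regressions(
    baseline: Scores, candidate: Scores, tolerance: float = 0.02
) -> Dict[str, float]:
    """Return metrics whose candidate mean fell more than `tolerance` below baseline."""
    regressions: Dict[str, float] = {}
    for metric, base_scores in baseline.items():
        drop = mean(base_scores) - mean(candidate[metric])
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


def gate_exit_code(baseline: Scores, candidate: Scores, tolerance: float = 0.02) -> int:
    """Print any regressions and return a process exit code for CI."""
    regressions = find_regressions(baseline, candidate, tolerance)
    for metric, drop in regressions.items():
        print(f"REGRESSION {metric}: mean dropped by {drop}")
    return 1 if regressions else 0
```

In a CI script you would end with `sys.exit(gate_exit_code(baseline, candidate))` so a non-zero code fails the build. Comparing means is the simplest possible check; with small datasets you may want a paired comparison or a confidence interval instead.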
Online evaluation scores a sample of live production traffic. Because you cannot run expensive judges on every call, sample intelligently: always evaluate flagged or low-confidence responses, and randomly sample the rest.
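One way to express that sampling rule is shown below. Note that it swaps true random sampling for a hash of the trace ID, so the same trace always gets the same decision across workers and retries. The function name, the 5% default rate, and the 0.5 confidence floor are all assumptions for illustration.

```python
import hashlib
from typing import Optional


def should_evaluate(
    trace_id: str,
    flagged: bool = False,
    confidence: Optional[float] = None,
    sample_rate: float = 0.05,
    confidence_floor: float = 0.5,
) -> bool:
    """Decide whether a response should be sent to the (expensive) judges."""
    if flagged:
        return True  # always evaluate flagged responses
    if confidence is not None and confidence < confidence_floor:
        return True  # always evaluate low-confidence responses
    # Map the trace ID to a stable bucket in [0, 1) and sample the remainder.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Raising `sample_rate` for a high-risk feature is then a configuration change rather than a code change.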
A common pattern is asynchronous scoring. The user gets their response immediately, and a background worker attaches quality scores to the trace afterward.
def score_response_async(trace_id, question, answer, context):
    groundedness = judge_groundedness(answer, context)
    relevance = judge_relevance(question, answer)
    emit_score(trace_id, "groundedness", groundedness)
    emit_score(trace_id, "relevance", relevance)
When you join these scores back to your traces, you can build dashboards that show quality trends over time, not just latency and cost. That is the difference between operating an LLM and merely hosting one.
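The background-worker side of asynchronous scoring might look like the following sketch, which uses only the standard library. It assumes jobs arrive on an in-process queue as dictionaries, that each judge is a callable taking the job and returning a float, and that `emit` writes a score to your backend. All three are hypothetical interfaces, and a production system would more likely use a durable queue.

```python
import queue
import threading
from typing import Callable, Dict, Optional

Job = Dict[str, str]                      # trace_id, question, answer, context
Judge = Callable[[Job], float]            # hypothetical judge signature
Emit = Callable[[str, str, float], None]  # (trace_id, metric_name, score)


def start_scoring_worker(
    jobs: "queue.Queue[Optional[Job]]",
    judges: Dict[str, Judge],
    emit: Emit,
) -> threading.Thread:
    """Start a daemon thread that scores queued responses off the request path."""

    def run() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: shut the worker down
                    return
                for name, judge in judges.items():
                    try:
                        emit(job["trace_id"], name, judge(job))
                    except Exception:
                        # A failing judge must not kill the worker; real code would log this.
                        continue
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```

The request handler only calls `jobs.put(...)` after returning the answer to the user, so judge latency never shows up in user-facing response times.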
Handling Sensitive Data and Compliance
Capturing full prompt and response payloads is powerful for debugging and dangerous for privacy. Prompts frequently contain personal data, credentials, or proprietary content. Before you log anything, decide on a redaction and retention policy.
- Redact PII at the instrumentation layer using a detection pass before payloads are written to your backend.
- Tier your retention. Keep metrics long-term, but expire raw payloads on a short schedule.
- Control access to trace payloads with the same rigor you apply to production databases.
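As a rough illustration of redaction at the instrumentation layer, the sketch below runs a few regular expressions over a payload before it is logged. The patterns are deliberately simple and will miss many kinds of PII, and the `sk-` key format is just an example prefix. Treat this as a placeholder for a proper detection pass, not as a complete one.

```python
import re

# Illustrative patterns only; order matters because earlier matches are replaced first.
_PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("API_KEY", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{8,}\d")),
]


def redact(text: str) -> str:
    """Replace each detected value with a bracketed label such as [EMAIL]."""
    for label, pattern in _PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Calling `redact` on both the prompt and the response before setting them as span attributes keeps raw values out of the backend entirely, which is easier to reason about than scrubbing them later.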
These constraints are sharper in regulated settings. Teams in healthcare, finance, and the public sector often need data residency guarantees and audit trails. If that describes your context, the patterns we use across regulated industries emphasize redaction-by-default and strict separation between operational metrics and raw content.
Alerting on the Signals That Matter
Once telemetry flows, set alerts that reflect LLM-specific failure modes rather than generic uptime checks.
- Cost anomalies. Alert when token usage or spend per hour exceeds a rolling baseline. Runaway loops and prompt-injection attacks often show up here first.
- Quality regressions. Alert when average groundedness or relevance drops below a threshold over a window.
- Refusal and content-filter spikes. A sudden rise can signal a broken prompt template or an upstream model change.
- Time-to-first-token latency. For user-facing chat, this matters more than total latency.
Tie these alerts to the trace data so an on-call engineer can jump from "cost spiked at 14:05" straight to the offending traces.
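A rolling-baseline check for the cost alert can be sketched in a few lines. The class name, the 24-sample window, and the 3x factor are assumptions chosen for illustration. Most teams would express the same rule in their alerting system's query language instead of application code.

```python
from collections import deque
from statistics import mean


class RollingBaselineAlert:
    """Flag a value that exceeds `factor` times the mean of recent normal values."""

    def __init__(self, window: int = 24, factor: float = 3.0, min_samples: int = 6):
        self.history = deque(maxlen=window)
        self.factor = factor
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record one reading (for example, spend per hour) and return True on anomaly."""
        has_baseline = len(self.history) >= self.min_samples
        is_anomaly = has_baseline and value > self.factor * mean(self.history)
        if not is_anomaly:
            # Keep anomalous readings out of the baseline so a spike does not hide the next one.
            self.history.append(value)
        return is_anomaly
```

The same shape works for quality regressions if you invert the comparison and feed it windowed average groundedness or relevance instead of spend.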
A Practical Rollout Sequence
You do not implement all of this at once. A sane order of operations:
- Instrument operational and cost telemetry first. This is low-risk and immediately useful for spend control.
- Add distributed tracing across your chains so you can debug multi-step flows.
- Introduce offline evaluation in CI to protect against regressions on deploy.
- Layer in online evaluation with sampling for production quality visibility.
- Wire up alerts and dashboards that combine all four signal layers.
Standing up this stack and keeping it reliable is real engineering work, and it sits alongside the data pipelines and model serving infrastructure described in our Data, AI & MLOps capabilities. The goal is not dashboards for their own sake. It is the ability to answer, at any moment, whether your production LLM is fast, affordable, and correct.
FAQ
What is the difference between LLMOps and MLOps?
MLOps covers the lifecycle of machine learning models broadly: training, versioning, deployment, and monitoring. LLMOps is a specialization focused on large language models, adding concerns that classic ML rarely faces such as prompt management, token-based cost tracking, non-deterministic output evaluation, and tracing across multi-step retrieval and agent chains.
Do I need a dedicated observability platform, or can I use existing APM tools?
You can start with existing tools if they support OpenTelemetry, since LLM traces are still traces. The gap is usually quality evaluation and token-cost analytics, which general APM tools do not provide out of the box. Many teams pair their existing tracing backend with a purpose-built LLM evaluation layer rather than replacing everything.
How do I evaluate output quality without ground-truth answers?
For production traffic where you have no reference answer, use reference-free methods. LLM-as-judge scoring can assess groundedness against retrieved context, relevance to the question, and toxicity. Combine these automated scores with user feedback signals like thumbs up/down to triangulate quality without needing a labeled answer for every request.
How much production traffic should I sample for online evaluation?
There is no universal number. Always evaluate flagged, low-confidence, or negatively rated responses, then sample the remainder at a rate your evaluation budget allows. Many teams begin with a small percentage and increase it for high-risk features. The point is statistical coverage of quality trends, not scoring every single call.
How do I keep prompt and response logging compliant with privacy rules?
Redact PII at the instrumentation layer before payloads reach storage, tier your retention so raw content expires quickly while metrics persist, and restrict access to trace payloads. In regulated environments, add data residency controls and audit logging, and keep operational metrics strictly separated from raw content.



