Data, AI & MLOps · 7 min read · Jun 28, 2026

How to Implement LLM Telemetry Using Prometheus and Grafana

Techsense Developers

Engineering

If you are running large language models in production and cannot answer basic questions like "what is our p95 latency," "how many tokens did we burn last hour," or "which prompt template is driving timeouts," you need LLM telemetry. The fastest open-source path is to instrument your inference layer to expose Prometheus metrics, scrape them on an interval, and visualize them in Grafana. This post walks through the metrics that matter, the instrumentation code, the scrape configuration, and the dashboard panels I use to keep LLM workloads observable and accountable.

You do not need a commercial observability platform to start. Prometheus and Grafana give you a credible, vendor-neutral foundation, and they integrate with the alerting and tracing tools your platform team likely already runs.

Why LLM Telemetry Is Different From Standard Service Monitoring

Traditional web service monitoring assumes requests are cheap, fast, and roughly uniform. LLM requests violate all three assumptions. A single call can run for tens of seconds, stream tokens incrementally, cost real money per request, and fail in ways that have nothing to do with HTTP status codes (a 200 response can still contain a hallucinated or truncated answer).

That means your LLM telemetry needs to capture dimensions that generic APM tools ignore:

Token counts for prompt and completion, per request and aggregated.
Time to first token (TTFT), which is the latency users actually feel in a streaming UI.
Total generation latency, separate from TTFT.
Cost, derived from token counts and model pricing.
Model and prompt-template labels, so you can compare versions.
Error and rejection reasons, including provider rate limits, context-length overflows, and safety filter blocks.

If you only track request count and HTTP latency, you will miss the failure modes that matter most in LLMOps. Getting this right is part of building durable data and AI platform capabilities rather than a one-off script.

Step 1: Define the Metrics You Will Expose

Before writing code, decide on your metric names and label cardinality. Keep labels low-cardinality. Do not put user IDs, full prompts, or request IDs in labels, because that will blow up Prometheus memory.

A solid starting set:

Metric	Type	Labels
`llm_requests_total`	Counter	`model`, `template`, `status`
`llm_request_duration_seconds`	Histogram	`model`, `template`
`llm_time_to_first_token_seconds`	Histogram	`model`, `template`
`llm_prompt_tokens_total`	Counter	`model`, `template`
`llm_completion_tokens_total`	Counter	`model`, `template`
`llm_errors_total`	Counter	`model`, `reason`

Counters answer "how many," histograms let you compute quantiles like p95 and p99, and keeping status and reason as bounded enums protects your cardinality budget.
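To make the cardinality budget concrete, a rough worst-case series count for the table above can be estimated by multiplying each metric's label value counts, then accounting for the extra series a histogram emits per label combination. The sketch below is a back-of-the-envelope estimate, not a measurement. The label value counts (3 models, 20 templates, 2 statuses, 5 reasons) and the 10,000-user scenario are assumptions chosen for illustration, and the bucket counts are the 8 and 6 finite buckets used in the Step 2 instrumentation. It ignores any additional series a client library may emit, such as creation timestamps.

```python
from math import prod

# Assumed number of distinct values per label (illustrative, not measured).
LABEL_VALUES = {"model": 3, "template": 20, "status": 2, "reason": 5}

# metric name -> (labels, finite histogram buckets, or None for a counter)
METRICS = {
    "llm_requests_total": (("model", "template", "status"), None),
    "llm_request_duration_seconds": (("model", "template"), 8),
    "llm_time_to_first_token_seconds": (("model", "template"), 6),
    "llm_prompt_tokens_total": (("model", "template"), None),
    "llm_completion_tokens_total": (("model", "template"), None),
    "llm_errors_total": (("model", "reason"), None),
}


def series_per_combination(finite_buckets):
    """A counter is one series; a histogram adds +Inf, _sum and _count."""
    if finite_buckets is None:
        return 1
    return finite_buckets + 1 + 2


def estimate_series(metrics, label_values):
    """Worst-case series count if every label combination is observed."""
    total = 0
    for labels, finite_buckets in metrics.values():
        combinations = prod(label_values[name] for name in labels)
        total += combinations * series_per_combination(finite_buckets)
    return total


baseline = estimate_series(METRICS, LABEL_VALUES)

# Hypothetical mistake: a user_id label with 10,000 values on one counter.
with_user_id = dict(METRICS)
with_user_id["llm_requests_total"] = (
    ("model", "template", "status", "user_id"),
    None,
)
exploded = estimate_series(with_user_id, {**LABEL_VALUES, "user_id": 10_000})

print(baseline)  # 1455
print(exploded)  # 1201335
```

Under these assumptions, one unbounded label on a single counter moves the total from roughly fifteen hundred series to over a million, which is why the label set is worth fixing before any code is written.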

Step 2: Instrument Your Inference Code

Here is a Python example using the official prometheus_client library wrapped around an LLM call. The pattern is intentionally provider-agnostic.

import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter(
    "llm_requests_total",
    "Total LLM requests",
    ["model", "template", "status"],
)
DURATION = Histogram(
    "llm_request_duration_seconds",
    "End-to-end LLM request latency",
    ["model", "template"],
    buckets=(0.5, 1, 2, 5, 10, 20, 30, 60),
)
TTFT = Histogram(
    "llm_time_to_first_token_seconds",
    "Time to first streamed token",
    ["model", "template"],
    buckets=(0.1, 0.25, 0.5, 1, 2, 5),
)
PROMPT_TOKENS = Counter(
    "llm_prompt_tokens_total", "Prompt tokens", ["model", "template"]
)
COMPLETION_TOKENS = Counter(
    "llm_completion_tokens_total", "Completion tokens", ["model", "template"]
)
ERRORS = Counter(
    "llm_errors_total", "LLM errors", ["model", "reason"]
)


def generate(client, model, template, messages):
    # `client.chat`, `chunk.delta`, and `stream.usage` stand in for your
    # provider SDK; adapt the attribute names to the client you use.
    labels = {"model": model, "template": template}
    start = time.perf_counter()
    first_token_at = None
    completion_text = []

    try:
        stream = client.chat(model=model, messages=messages, stream=True)
        for chunk in stream:
            if not chunk.delta:
                continue  # skip empty keep-alive or metadata-only chunks
            if first_token_at is None:
                # First chunk that carries text: the latency a user perceives.
                first_token_at = time.perf_counter()
                TTFT.labels(**labels).observe(first_token_at - start)
            completion_text.append(chunk.delta)

        usage = stream.usage  # provider-reported token usage
        PROMPT_TOKENS.labels(**labels).inc(usage.prompt_tokens)
        COMPLETION_TOKENS.labels(**labels).inc(usage.completion_tokens)
        REQUESTS.labels(**labels, status="success").inc()
        return "".join(completion_text)

    except Exception as exc:
        reason = classify_error(exc)  # map to a bounded set of reasons
        ERRORS.labels(model=model, reason=reason).inc()
        REQUESTS.labels(**labels, status="error").inc()
        raise
    finally:
        # Runs on success and failure, so failed calls still record latency.
        DURATION.labels(**labels).observe(time.perf_counter() - start)

Two details worth emphasizing:

Record TTFT from the streamed response, not from a synchronous completion. If you do not stream, drop the TTFT metric rather than faking it.
Map raw exceptions to a bounded reason set. A helper like classify_error should return values such as rate_limit, context_overflow, timeout, safety_block, or provider_5xx. Never pass an arbitrary exception string into a label.
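One way the classify_error helper could look is sketched below. This is an illustration under assumptions, not a provider's real API: the `status_code` and `code` attributes and the code strings `context_length_exceeded` and `content_filter` are hypothetical stand-ins for whatever your SDK exposes on its exceptions, and the `unknown` fallback is an addition to the reason values listed above so that unrecognized errors never create new label values.

```python
ALLOWED_REASONS = frozenset(
    {
        "rate_limit",
        "context_overflow",
        "timeout",
        "safety_block",
        "provider_5xx",
        "unknown",
    }
)


def classify_error(exc: BaseException) -> str:
    """Map an exception to one value from a fixed set of label values.

    `status_code` and `code` are hypothetical attributes; substitute the
    fields your provider SDK actually sets on its exception types.
    """
    if isinstance(exc, TimeoutError):
        return "timeout"

    status = getattr(exc, "status_code", None)
    code = getattr(exc, "code", None)

    if status == 429:
        return "rate_limit"
    if code == "context_length_exceeded":
        return "context_overflow"
    if code == "content_filter":
        return "safety_block"
    if isinstance(status, int) and 500 <= status <= 599:
        return "provider_5xx"
    # Anything unrecognized collapses into one value so the label stays bounded.
    return "unknown"


class FakeProviderError(Exception):
    """Stand-in exception used only to exercise the helper."""

    def __init__(self, status_code=None, code=None):
        super().__init__("provider error")
        self.status_code = status_code
        self.code = code


print(classify_error(TimeoutError()))                      # timeout
print(classify_error(FakeProviderError(status_code=429)))  # rate_limit
print(classify_error(FakeProviderError(status_code=503)))  # provider_5xx
print(classify_error(ValueError("unexpected")))            # unknown
```

Because every branch returns a literal from ALLOWED_REASONS, the number of `reason` series is fixed no matter what messages the provider sends back.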

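If your service runs multiple worker processes (a common pattern with Gunicorn or Uvicorn workers), each worker keeps its own in-memory metrics, so any single scrape reflects only the worker that happened to answer it. Use the prometheus_client multiprocess mode, or aggregate via a sidecar, so the exposed counters cover all workers.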

Step 3: Expose and Scrape the Metrics Endpoint

Expose a /metrics endpoint. With FastAPI:

from fastapi import FastAPI, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Then point Prometheus at it. A minimal prometheus.yml scrape job:

scrape_configs:
  - job_name: "llm-inference"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["llm-gateway:8000"]
        labels:
          service: "llm-gateway"
          env: "production"

If you run on Kubernetes, use the Prometheus Operator and a ServiceMonitor instead of static targets so scraping follows your pods automatically.
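Before relying on the scrape job, it can help to confirm that the endpoint actually exposes the metric families from Step 1. The sketch below is a minimal check that uses only the standard library and reads the `# TYPE` lines of the text exposition format. The helper names (`exposed_families`, `missing_families`, `check_endpoint`) and the sample payload are illustrative and not part of any library.

```python
from urllib.request import urlopen

# Metric families from the Step 1 table.
EXPECTED_FAMILIES = (
    "llm_requests_total",
    "llm_request_duration_seconds",
    "llm_time_to_first_token_seconds",
    "llm_prompt_tokens_total",
    "llm_completion_tokens_total",
    "llm_errors_total",
)


def _family(name):
    # Some exposition formats declare counters without the _total suffix,
    # so compare names with the suffix removed.
    return name.removesuffix("_total")


def exposed_families(exposition_text):
    """Collect family names from '# TYPE <name> <type>' lines."""
    families = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["#", "TYPE"]:
            families.add(_family(parts[2]))
    return families


def missing_families(exposition_text, expected=EXPECTED_FAMILIES):
    """Return the expected families that the payload does not declare."""
    exposed = exposed_families(exposition_text)
    return sorted(name for name in expected if _family(name) not in exposed)


def check_endpoint(url, timeout=5.0):
    """Fetch a live /metrics endpoint and report missing families."""
    with urlopen(url, timeout=timeout) as response:
        return missing_families(response.read().decode("utf-8"))


# Hand-written sample payload, used here instead of a live endpoint.
SAMPLE = """\
# HELP llm_requests_total Total LLM requests
# TYPE llm_requests_total counter
llm_requests_total{model="m",template="t",status="success"} 3.0
# HELP llm_request_duration_seconds End-to-end LLM request latency
# TYPE llm_request_duration_seconds histogram
"""

print(missing_families(SAMPLE))
# ['llm_completion_tokens_total', 'llm_errors_total',
#  'llm_prompt_tokens_total', 'llm_time_to_first_token_seconds']
```

Against a running service, `check_endpoint("http://llm-gateway:8000/metrics")` would use the same host and port as the scrape target above; an empty list means every expected family is declared.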

Step 4: Build the Grafana Dashboard for LLM Telemetry

With metrics flowing into Prometheus, add it as a Grafana data source and build panels. These are the PromQL queries I rely on.

Request rate by model:

sum by (model) (rate(llm_requests_total[5m]))

Error rate as a percentage:

100 * sum(rate(llm_errors_total[5m]))
  / sum(rate(llm_requests_total[5m]))

p95 end-to-end latency:

histogram_quantile(
  0.95,
  sum by (le, model) (rate(llm_request_duration_seconds_bucket[5m]))
)

p95 time to first token:

histogram_quantile(
  0.95,
  sum by (le, model) (rate(llm_time_to_first_token_seconds_bucket[5m]))
)
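The quantile queries above are only as accurate as the bucket boundaries behind them, because histogram_quantile interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified re-implementation of that interpolation for classic histograms, written only to illustrate the effect; it is not Prometheus code. The 100 sample durations are made up, `DEFAULT_BUCKETS` copies the prometheus_client defaults, and `LLM_BUCKETS` copies the duration buckets from Step 2.

```python
import math
from bisect import bisect_left

# Default buckets shipped with prometheus_client's Histogram.
DEFAULT_BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5,
                   0.75, 1.0, 2.5, 5.0, 7.5, 10.0, math.inf)
# Buckets from the llm_request_duration_seconds definition in Step 2.
LLM_BUCKETS = (0.5, 1, 2, 5, 10, 20, 30, 60, math.inf)


def cumulative_counts(observations, bounds):
    """Count observations per bucket, cumulatively, as Prometheus stores them."""
    counts = [0] * len(bounds)
    for value in observations:
        first = bisect_left(bounds, value)  # first bucket with bound >= value
        for i in range(first, len(bounds)):
            counts[i] += 1
    return counts


def estimate_quantile(q, bounds, counts):
    """Estimate a quantile by linear interpolation within one bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    b = next(i for i, count in enumerate(counts) if count >= rank)
    if b == len(bounds) - 1:
        # Rank falls in the +Inf bucket: the best available answer is the
        # highest finite bound, however slow the requests really were.
        return bounds[-2]
    lower = bounds[b - 1] if b > 0 else 0.0
    below = counts[b - 1] if b > 0 else 0
    return lower + (bounds[b] - lower) * (rank - below) / (counts[b] - below)


# Made-up durations: 100 generations that each took 22 to 28 seconds.
durations = [22.0, 24.0, 26.0, 28.0] * 25

default_p95 = estimate_quantile(
    0.95, DEFAULT_BUCKETS, cumulative_counts(durations, DEFAULT_BUCKETS)
)
llm_p95 = estimate_quantile(
    0.95, LLM_BUCKETS, cumulative_counts(durations, LLM_BUCKETS)
)

print(default_p95)  # 10.0
print(llm_p95)      # 29.5
```

With the default buckets every observation lands in the +Inf bucket, so the reported p95 is pinned at 10 seconds. With the Step 2 buckets the estimate falls inside the 20 to 30 second bucket, close to the real values.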

Token throughput per minute:

sum(rate(llm_completion_tokens_total[1m])) * 60

Estimated hourly cost (replace the constants with your model's per-token price):

sum(rate(llm_prompt_tokens_total[1h])) * 3600 * 0.0000005
+ sum(rate(llm_completion_tokens_total[1h])) * 3600 * 0.0000015
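The same arithmetic, extended to more than one model, is sketched below in Python as a way to sanity-check the panel. The model names, the token rates, and the second price row are hypothetical; the first price row reuses the placeholder constants from the query above. Treat it as a worked example of the formula (tokens per second × 3600 × price per token), not as real pricing.

```python
# Hypothetical per-token prices in USD. The first row reuses the placeholder
# constants from the PromQL query; the second row is invented for contrast.
PRICES = {
    "example-model": {"prompt": 0.0000005, "completion": 0.0000015},
    "example-model-large": {"prompt": 0.000003, "completion": 0.000015},
}


def hourly_cost(tokens_per_second, prices):
    """Estimate hourly spend from per-model token rates.

    tokens_per_second: {model: {"prompt": rate, "completion": rate}}
    """
    total = 0.0
    for model, rates in tokens_per_second.items():
        price = prices[model]
        total += rates["prompt"] * 3600 * price["prompt"]
        total += rates["completion"] * 3600 * price["completion"]
    return total


# Made-up rates, as sum by (model) (rate(...[1h])) might return them.
rates = {
    "example-model": {"prompt": 1000.0, "completion": 200.0},
    "example-model-large": {"prompt": 100.0, "completion": 20.0},
}

print(round(hourly_cost(rates, PRICES), 2))  # 5.04
```

Here the smaller model contributes 1.80 + 1.08 and the larger one 1.08 + 1.08, so a model with a tenth of the traffic accounts for over 40% of the spend. A single blended price constant would hide that.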

Recommended dashboard layout

Organize panels into rows so on-call engineers can triage quickly:

Health row: request rate, error rate, p95 latency, p95 TTFT.
Cost row: token throughput, estimated hourly cost, cost per request.
Breakdown row: error reasons stacked over time, latency split by template.

Use Grafana template variables for model and template so one dashboard serves every deployment. This breakdown view is where regressions surface: when a new prompt template ships, you will see its latency and token cost diverge from the baseline in the template-split panels.

Step 5: Add Alerting Before You Need It

Dashboards are for investigation. Alerts are for waking the right person. Define Prometheus alerting rules that match the failure modes you actually care about.

groups:
  - name: llm-telemetry
    rules:
      - alert: LLMHighErrorRate
        expr: |
          100 * sum(rate(llm_errors_total[5m]))
            / sum(rate(llm_requests_total[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "LLM error rate above 5% for 10 minutes"

      - alert: LLMLatencyRegression
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(llm_request_duration_seconds_bucket[5m]))
          ) > 20
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM p95 latency above 20s"

Set thresholds against your own observed baselines, not arbitrary numbers. Run the system for a week, read the dashboards, then pick thresholds that reflect normal behavior.
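One possible way to turn a week of observations into a number is sketched below. The heuristic (take a high percentile of the p95 values you observed, then add headroom) is an assumption of this example, not a rule from Prometheus or a universal recommendation, and the sample values are invented. The function names `nearest_rank_percentile` and `suggest_threshold` are illustrative.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_threshold(baseline_p95_seconds, pct=99, headroom=1.25):
    """Suggest an alert threshold from observed p95 latencies.

    pct and headroom are tunable assumptions; pick values that match how
    noisy your own baseline is.
    """
    return nearest_rank_percentile(baseline_p95_seconds, pct) * headroom


# Invented p95 readings in seconds, standing in for a week of dashboard data.
observed_p95 = [8.0, 9.5, 10.0, 11.0, 12.0, 9.0, 16.0, 10.5]

print(nearest_rank_percentile(observed_p95, 50))  # 10.0
print(suggest_threshold(observed_p95))            # 20.0
```

With so few samples the 99th percentile is simply the maximum. A real baseline would have one reading per evaluation interval over the whole week, which makes the percentile less sensitive to a single spike.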

Practical Cardinality and Cost Notes

A few guardrails I enforce on every implementation:

Cap your label values. If template can take hundreds of values, group rare ones into an other bucket.
Never label by user, session, prompt text, or response text. Use logs or a tracing system for high-cardinality, per-request detail.
Set histogram buckets deliberately. Default buckets are tuned for fast web requests and will give you useless quantiles for 30-second generations.
Keep local retention short and move long-term history elsewhere. Prometheus is not a long-term store and does not downsample on its own. Use a remote-write target if you need months of history.
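The first guardrail, capping template values, can be sketched as a small normalization step applied before a value ever reaches a label. The template names and request counts below are made up, and `build_allowlist` and `template_label` are illustrative helpers rather than part of prometheus_client.

```python
from collections import Counter


def build_allowlist(request_counts, max_templates):
    """Keep the busiest templates; everything else will map to 'other'."""
    top = Counter(request_counts).most_common(max_templates)
    return frozenset(name for name, _ in top)


def template_label(template, allowlist):
    """Return a bounded label value for the given template name."""
    return template if template in allowlist else "other"


# Made-up traffic numbers for hypothetical template names.
request_counts = {
    "summarize_v2": 9000,
    "qa_v1": 4000,
    "classify_v3": 700,
    "experiment_17": 3,
}

allowlist = build_allowlist(request_counts, max_templates=3)

print(template_label("summarize_v2", allowlist))   # summarize_v2
print(template_label("experiment_17", allowlist))  # other
```

Calling `template_label(template, allowlist)` at the top of the instrumented function keeps the number of `template` series at `max_templates + 1`, regardless of how many experimental templates are deployed.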

These constraints matter most in regulated and high-volume settings. Teams in financial services and other regulated industries often pair this open-source stack with stricter retention and access controls: Grafana provides user and role-based permissions, while Prometheus is commonly placed behind an authenticating proxy and paired with remote storage for longer retention.

Putting It Together

A working LLM telemetry stack is not exotic. You instrument the inference path with counters and histograms, expose /metrics, scrape with Prometheus, visualize in Grafana, and alert on baselines you measured rather than guessed. The result is an observability layer that tells you what your models are doing, what they cost, and where they are degrading. That is the difference between operating LLMs with confidence and finding out about an outage from your users.

FAQ

What is LLM telemetry and how does it differ from logging?

LLM telemetry is the structured, numeric measurement of model behavior over time: latency, token usage, error rates, and cost. Logging captures discrete events and message detail. Telemetry aggregates into time series you can graph and alert on. You want both. Telemetry tells you something is wrong and roughly where; logs and traces tell you exactly why for a specific request.

Can I use Prometheus and Grafana for LLM monitoring without a commercial tool?

Yes. Prometheus for collection and alert rules (with Alertmanager routing the notifications), plus Grafana for dashboards, is a complete open-source LLM monitoring stack. You only need additional tooling when you require long-term retention, distributed tracing of multi-step chains, or correlation across many services, and even then you can extend rather than replace this foundation.

How do I track LLM cost in Grafana?

Record prompt and completion tokens as Prometheus counters, then multiply token rates by your model's per-token price inside PromQL. Because pricing differs per model, label your token metrics by model and apply the correct constants per series, or maintain a separate pricing lookup if you prefer to keep math out of dashboards.

How do I avoid Prometheus cardinality problems with LLM metrics?

Keep labels bounded and low-cardinality. Use model, template, status, and a small enum of error reason values. Never label by user ID, request ID, or prompt content. High-cardinality, per-request data belongs in logs or a tracing backend, not in Prometheus labels.

What is the most important LLM metric to start with?

If you can only track one, track p95 end-to-end latency segmented by model, because it captures user-facing performance and surfaces regressions fast. Add error rate and token-based cost immediately after, since those three together cover reliability, experience, and spend.