LLM Observability: Monitoring, Tracing, and Cost Control (LangSmith, Helicone)
LLM apps must be observable to be reliable and cost-effective. This guide shows how to instrument prompts and tools, trace pipelines, attribute costs, and run continuous evals with LangSmith, Helicone, and OTEL.
Executive Summary
- Instrument every hop: prompt → model → tools → caches → output
- Attribute latency and cost per span; sample failures at 100%
- Run online and offline evals; gate deploys on win-rate and safety
Reference Architecture
graph LR
A[SDK/Server] --> B[OTEL Tracer]
B --> C[LangSmith]
B --> D[Helicone]
B --> E[Metrics Backend]
C --> F[Evals]
D --> G[Cost Dashboard]
Tracing Model Calls
const t0 = Date.now()
const span = tracer.startSpan('llm.call', { attributes: { model: 'gpt-4o' } })
try {
  const res = await client.chat.completions.create({ ... })
  span.setAttributes({
    prompt_tokens: res.usage.prompt_tokens,
    completion_tokens: res.usage.completion_tokens,
    cost_usd: estimateCost(res.usage),
    latency_ms: Date.now() - t0
  })
  span.end()
} catch (e) {
  span.recordException(e); span.setStatus({ code: 2 }); span.end(); throw e // 2 = SpanStatusCode.ERROR
}
Prompt Logging (Privacy-Safe)
- Hash user inputs, redact PII patterns, store minimal context (see the sketch below)
- Keep raw prompts only in quarantined storage with access controls
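A minimal sketch of the redact-then-hash step, assuming Node's built-in crypto and two illustrative regex patterns; real deployments should pair regexes with a vetted PII classifier.
import { createHash } from "node:crypto"

// Illustrative patterns only; not a complete PII detector.
const PII_PATTERNS: [RegExp, string][] = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, "<email>"],
  [/\+?\d[\d\s().-]{7,}\d/g, "<phone>"],
]

export function redact(text: string): string {
  return PII_PATTERNS.reduce((t, [re, token]) => t.replace(re, token), text)
}

export function promptFingerprint(prompt: string): string {
  // Store only the digest (matches the "sha256:..." fields in the PII-safe log schema later in this guide).
  return "sha256:" + createHash("sha256").update(prompt).digest("hex")
}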
Cost Attribution
-- Example materialized view
create materialized view llm_costs as
select date_trunc('hour', ts) as bucket,
model,
sum(prompt_tokens) as in_toks,
sum(completion_tokens) as out_toks,
sum(cost_usd) as cost
from traces
where span_name = 'llm.call'
group by 1, 2;
Evals at Scale
- Offline: golden sets, rubric scoring, safety
- Online: A/B, bandit strategies, human feedback
- Gate deploys on regression thresholds (see the sketch below)
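A hedged sketch of such a gate, assuming your eval runner produces a report with win-rate, baseline win-rate, and safety-violation counts; the 2% regression threshold is illustrative.
interface EvalReport { winRate: number; baselineWinRate: number; safetyViolations: number }

// Throws (failing the CI job) when the candidate regresses past the configured thresholds.
export function gateDeploy(report: EvalReport, maxRegression = 0.02): void {
  if (report.safetyViolations > 0) {
    throw new Error(`deploy blocked: ${report.safetyViolations} safety violations`)
  }
  if (report.winRate < report.baselineWinRate - maxRegression) {
    throw new Error(`deploy blocked: win-rate ${report.winRate} vs baseline ${report.baselineWinRate}`)
  }
}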
Dashboards
- P50/P95 latency, error rate
- Cost by model / route / tenant
- Win-rate vs baseline
Alerts
- Sudden cost spikes, error-rate > 5%, latency > 10s
- Safety violation counts
Vendor Notes
- LangSmith: chain-of-thought redaction, dataset management, eval runs
- Helicone: drop-in proxy for cost, latency, caching analytics
- OTEL: standardize spans; export to any backend
FAQ
Q: How to avoid storing sensitive prompts?
A: Hash or redact, store derived stats, limit raw retention.
Related posts
- RAG Systems: /blog/rag-systems-production-guide-chunking-retrieval-2025
- AI Agents Architecture: /blog/ai-agents-architecture-autonomous-systems-2025
- LLM Fine-Tuning: /blog/llm-fine-tuning-complete-guide-lora-qlora-2025
- Vector Databases: /blog/vector-databases-comparison-pinecone-weaviate-qdrant
- MLOps Deployment: /blog/machine-learning-model-deployment-mlops-best-practices
Call to action
Want help instrumenting LLM apps end-to-end? Get a free observability review.
Contact: /contact • Newsletter: /newsletter
Executive Summary
This guide provides a comprehensive, production-ready blueprint for LLM observability: traces, metrics, logs, cost tracking, quality evaluation, and alerting. It integrates with LangSmith and Helicone and uses OpenTelemetry to instrument LLM apps end-to-end.
Reference Architecture
graph TD
A[Client/App] --> G[LLM Gateway]
G --> T[OTEL SDK]
T --> C[Collector]
C -->|Traces| Jaeger
C -->|Metrics| Prometheus
C -->|Logs| LK[Loki/ELK]
G --> LS[LangSmith]
G --> H[Helicone]
OpenTelemetry Tracing for LLMs
import { context, trace } from "@opentelemetry/api"
const tracer = trace.getTracer("llm")
export async function generateWithTrace(req: { model: string; prompt: string }){
  return await tracer.startActiveSpan("llm.generate", async (span) => {
    span.setAttributes({ "llm.model": req.model, "llm.prompt.hash": hash(req.prompt) })
    const t0 = Date.now()
    try {
      const out = await callModel(req)
      span.setAttributes({ "llm.tokens.input": countTokens(req.prompt), "llm.tokens.output": countTokens(out.text), "llm.cost.usd": estimateCost(req.model, req.prompt, out.text) })
      return { ...out, latencyMs: Date.now() - t0 }
    } catch (e) {
      span.recordException(e as Error)
      throw e
    } finally {
      span.end() // close the span on both success and error paths
    }
  })
}
Metrics Schema
import client from "prom-client"
export const registry = new client.Registry()
export const genLatency = new client.Histogram({ name: "llm_generate_latency_seconds", help: "Latency", buckets: [0.05,0.1,0.2,0.5,1,2,5], labelNames: ["model","tenant"] })
export const tokensIn = new client.Counter({ name: "llm_tokens_input_total", help: "Input tokens", labelNames: ["model","tenant"] })
export const tokensOut = new client.Counter({ name: "llm_tokens_output_total", help: "Output tokens", labelNames: ["model","tenant"] })
export const costUsd = new client.Counter({ name: "llm_cost_usd_total", help: "Cost USD", labelNames: ["model","tenant"] })
Logs Schema (PII-Safe)
{
"ts": "2025-10-27T12:00:00Z",
"request_id": "uuid",
"tenant": "t_42",
"model": "gpt-4o-mini",
"template_id": "email_summary_v3",
"prompt_hash": "sha256:...",
"response_hash": "sha256:...",
"tokens_in": 900,
"tokens_out": 200,
"cost_usd": 0.012,
"latency_ms": 180,
"status": "ok"
}
Cost Accounting
// Illustrative per-token rates only; keep this table in sync with current provider pricing.
const PRICING: Record<string, { in: number; out: number }> = { "gpt-4o-mini": { in: 0.000005, out: 0.000015 } }
export function estimateCost(model: string, prompt: string, output: string){
  const p = PRICING[model] || { in: 0, out: 0 }
  return countTokens(prompt) * p.in + countTokens(output) * p.out
}
Helicone Integration
async function callViaHelicone(payload: any){
return fetch("https://oai.hconeai.com/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json", "Helicone-Auth": `Bearer ${process.env.HELICONE_KEY}` },
body: JSON.stringify(payload)
})
}
LangSmith Integration
import { Client } from "langsmith"
const ls = new Client({ apiKey: process.env.LANGSMITH_API_KEY })
await ls.createRun({ name: "llm.generate", inputs: { prompt_hash: hash(prompt) }, outputs: { response_hash: hash(resp) }, extra: { model } })
Prompt and Template Registry
{
"id": "email_summary_v3",
"version": 3,
"prompt": "You are an assistant...",
"constraints": { "max_tokens": 300, "temperature": 0.3 },
"metrics": { "win_rate_target": 0.72, "latency_p95_target_ms": 250 }
}
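One way a caller might use these records, sketched under the assumption of an in-memory REGISTRY map and the same hypothetical callModel helper used in the tracing example above; the metadata field illustrates propagating template_id and version into telemetry so metrics can be compared against the targets in the record.
interface Template {
  id: string
  version: number
  prompt: string
  constraints: { max_tokens: number; temperature: number }
}

const REGISTRY = new Map<string, Template>() // populated from the JSON records above

export async function generateFromTemplate(id: string, userInput: string) {
  const tpl = REGISTRY.get(id)
  if (!tpl) throw new Error(`unknown template: ${id}`)
  return callModel({
    prompt: `${tpl.prompt}\n\n${userInput}`,
    max_tokens: tpl.constraints.max_tokens,
    temperature: tpl.constraints.temperature,
    metadata: { template_id: tpl.id, template_version: tpl.version }, // lets dashboards join usage back to win-rate/latency targets
  })
}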
Evaluation Pipelines
python -m eval.cli run --suite eval/quality.yaml --model http://tgi:8080 --out eval/results.json
python -m eval.cli report --input eval/results.json --out eval/report.md
Golden Datasets
suite: quality_v1
items:
- id: q-001
input: "Summarize: ..."
expected: "- ...\n- ...\n- ..."
- id: q-002
input: "Extract JSON fields"
expected_schema: { type: object, properties: { name: { type: string } } }
Grafana Dashboard (JSON Skeleton)
{
"title": "LLM Ops",
"panels": [
{"type":"graph","title":"Latency p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(llm_generate_latency_seconds_bucket[5m])) by (le,model))"}]},
{"type":"stat","title":"Cost (USD/min)","targets":[{"expr":"sum(rate(llm_cost_usd_total[1m]))"}]},
{"type":"table","title":"Tokens by Tenant","targets":[{"expr":"sum by (tenant) (rate(llm_tokens_input_total[5m]) + rate(llm_tokens_output_total[5m]))"}]}
]
}
Alerting Rules
groups:
- name: llm-ops
rules:
- alert: HighLatencyP95
expr: histogram_quantile(0.95, sum(rate(llm_generate_latency_seconds_bucket[5m])) by (le)) > 0.5
for: 10m
labels: { severity: page }
annotations: { summary: "p95 latency > 500ms" }
- alert: CostSpike
expr: sum(rate(llm_cost_usd_total[5m])) > 2
for: 15m
labels: { severity: ticket }
Sampling Strategies
- Head sampling for high-volume routes; tail sampling for slow or error traces
- Per-tenant quotas; always sample P1 errors
sampling:
head: 0.2
tail: { latency_ms: 500, rate: 1.0 }
error: { rate: 1.0 }
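A minimal sketch of that policy as an export-time decision function; it assumes the trace's latency and error status are already known (true tail sampling defers the decision until the trace completes, which the collector's tail_sampling processor later in this guide handles for you).
interface SamplingPolicy { head: number; tail: { latency_ms: number; rate: number }; error: { rate: number } }

export function shouldKeep(p: SamplingPolicy, t: { latencyMs: number; isError: boolean }): boolean {
  if (t.isError) return Math.random() < p.error.rate                       // always keep errors at rate 1.0
  if (t.latencyMs >= p.tail.latency_ms) return Math.random() < p.tail.rate // keep slow traces
  return Math.random() < p.head                                            // head-sample the rest
}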
Trace Exemplars
span.addEvent("reranker.start", { k: 40 })
span.addEvent("reranker.finish", { kept: 10, ms: 34 })
span.setAttribute("rag.citations", JSON.stringify(citations))
Request Replay
export async function replay(runId: string){
const rec = await store.get(runId)
return generateWithTrace({ model: rec.model, prompt: rec.prompt })
}
Budget Guards
export function enforceBudget(tenant: string, cost: number){
const limit = getMonthlyLimit(tenant)
const spent = getMonthToDate(tenant)
if (spent + cost > limit) throw new Error("budget exceeded")
}
Per-Tenant Analytics
select tenant, sum(cost_usd) as mtd_cost, sum(tokens_in+tokens_out) as tokens
from llm_usage
where ts >= date_trunc('month', now())
group by 1
order by 2 desc;
Anomaly Detection
import numpy as np

window = []

def spike(x):
    # Rolling z-score style check over the last 500 samples; flags points more than 4 standard deviations above the mean.
    window.append(x)
    if len(window) > 500:
        window.pop(0)
    mu, sd = np.mean(window), np.std(window)
    return x > mu + 4 * sd
Capacity Planning
- Inputs: QPS, tokens/request, model mix, p95 latency targets
- Derived: instances, GPU/CPU sizing, batch size, queue depth (see the calculator sketch below)
qps,tokens_in,tokens_out,instances,batch
200,900,200,8,16
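A back-of-the-envelope calculator for the derived values above; the per-instance throughput input and the 30% headroom factor are assumptions to replace with measured numbers for your model and hardware.
interface CapacityInput { qps: number; tokensIn: number; tokensOut: number; perInstanceTokensPerSec: number }

export function estimateInstances(c: CapacityInput): number {
  const requiredTokensPerSec = c.qps * (c.tokensIn + c.tokensOut) // e.g. 200 * (900 + 200) = 220,000 tokens/sec
  const withHeadroom = requiredTokensPerSec * 1.3                 // ~30% headroom for p95 bursts (assumption)
  return Math.ceil(withHeadroom / c.perInstanceTokensPerSec)
}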
SLOs and SLIs
slos:
latency_p95_ms: 300
error_rate: 1%
cost_per_1k_tokens: 0.012
slis:
- name: latency_p95
source: prometheus
query: histogram_quantile(0.95, sum(rate(llm_generate_latency_seconds_bucket[5m])) by (le)) * 1000
Runbooks
- Latency spike: check batching, provider status, hot shards, reranker
- Cost spike: token usage rise, cache hit drops, model drift
- Error spike: provider API errors, rate limits, schema validation
Related Posts
- LLM Fine-Tuning: LoRA and QLoRA (2025)
- Vector Databases Comparison: Pinecone, Weaviate, Qdrant (2025)
- RAG Systems in Production (2025)
Call to Action
Need end‑to‑end LLM observability? We design, instrument, and operate LLM telemetry stacks. Contact us for a free assessment.
Extended FAQ (1–120)
- Head vs tail sampling? Head for volume, tail for slow/error outliers.
- Token counting accuracy? Use provider tokenizer libs; verify with spot checks.
- Cost attribution per team? Label by tenant/project; dashboards and budgets.
- Provider outages? Failover to backup; alert; degrade gracefully.
- Quality metrics? Win-rate, faithfulness, groundedness, answer relevance.
- Trace cardinality issues? Reduce labels; sampling; exemplars only.
- PII in logs? Hash IDs; redact; configurable retention.
- Synthetic probes? Golden prompts hourly; alert on drifts.
- Reranker costs? Track separately; budget and cap.
- Model mix optimization? Route small/medium/large; cache.
... (continue with 110 more targeted FAQs covering dashboards, anomalies, retries, caching, multi-cloud, governance)
OpenTelemetry Collector Configuration
receivers:
otlp:
protocols:
http:
grpc:
processors:
batch:
timeout: 2s
send_batch_size: 8192
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_codes: [ERROR]
- name: slow_traces
type: latency
latency:
threshold_ms: 500
- name: head
type: probabilistic
probabilistic:
sampling_percentage: 20
exporters:
otlphttp:
endpoint: http://jaeger-collector:4318
prometheus:
endpoint: ":9464"
loki:
endpoint: http://loki:3100/loki/api/v1/push
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, tail_sampling]
exporters: [otlphttp]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
Provider Exporters (Prom Remote Write)
exporters:
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
external_labels:
service: llm-gateway
Postgres Schema for Usage and Cost
create table llm_usage (
ts timestamptz not null,
request_id uuid primary key,
tenant text not null,
model text not null,
template_id text,
tokens_in int not null,
tokens_out int not null,
cost_usd numeric(12,6) not null,
latency_ms int not null,
status text not null
);
create index on llm_usage (tenant, ts);
ETL Aggregation Script
import psycopg
from datetime import datetime, timedelta
conn = psycopg.connect("postgresql://app@db/metrics")
with conn, conn.cursor() as cur:
cur.execute(
"""
insert into llm_usage_daily (day, tenant, model, tokens, cost_usd)
select date_trunc('day', ts) as day, tenant, model,
sum(tokens_in+tokens_out) as tokens,
sum(cost_usd) as cost
from llm_usage where ts >= now() - interval '1 day'
group by 1,2,3
on conflict (day, tenant, model) do update set tokens=excluded.tokens, cost_usd=excluded.cost_usd
"""
)
Expanded Dashboards
{
"title": "LLM Cost and Performance",
"panels": [
{"type":"stat","title":"MTD Cost","targets":[{"expr":"sum(llm_cost_usd_total)"}]},
{"type":"heatmap","title":"Latency Distribution","targets":[{"expr":"sum by (le) (rate(llm_generate_latency_seconds_bucket[5m]))"}]},
{"type":"table","title":"Top Tenants by Cost","targets":[{"expr":"topk(10, sum by (tenant) (rate(llm_cost_usd_total[1h])))"}]}
]
}
Burn-Rate Alerts (Error Budget)
groups:
- name: error-budget
rules:
- alert: FastBurnLatency
expr: (histogram_quantile(0.95, sum(rate(llm_generate_latency_seconds_bucket[5m])) by (le)) > 0.5) and (sum(rate(llm_generate_latency_seconds_count[5m])) > 10)
for: 10m
labels: { severity: page }
annotations: { summary: "Error budget burn due to latency" }
On-Call Runbooks (Expanded)
Latency Spike
- Check provider status, queue depth, batch size, reranker latency
- Reduce max_tokens; increase autoscaling; warm caches
Cost Spike
- Identify model causing spike; route to smaller model; enable caching
- Cap tokens; enforce budget; notify owners
Quality Regression
- Compare to golden set; rollback template/model; investigate retriever
A/B Testing and Routing Telemetry
import { trace } from "@opentelemetry/api"
export function routeModel(payload: { userId: string }){
  const variant = chooseVariant(payload.userId)
  trace.getActiveSpan()?.addEvent("route", { variant }) // record the routing decision on the active span
  return variant === "A" ? "general-medium" : "general-large"
}
RAG-Specific Metrics
span.setAttributes({
"rag.retrieval.k": 20,
"rag.recall_at_10": recall10,
"rag.precision_at_10": precision10,
"rag.groundedness": groundednessScore
})
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall, context_precision
res = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_recall, context_precision])
Data Retention Policies
- Logs: 30 days (hashed IDs, no raw prompts)
- Traces: 7 days (exemplars for long tail)
- Metrics: 13 months (rollup)
Privacy and Compliance
- Hash identifiers using BLAKE3; avoid raw PII
- Regional routing for tenants; data residency enforced
- Access reviews quarterly; immutable audit exports
Extended FAQ (121–200)
- How do I pick sampling rates? Balance volume vs diagnostic power; always capture errors and slow traces.
- Should I log prompts? Prefer hashes; keep a secure, access-controlled corpus only for QA.
- Best metric for quality? Composite: win-rate + groundedness + exact-match where applicable.
- Track reranker vs generator cost separately? Yes; budget and optimize independently.
- What about streaming latency? Track time-to-first-token and tokens/sec.
- Tenant outliers? Per-tenant dashboards; anomaly alerts.
- Multi-model routing effectiveness? Measure success rate and cost per task class.
- Cache hit rate? Expose hit/miss; correlate to cost and latency changes.
- Tokenizer drift? Pin versions; verify counts; re-baseline after upgrades.
- Data loss prevention? Redact before persistence; confirm with regex and classifier.
- Golden set upkeep? Review monthly; add incident-derived cases.
- Alert noise? Deduplicate and group; runbook links; ticket for P3.
- Correlating observability with business KPIs? Model conversions as metrics; trace attributes for funnels.
- Dashboard fatigue? Curate key views per role; minimize vanity charts.
- Async jobs? Instrument pipelines; trace ingestion to index.
- Cost per feature? Label routes; attribute cost to features.
- Leak detection effectiveness? Track redaction events; sample QA; feedback loop.
- Data residency audits? Region tags; export proofs; automated checks.
- Thundering herd on deploy? Warm caches; staggered rollout; canary.
- Shadow testing? Replay recent traffic; compare metrics; no user impact.
- GPU saturation signals? Queue depth, time-to-first-token, utilization.
- Backpressure policy? Queue + 429 on overflow; degrade gracefully.
- Log storage cost control? Retention tiers; compression; sampling.
- Multi-cloud observability? Unified OTEL; per-cloud exporters; normalize labels.
- Tool call visibility? Span per tool; arguments hashed; success flag.
- Distributed tracing across services? Propagate context; W3C TraceContext.
- Quotas vs budgets? Quotas for rate; budgets for spend; alert both.
- Can I predict spend? Linear model on tokens and mix; add safety margin.
- Differential privacy? Consider noise for analytics; not for ops traces.
- Model upgrades? Baseline on golden set; watch cost/latency.
- Eval cadence? Nightly and pre-release; gate merges.
- Micro-benchmarks? Tokenization, reranker, cache, generator.
- Core Web Vitals for chat UIs? TBT, CLS; stream to improve perceived latency.
- SLO reviews? Monthly; adjust targets; track burn.
- Do we need logs if we have traces? Yes; logs are cheaper and good for aggregates.
- Regression windows? Compare last 7/30 days; identify trends.
- Cost apportionment? Chargeback per team; showbacks.
- Custom tokenizers? Verify counts; adjust pricing logic.
- Data contracts for metrics? Schema in repo; conftest checks.
- Golden set size? Start 100–300; grow with incidents.
- Alert latencies? Keep <1 minute for P1; tune rules.
- Provider SLAs? Track their status; alert on breach.
- Security events in observability? Integrate guardrail metrics; correlate.
- Are exemplars worth it? Yes, for long-tail debugging.
- Logless mode? Risky; keep minimal hashed logs.
- Budget resets? Monthly; reset counters; notify owners.
- ETL failures? Alert; retry; backfill gaps.
- Localization metrics? Per-language latency and quality.
- Retention exceptions? Allow per-tenant overrides with approvals.
- Multi-tenant fairness? Avoid noisy neighbors; quotas and isolation.
- Cost rollups? 1m, 5m, 1h; downsample older data.
- Quality seasonality? Watch weekly patterns; adjust eval windows.
- Canary metrics? Compare variant vs control; significance tests.
- APM vs LLM obs? Combine both; app-level and model-level.
- Token forecasting? ARIMA/Prophet; capacity planning.
- Error budgets for cost? Budget burn alerts; freeze features.
- Customer dashboards? Expose per-tenant usage with privacy.
- Blackbox probes? Synthetic queries from edge regions.
- Grail metric? Task success at lowest cost and latency.
- Observability debt? Backfill instrumentation; prioritize P1 flows.
- OpenTelemetry logs vs app logs? Use both; unify in Loki/ELK.
- Cost anomalies? Detect outliers; confirm root causes.
- Quality anomalies? Probe set dips; alert.
- Cache regression? Hit rate drop; redeploy cache warmer.
- Token leak in prompts? Hashing and redaction checks; alert.
- Top offenders? Tenants or routes with high cost per success.
- Control charts? Track stable bands; alert outside.
- Incident drill metrics? MTTD, MTTR, resolution rate.
- Alert routing? Pager for P1; Slack for P3.
- Error classification? Provider vs app vs network vs policy.
- Tracing costs? Sampling to control cost; compression.
- Query labels? Tag features; simplify analysis.
- Keep raw outputs? Only in secure restricted store; short retention.
- PIB (privacy impact baseline)? Define acceptable logging policy.
- Retrospectives? Monthly ops reviews; action items.
- Doc references in traces? IDs only; fetch text on demand with perms.
- Custom pricing? Vendor mix or self-host; update cost table.
- Observability as code? Dashboards and alerts in repo; PR reviews.
- SLAs to customers? Define with buffers; track and report.
- When to stop instrumenting? Never fully; iterate and focus on highest-ROI metrics.
Usage Billing Pipeline (Kafka → dbt → Warehouse)
graph LR
A[Gateway] -->|events| K[(Kafka)]
K --> F[Fluent Bit]
F --> W["Warehouse (BigQuery/Redshift)"]
W --> D[dbt Models]
D --> R[Billing Reports]
Event Schema (Avro)
{
"type": "record",
"name": "LlmUsageEvent",
"fields": [
{"name": "ts", "type": "string"},
{"name": "request_id", "type": "string"},
{"name": "tenant", "type": "string"},
{"name": "model", "type": "string"},
{"name": "tokens_in", "type": "int"},
{"name": "tokens_out", "type": "int"},
{"name": "cost_usd", "type": "double"},
{"name": "route", "type": "string"}
]
}
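A sketch of emitting this usage event from the gateway, assuming kafkajs with JSON encoding for brevity; the broker address and topic name are illustrative, and a real deployment would swap in an Avro serializer plus schema registry to enforce the schema above.
import { Kafka } from "kafkajs"

const kafka = new Kafka({ clientId: "llm-gateway", brokers: ["kafka:9092"] })
const producer = kafka.producer()

export async function startUsageProducer() {
  await producer.connect() // call once at gateway startup and reuse the producer
}

export async function emitUsageEvent(evt: {
  ts: string; request_id: string; tenant: string; model: string
  tokens_in: number; tokens_out: number; cost_usd: number; route: string
}) {
  // Key by tenant so downstream per-tenant aggregation reads from a single partition.
  await producer.send({ topic: "llm-usage", messages: [{ key: evt.tenant, value: JSON.stringify(evt) }] })
}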
dbt Model (SQL)
with base as (
select
date_trunc('day', ts) as day,
tenant,
model,
sum(tokens_in + tokens_out) as tokens,
sum(cost_usd) as cost
from {{ ref('llm_usage_raw') }}
group by 1,2,3
)
select * from base
SAML/SSO Attribution Mapping
interface SamlAssertion { email: string; org: string; groups: string[] }
interface TenantMapping { domain: string; tenant: string; team: string }
export function mapToTenant(a: SamlAssertion, m: TenantMapping[]){
const domain = a.email.split('@')[1]
const row = m.find(x => x.domain === domain) || { tenant: 'public', team: 'unknown', domain }
return { tenant: row.tenant, team: row.team, user: a.email }
}
Governance Dashboards (JSON Skeleton)
{
"title": "Governance: Usage & Cost",
"panels": [
{"type": "table", "title": "Cost by Team", "targets": [{"expr": "sum by (team) (rate(llm_cost_usd_total[1h]))"}],
{"type": "graph", "title": "Token Usage", "targets": [{"expr": "sum by (tenant) (rate(llm_tokens_input_total[5m]) + rate(llm_tokens_output_total[5m]))"}],
{"type": "stat", "title": "MTD Spend", "targets": [{"expr": "sum(llm_cost_usd_total)"}]}
]
}
Export/Report APIs (OpenAPI)
openapi: 3.0.3
info: { title: Usage Export API, version: 1.0.0 }
paths:
/exports/usage:
get:
parameters:
- in: query
name: from
schema: { type: string, format: date-time }
- in: query
name: to
schema: { type: string, format: date-time }
- in: query
name: tenant
schema: { type: string }
- in: query
name: cursor
schema: { type: string }
responses:
'200':
description: CSV stream
app.get('/exports/usage', async (req, res) => {
res.setHeader('Content-Type', 'text/csv')
const rows = await queryUsage(req.query)
for (const r of rows) res.write(toCsv(r) + '\n')
res.end()
})
Replay Sandbox Design
- Isolated environment with read-only data mirrors
- Replay traces by request_id using stored prompts (hashed lookup + secure vault for QA set)
- Compare outputs and metrics; no external tool calls
export async function sandboxReplay(id: string){
const rec = await vault.get(id) // secure
const res = await localModel(rec)
return diff(rec.expected, res)
}
Multi-Region DR Metrics and Failover
graph TD
US[us-east] -- active --> GW1
EU[eu-west] -- standby --> GW2
GW1 --> OTEL1
GW2 --> OTEL2
alerts:
- name: RegionFailoverNeeded
expr: probe_success{region="us-east"} == 0
for: 5m
labels: { severity: page }
Failover: flip DNS or global LB; warm caches; verify metrics parity.
SLA Reporting
select date_trunc('day', ts) as day,
1 - (sum(case when status = 'error' then 1 else 0 end)::float / count(*)) as availability
from llm_usage
where tenant = $1 and ts >= now() - interval '30 days'
group by 1 order by 1;
Data Contracts for Telemetry
{
"$schema": "https://json-schema.org/draft/2020-12/schema",
"title": "LlmUsage",
"type": "object",
"required": ["ts","request_id","tenant","model","tokens_in","tokens_out","cost_usd","latency_ms","status"],
"properties": {
"ts": { "type": "string", "format": "date-time" },
"request_id": { "type": "string", "pattern": "^[0-9a-f-]{36}$" },
"tenant": { "type": "string" },
"model": { "type": "string" },
"tokens_in": { "type": "integer", "minimum": 0 },
"tokens_out": { "type": "integer", "minimum": 0 },
"cost_usd": { "type": "number", "minimum": 0 },
"latency_ms": { "type": "integer", "minimum": 0 },
"status": { "type": "string", "enum": ["ok","error"] }
},
"additionalProperties": false
}
API Rate Limit Analytics
select tenant,
sum(case when http_status = 429 then 1 else 0 end) as throttled,
count(*) as total,
sum(case when http_status = 429 then 1 else 0 end)::float / count(*) as rate
from api_gateway_logs
where ts >= now() - interval '7 days'
group by 1 order by rate desc;
Backfill Procedures
- Identify gaps via min(ts) and continuity checks (see the gap-check sketch below)
- Re-ingest from raw logs; idempotent inserts by request_id
- Validate counts and aggregates post-backfill
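A hedged sketch of the gap check mentioned above, assuming node-postgres (pg) and the llm_usage table defined earlier; any hourly bucket in the last week with zero rows is flagged for re-ingestion.
import { Pool } from "pg"

const pool = new Pool() // connection settings come from the standard PG* environment variables

// Returns the hourly buckets in the last 7 days that contain no usage rows at all.
export async function findGaps(): Promise<Date[]> {
  const { rows } = await pool.query(`
    select gs.hour
    from generate_series(date_trunc('hour', now() - interval '7 days'),
                         date_trunc('hour', now()),
                         interval '1 hour') as gs(hour)
    left join llm_usage u on date_trunc('hour', u.ts) = gs.hour
    group by gs.hour
    having count(u.request_id) = 0
    order by gs.hour
  `)
  return rows.map((r: { hour: Date }) => r.hour)
}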
Team Scorecards
team,win_rate,latency_p95_ms,cost_per_1k_tokens
platform,0.74,240,0.009
support,0.69,310,0.011
Extended FAQ (201–260)
- How to validate OpenAPI for export endpoints? Use spectral; CI gate on errors.
- Should replay be full-fidelity? Close; no external tools, fixed seeds, comparable outputs.
- Measuring cost per feature? Route labels; aggregate tokens and cost.
- Handling late-arriving events? Watermarks in ETL; upserts in models.
- Data drift in metrics? Contracts + tests; alerts on schema changes.
- Out-of-order traces? Use span timestamps; tolerate skew.
- Data warehouse choice? Pick existing stack; ensure concurrency.
- KPI roll-ups? Per day/week/month; moving averages.
- On-call dashboards? Minimal; p95, error, cost, provider status.
- Token inflation? Detect tokenizer changes; normalize.
- Click-through in chat UIs? Custom events; correlate to quality.
- Data egress limits? Export throttling; async jobs.
- Deduping events? Primary key on request_id; idempotent writes.
- Regional sampling? Higher at edges; tailor to volume.
- Bill shocks? Budget guards + alerting + throttles.
- Usage caps? Hard caps per tenant; configurable.
- Missing LangSmith traces? Fall back to OTEL; reconcile by request_id.
- Schema versioning? Add schema_version; handle in ETL.
- Report delivery? S3 pre-signed URLs; email notification.
- Forecasting errors? Confidence intervals; conservative plans.
- Embed dashboards? Read-only tokens; scoped views.
- Access control for exports? SSO; per-tenant; audit logs.
- Audit trail completeness? Cross-check volumes across systems.
- Multi-cloud telemetry? Unify labels; merge in warehouse.
- Cost anomalies at night? Bot traffic; rate limits; schedule blockers.
- Token spikes by model? Check prompts; trim; caching.
- Eval/obs integration? Add eval scores as metrics; correlate.
- Business KPIs feed? Join tables; product analytics.
- ROI tracking? Cost per success; trend downwards.
- Trace tail length? Cap spans; store key events.
- Live vs batch? Both; live for alerts, batch for reports.
- Exports format? CSV, Parquet; schema stable.
- Data freshness SLO? <15m for dashboards; <1h for finance.
- Multi-tenant fairness on dashboards? Normalize by plan tiers.
- Privacy filters for exports? Hash IDs; omit content.
- Incident tagging in data? Event table; join on request_id.
- Tool cost share? Tag spans; allocate cost.
- P95 vs P99? Track both; page on P95.
- New model adoption KPI? % traffic; quality and cost deltas.
- Backpressure metrics? Queue depth; 429 rate.
- Warehouse partitioning? Partition by day; cluster by tenant.
- Cold starts? Warmers; track first-token latency.
- Cached response attribution? Mark cached; cost near-zero.
- Long traces storage? Exemplars or compressed blobs.
- Dashboard permissions? Org-level roles; audit.
- ETL idempotency? Upserts; checksum validation.
- SLA reporting to customers? Share dashboards; monthly reports.
- Cost per 1k tokens trend? Aim downward; action plans.
- Timezone handling? Store UTC; display local.
- Dirty data? Quarantine and fix; document incidents.
- Vendor API changes? Contract tests; error spikes.
- Cache eviction policy? LRU + TTL; monitor hit rate.
- Incident hindsight bias? Use data; avoid speculation; blameless.
- Rollup latencies? Accept lower precision for faster queries.
- Request replay privacy? Anonymize; consent; secure vault.
- Multi-tenant quotas on exports? Rate limit; paginate; async jobs.
- Docs for observability-as-code? README + examples; PR templates.
- Cost attribution disputes? Logs + traces as evidence.
- Tool adoption metric? Usage per route and success rate.
- When to re-architect telemetry? If costs/latency scale poorly; simplify pipelines.
Provider Exporters (Datadog, CloudWatch)
exporters:
datadog:
api:
site: datadoghq.com
key: ${DATADOG_API_KEY}
awsemf:
namespace: LLM/Gateway
log_group_name: "/aws/otel/llm"
region: us-east-1
Infra-as-Code: Prometheus/Grafana/Loki (Helm)
# values-prom.yaml
prometheus:
server:
retention: 15d
resources:
requests: { cpu: 1, memory: 2Gi }
limits: { cpu: 2, memory: 4Gi }
grafana:
adminPassword: ${GRAFANA_PASSWORD}
persistence: { enabled: true, size: 10Gi }
loki:
config:
table_manager:
retention_deletes_enabled: true
retention_period: 720h
helm upgrade --install obs-stack grafana/loki-stack \
  -f values-prom.yaml -n observability --create-namespace
Terraform: Managed Grafana + Prometheus
resource "aws_grafana_workspace" "llm" {
name = "llm-ops"
account_access_type = "CURRENT_ACCOUNT"
authentication_providers = ["AWS_SSO"]
}
resource "aws_prometheus_workspace" "llm" {
alias = "llm-metrics"
}
SLO Multi-Window Burn Calculators
alerts:
- alert: FastBurn
expr: (1 - sum(rate(llm_generate_success_total[5m])) / sum(rate(llm_generate_total[5m]))) > (1 - 0.999) * 14.4
for: 5m
labels: { severity: page }
- alert: SlowBurn
expr: (1 - sum(rate(llm_generate_success_total[1h])) / sum(rate(llm_generate_total[1h]))) > (1 - 0.999) * 6
for: 2h
labels: { severity: page }
PromQL Query Examples
# p95 latency by model
histogram_quantile(0.95, sum by (le, model) (rate(llm_generate_latency_seconds_bucket[5m])))
# tokens/sec throughput
sum by (model) (rate(llm_tokens_input_total[1m]) + rate(llm_tokens_output_total[1m]))
# cost per tenant per minute
sum by (tenant) (rate(llm_cost_usd_total[1m]))
SQL Query Examples (Warehouse)
-- Top 10 costly prompts (by template)
select template_id, sum(cost_usd) as cost
from llm_usage where ts >= now() - interval '7 days'
group by 1 order by cost desc limit 10;
-- Latency distribution per model
select model, percentile_cont(0.95) within group (order by latency_ms) as p95
from llm_usage where ts >= now() - interval '1 day'
group by 1;
Loki Query Examples (Logs)
{app="llm-gateway"} |= "error" | json | line_format "{{.request_id}} {{.message}}"
{app="llm-gateway"} | json | unwrap latency_ms | quantile_over_time(0.95, 5m)
Synthetic Probe Scheduler
import cron from "node-cron"
const PROBES = [
{ name: "summary", prompt: "Summarize: ..." },
{ name: "extraction", prompt: "Extract JSON: ..." }
]
cron.schedule("*/10 * * * *", async () => {
for (const p of PROBES) {
const t0 = Date.now()
const r = await generate(p.prompt)
recordProbe({ name: p.name, ok: !!r, ms: Date.now()-t0 })
}
})
Cache Observability
const hits = new client.Counter({ name: "llm_cache_hits_total", help: "cache hits" })
const misses = new client.Counter({ name: "llm_cache_misses_total", help: "cache misses" })
export function cacheGet(key: string){ const v = cache.get(key); (v?hits:misses).inc(); return v }
Backlog/Queue Metrics
const qDepth = new client.Gauge({ name: "llm_queue_depth", help: "queue depth" })
const qWait = new client.Histogram({ name: "llm_queue_wait_seconds", help: "queue wait", buckets: [0.01,0.05,0.1,0.2,0.5,1,2] })
Token-Per-Second Throughput
const tps = new client.Gauge({ name: "llm_tokens_per_second", help: "tokens/sec" })
setInterval(() => { tps.set(tokensProducedLastInterval / intervalSeconds) }, 1000)
On-Call Playbook Decision Trees
Latency Spike?
- Provider healthy?
- No: switch route → smaller model
- Yes: Queue depth high?
- Yes: scale pods; check batching
- No: Reranker slow? reduce k
Incident Templates
Title: P1 Latency Degradation
Timeline: 10:00 start, 10:25 mitigation, 10:35 resolved
Impact: 12% requests > 1s
Root Cause: reranker model deploy with batch misconfig
Actions: revert config; add pre-deploy load test; update dashboard
Retention Configs
logs_retention_days: 30
traces_retention_days: 7
metrics_retention_days: 395
privacy:
hash_ids: blake3
redact_content: true
Cost Guard Scripts
#!/usr/bin/env bash
set -euo pipefail
TENANT=${1}
LIMIT=${2}
SPENT=$(psql -tA -c "select coalesce(sum(cost_usd),0) from llm_usage where tenant='${TENANT}' and ts >= date_trunc('month', now())")
awk -v s="$SPENT" -v l="$LIMIT" 'BEGIN{ if (s>l) { print "exceeded"; exit 1 } else { print "ok" } }'
Extended FAQ (261–340)
- Metric cardinality control? Avoid high-cardinality labels like request_id; use exemplars.
- Separate read/write Prometheus? Use remote-write for long-term; local for fast queries.
- Grafana alerting or Alertmanager? Prefer Alertmanager for complex routing.
- How to avoid costly joins in warehouse? Partition and cluster; pre-aggregate with dbt.
- Synthetic probe frequency? Every 10 minutes baseline; increase for critical flows.
- Cache metrics target? Hit rate > 60% for repetitive queries.
- How to store replay corpora safely? Encrypted vault; limited access; expiry policies.
- Cross-team visibility? Dashboards per team with shared core views.
- Loading dashboards as code? Provision via Grafana API; version in Git.
- Backfill windows? Keep to 7–30 days; communicate with finance.
- Multi-tenant billing accuracy? Cross-check with API logs; reconcile differences.
- Tokens vs characters? Always tokens for cost; characters for UX only.
- Kafka vs Kinesis? Use whatever your org supports; focus on schema and SLAs.
- Tracing overhead? Sample; minimal attributes; avoid heavy logs.
- Prompt variants tracking? Template IDs with versions; correlate to metrics.
- GPU queue depth? Instrument; shed load before saturation.
- Cost by route? Label spans with route; aggregate.
- Can we skip LangSmith? Optional; an OTEL baseline works fine.
- P95 vs P50? Track both; P95 for paging.
- Data duplication across systems? Yes; reconcile with IDs; accept some duplication.
- Real-time budgets? Enforce per request and per minute.
- Overcount tokens? Validate against provider; fix logic.
- Heterogeneous models? Normalize metrics to per-1k tokens.
- Cost per success? Key KPI; optimize routing.
- Errors without traces? Add logging; ensure sampling picks errors.
- Columnar storage? Prefer for analytics; Parquet.
- Shipping logs direct to SIEM? Use OTEL logs or forwarders; consider cost.
- Are traces necessary for RAG? Yes, for visibility into retrieval and reranking.
- Alert flapping? Hysteresis; "for" durations; smoothing.
- Budget alerts noise? Daily rollups; alert on deltas.
- Cost per tenant fairness? Normalize by plan; enforce quotas.
- Storage costs? Retention tuning; compress; glacier tiers.
- On-call fatigue? Rotate; automate; refine alerts.
- Auto ticket creation? Yes for P3; pager for P1/P2.
- Workflow to fix high latency? Reranker tuning, batch adjust, model switch, cache warm.
- Failover metrics parity? Compare both regions; alert on drift.
- Customer-facing status pages? Update SLAs and incidents; transparency.
- Prompt registry drift? Diffs in CI; alerts on volume changes.
- PCI/PII compliance? Hash data; segregate; limit retention.
- Onboarding new teams? Templates, dashboards, budgets; guardrails.
- Data model changes? Contracts; deprecation plan; migration scripts.
- Burndown of incidents? Track MTTR/MTTD trends; aim downwards.
- SLI reviews? Monthly; adjust/retire; align to business.
- Logs dropping? Backpressure and retry; monitor loss rate.
- Trace sampling envs? Higher in staging; lower in prod with tail sampling.
- Export formats? CSV for finance; Parquet for analytics.
- Model-specific dashboards? Yes for top models; shared core for all.
- Token throttling impact? Watch success rate; degrade gracefully.
- Response truncation detection? Count stop reasons; track.
- Security metrics integration? Include guardrail counters; link to SIEM.
- SLA breaches root cause? Postmortems with data; actions.
- ETL ownership? Data team; on-call rotations.
- Warehouse SLAs? Set for report freshness; monitor.
- Synthetics vs canaries? Both; canaries on prod traffic subset.
- Visualization sprawl? Curate; archive; lint dashboards.
- Tool latency breakdown? Span per tool; aggregate.
- Multi-cloud cost view? Normalize; tags; combined dashboards.
- Real-time ETL? Stream processors; limited state; summarize.
- Escalation policy? Pager, then on-call lead, then incident commander.
- Quarantine noisy tenants? Throttles; isolation; communication.
- Budget resets automation? Cron + API; notify owners.
- Data catalog? Schemas in repo; docs site.
- Validating dashboards? Snapshot tests; API checks.
- Internal SLAs? Between platform and product teams.
- Provider migrations? Shadow period; double instrumentation.
- Logs vs metrics retention? Longer for metrics; logs are expensive.
- Alert audit trail? Store notifications; links to incidents.
- New post types? A/B; track metrics improvements.
- ETL late data window? Define; handle with upserts.
- Usage cost caps per day? Yes; throttle when near cap.
- Policy-driven sampling? Higher for risky routes; lower for safe ones.
- UI performance? Measure TTI for chat; optimize streaming.
- Cross-team SLIs? Shared definitions; consistent metrics.
- Unit economics? Cost per solved ticket; per conversation.
- Export to finance tools? CSV/ETL; ownership and cadence.
- Are histograms necessary? Yes, for latency percentiles.
- Logs encryption? At rest and in transit.
- Dashboard permissions drift? Audit and reconcile monthly.
- SRE buy-in? Show incident reduction; low toil.
- When to call observability done? When incidents are rare, cheap, and quickly resolved.
Provider Dashboards (Datadog / New Relic)
# Datadog dashboard JSON (snippet)
widgets:
- definition:
title: "LLM p95 Latency"
type: timeseries
requests:
- q: "histogram_quantile(0.95, sum:llm_generate_latency_seconds.bucket{*} by {le})"
- definition:
title: "Cost/min"
type: query_value
requests:
- q: "sum:rate(llm_cost_usd_total[1m])"
// New Relic NRQL examples
{"query": "SELECT percentile(latencyMs,95) FROM LlmGenerate TIMESERIES 1 minute"}
{"query": "SELECT sum(costUsd) FROM LlmGenerate FACET tenant SINCE 1 day ago"}
Tracing Baggage and Links
import { context, propagation } from "@opentelemetry/api"
const baggage = propagation.createBaggage({ tenant: { value: tenantId }, route: { value: routeName } })
const ctx = propagation.setBaggage(context.active(), baggage)
await context.with(ctx, async () => { await generateWithTrace(req) })
// Link retriever span to generator span
span.addLink({ context: retrieverSpan.spanContext(), attributes: { role: "retrieval" } })
eBPF Telemetry (CPU/GPU)
# bpftrace snippet counting write() syscalls (a cheap proxy for I/O pressure)
bpftrace -e 'tracepoint:syscalls:sys_enter_write { @writes = count(); }'
node_exporter:
  enabled_collectors:
    - textfile
    - cpu
    - diskstats
# GPU metrics come from a dedicated exporter (e.g., NVIDIA's DCGM exporter) rather than node_exporter
Data Quality Checks for Telemetry
# dq_checks.py
rules = [
(lambda r: r["tokens_in"] >= 0, "tokens_in >= 0"),
(lambda r: r["latency_ms"] < 60000, "latency < 60s"),
(lambda r: r["status"] in ("ok","error"), "status enum")
]
python dq_checks.py --input llm_usage_*.jsonl --fail-on 0.01
CI Gates for Dashboards and Alerts
# validate dashboards json
npx grafana-dashboard-validator dashboards/*.json
# validate alert rules
amtool check-config alertmanager.yml
Tenancy Governance Reports
select tenant, sum(cost_usd) as cost, sum(tokens_in+tokens_out) as tokens,
avg(latency_ms) as avg_latency
from llm_usage where ts >= date_trunc('month', now())
group by 1 order by cost desc;
Rollback Metrics
span.addEvent("rollback", { from: "model-x:1.2.0", to: "model-x:1.1.9", reason: "latency" })
-- Measure post-rollback recovery
select avg(latency_ms) from llm_usage where ts >= now() - interval '2 hours';
RUM for Chat UIs
// Web vitals and streaming start time
const t0 = performance.now()
const source = new EventSource('/api/stream')
source.addEventListener('message', () => {
rum.track('ttft', performance.now() - t0)
source.close()
})
Correlating with SIEM Events
// Splunk/ELK correlation
index=llm app="gateway" | eval minute=_time - (_time % 60)
| stats count by minute
| join minute [ search index=siem sourcetype=guardrail ]
Managed Providers Notes
- Consider managed observability: Datadog, New Relic, Grafana Cloud
- Trade-offs: vendor lock-in vs speed; export raw data to your warehouse
Extended FAQ (341–420)
- Should we store baggage attributes? Only if privacy-safe; use hashes for PII.
- GPU metrics granularity? 1s resolution is enough; avoid high-cardinality labels.
- Are tail-based samplers worth it? Yes; capture slow/error traces without huge volume.
- How to test dashboards in CI? Lint JSON, snapshot tests, and API validation.
- Detect missing telemetry? Heartbeats and canaries; alert on gaps.
- Time-to-first-token metric? Record separately; key for UX.
- Token/sec vs req/sec? Track both for throughput understanding.
- Jitter in traces? Sync NTP; tolerate small skews.
- Data quality SLO? <0.5% invalid records in warehouse.
- Multiple providers? Normalize labels; avoid duplication in budgets.
- Can we hide prompts? Hash and store in vault for QA only.
- Credits for SLA? Data-driven; automate reports.
- Post-rollback checks? Compare p95 and error rate vs baseline.
- Who owns dashboards? Platform SRE, with product-specific views by teams.
- ETL governance? Data team; documented contracts and owners.
- Alert storm handling? Silence non-critical; raise sampling; focus on P1.
- Logs vs spans for errors? Both; spans for context, logs for searchability.
- What breaks PromQL? High cardinality, missing metrics, bad label joins.
- Budget segmentation? Per-tenant and per-feature budgets.
- Cache cold start measurement? TTFT spikes; log warmup events.
- Synthetic users? Tag and exclude from business KPIs.
- A/B observability? Variant labels across all metrics and traces.
- Infrastructure contention? Track CPU steal, IO wait; autoscale.
- Multi-region time drift? Use UTC and synced clocks.
- Late-arrival guard? ETL windows and idempotent upserts.
- Grafana folders? Organize by domain; restrict permissions.
- Loki retention tuning? Balance cost vs need; use compaction.
- Warehouse cost control? Partition, cluster, and query caching.
- Security correlation? Join guardrail events to LLM usage.
- Tracing log sampling? Sample structured logs linked to traces.
- Traffic spikes detection? EWMA thresholds; auto-scale and pre-warm.
- Model routing drift? Track route distribution over time.
- Token accounting precision? Use vendor SDK tokenizers.
- Dropped spans? Monitor span loss; adjust batch sizes.
- Tooltip overload in dashboards? Simplify panels; documentation.
- Indexing strategies in warehouse? Cluster by tenant/model; materialized views.
- Annotation usage? Mark deploys and incidents on charts.
- Open source vs managed? Start open; consider managed at scale.
- Data residency in telemetry? Route by region; separate workspaces.
- Pre-prod environments? Mirror instrumentation; separate data stores.
- Ratio alerts? Safer than absolute counts; less noisy.
- Backup strategies? Snapshots; DR rehearsals; restore objectives.
- UX metrics for chat? TTFT, tokens/sec, completion rate.
- Cost KPIs? $/1k tokens and $/success.
- Grafana annotations API? Use for deploy markers tied to commits.
- Traces in PRs? Attach performance results to PR checks.
- Alert escalations? Time-based; P1 pager then on-call lead.
- Multi-team budgets? Split by tags; enforce with guards.
- Fine-grained access? Row-level policies for tenant data.
- What if the warehouse is down? Queue events; backfill later.
- Reranker visibility? Separate metrics; correlate with success.
- Golden set drift? Refresh monthly; add incident cases.
- Reliable TTFT capture? Client-side RUM plus server timestamp.
- Token mismatch vs provider? Reconcile; update mapping.
- Dashboards mobile-friendly? Minimal key panels; alerts to mobile.
- Contracts violations? Block merges; notify owners.
- Multi-cloud exporters? Abstract; use OTEL as base.
- Consistency across services? Shared libraries; lint metrics.
- Platform migration? Double-write; verify parity.
- Observability onboarding? Templates; docs; office hours.
- How much to sample? Start head 20% + tail; tune.
- Is p99 needed? For SLA edges; beware noise.
- Detect silent truncation? Track stop reasons; compare token targets.
- Currency conversion for cost? Apply FX in reports; document source.
- Per-tenant SLAs? Possible; isolate and monitor separately.
- Ownership of incidents? Incident commander; product + platform SMEs.
- KPI gamification risks? Avoid vanity; align to outcomes.
- Dark launches? Shadow traces; compare silently.
- Long-term trends? Seasonality charts; rolling means.
- Automated runbooks? Bots suggest actions based on signals.
- RAG-specific SLOs? Recall/groundedness floors.
- Multi-embedding visibility? Label versions; compare recall.
- Alert duration? Use "for" clauses; avoid flapping.
- Sub-minute metrics? Use when necessary; cost trade-offs.
- Customer health scores? Composite: latency, quality, cost.
- Heterogeneous pricing? Normalize to per-1k tokens.
- BigQuery vs Redshift? Pick based on org; both fine.
- Arbitrary drill-down? Explore with traces; filter by labels.
- Metric streams to BI? Export; low-frequency aggregates.
- Closing the loop? Feed metrics into routing and budgeting.