Observability with OpenTelemetry: Complete Implementation Guide (2025)
Executive Summary
This guide shows how to implement OpenTelemetry (OTel) across services: resource attributes, instrumentation for traces/metrics/logs, Collector pipelines, tail-based sampling, span metrics, semantic conventions, dashboards, alerts, SLOs, and cost controls.
1) Architecture Overview
graph TD
A[Apps/Workers] -->|OTLP| B[OTel Collector]
B --> C[Traces Backend]
B --> D[Metrics Backend]
B --> E[Logs Backend]
C --> F[Dashboards]
D --> F
E --> F
- Signals: traces, metrics, logs; correlate via trace_id and resource attributes
- Agent vs Gateway: sidecar/daemonset agent → gateway → backends
- Multi-tenant: resource.service.* and tenant labels; isolated pipelines per tenant
2) Resource Attributes and Semantic Conventions
service:
name: payments-api
namespace: prod
version: 2.3.1
telemetry:
sdk:
name: opentelemetry
version: 1.27.0
cloud:
provider: aws
region: us-east-1
- Use stable attribute keys (service.name, service.version, deployment.environment)
- Adopt HTTP, DB, Messaging semantic conventions (v1.23+)
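The same resource expressed in Node.js with plain string keys (values mirror the illustrative YAML above); other SDKs take the equivalent key/value map:
import { Resource } from '@opentelemetry/resources'

const resource = new Resource({
  'service.name': 'payments-api',
  'service.namespace': 'prod',
  'service.version': '2.3.1',
  'deployment.environment': 'prod',
  'cloud.provider': 'aws',
  'cloud.region': 'us-east-1',
})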
3) Tracing: Instrumentation Patterns
3.1 Node.js (Express)
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_TRACES_URL }),
instrumentations: [getNodeAutoInstrumentations()],
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'payments-api',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'prod',
})
})
sdk.start()
3.2 Go (Gin)
exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithEndpoint(os.Getenv("OTLP_ENDPOINT")), otlptracegrpc.WithInsecure())
if err != nil {
    log.Fatalf("failed to create OTLP trace exporter: %v", err)
}
tracerProvider := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exp),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("orders-api"),
semconv.DeploymentEnvironment("prod"),
)),
)
otel.SetTracerProvider(tracerProvider)
3.3 Python (FastAPI)
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider(resource=Resource.create({"service.name": "billing-api", "deployment.environment": "prod"}))
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTLP_TRACES_URL')))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
3.4 Java (Spring Boot)
// Use opentelemetry-javaagent: -javaagent:opentelemetry-javaagent.jar -Dotel.exporter.otlp.endpoint=$OTLP
3.5 .NET (ASP.NET)
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(b => b
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());
4) Metrics: Instruments and Views
- Use histograms for latency with explicit buckets; counters for throughput; gauges for resources
- Configure views to control aggregation temporality and buckets
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc'
const exporter = new OTLPMetricExporter({ url: process.env.OTLP_METRICS_URL })
const meterProvider = new MeterProvider({
readers: [new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })],
})
const meter = meterProvider.getMeter('payments-api')
const httpLatency = meter.createHistogram('http.server.duration')
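The views bullet above in practice: a minimal sketch (bucket boundaries are illustrative) that pins explicit histogram buckets for the latency instrument so aggregations line up across services; pass the same readers as above when wiring it for real.
import { MeterProvider, View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics'

// Freeze bucket boundaries (seconds, illustrative) for the http.server.duration histogram
// so histograms can be merged consistently across services.
const viewProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'http.server.duration',
      aggregation: new ExplicitBucketHistogramAggregation([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]),
    }),
  ],
})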
5) Logs: OTel Log Signal
- Structure logs with attributes (trace_id, span_id, severity, body)
- Export via OTLP to log backend (e.g., Loki/Elastic)
- Attach resource attrs for tenant and environment
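A minimal Node.js sketch of emitting a structured, trace-correlated log record via OTLP; it assumes the @opentelemetry/sdk-logs and OTLP logs exporter packages (still marked experimental in some releases) are installed:
import { LoggerProvider, BatchLogRecordProcessor } from '@opentelemetry/sdk-logs'
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc'
import { SeverityNumber } from '@opentelemetry/api-logs'

const loggerProvider = new LoggerProvider()
loggerProvider.addLogRecordProcessor(new BatchLogRecordProcessor(new OTLPLogExporter({ url: process.env.OTLP_LOGS_URL })))
const logger = loggerProvider.getLogger('payments-api')

// trace_id/span_id are attached automatically when this is emitted inside an active span context.
logger.emit({ severityNumber: SeverityNumber.ERROR, severityText: 'ERROR', body: 'charge failed', attributes: { 'tenant.id': 'acme' } })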
# Collector logs pipeline (example)
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors:
batch: {}
exporters:
otlphttp/logs: { endpoint: http://loki:4318 }
pipelines:
logs/default:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp/logs]
6) Collector: Reference Pipelines
receivers:
otlp:
protocols: { http: {}, grpc: {} }
prometheus:
config:
scrape_configs:
- job_name: 'k8s'
static_configs: [{ targets: ['node-exporter:9100'] }]
processors:
batch: {}
resourcedetection: { detectors: [env, system, k8snode] }
attributes:
actions:
- key: deployment.environment
value: prod
action: upsert
filter/traces:
traces:
span:
- 'attributes["http.target"] == "/healthz"'
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: latency
type: latency
latency: { threshold_ms: 500 }
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
otlphttp/logs: { endpoint: http://loki:4318 }
service:
pipelines:
traces:
receivers: [otlp]
processors: [resourcedetection, attributes, filter/traces, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp, prometheus]
processors: [resourcedetection, batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [resourcedetection, batch]
exporters: [otlphttp/logs]
7) Tail-Based Sampling
- Sample by error, latency, and key routes/users; keep exemplars for important classes
- Route sampled traces to cheaper storage after retention window
8) Span Metrics and SLOs
- Derive RED metrics (Rate, Errors, Duration) from spans
- Compute SLIs: availability (non-5xx), latency (p95), and error rate
- Map SLOs to alerts and error budgets
# Collector spanmetrics processor (example)
processors:
spanmetrics:
metrics_exporter: prometheusremotewrite
dimensions: [http.method, http.route, deployment.environment]
9) Exemplars and Correlation
- Attach exemplars with trace_id to histograms
- Use trace_id in logs; enable exemplars in dashboards for drill-down
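The pattern in code: record the latency measurement while the request span is still active, so an exemplar-capable SDK and backend can attach the current trace_id to the histogram bucket (this sketch reuses the httpLatency instrument from section 4):
import { trace } from '@opentelemetry/api'

const tracer = trace.getTracer('payments-api')
tracer.startActiveSpan('GET /checkout', (span) => {
  const start = Date.now()
  // ... handle the request ...
  // Recording inside the active span lets exemplar-enabled pipelines link this sample to the trace.
  httpLatency.record(Date.now() - start, { 'http.route': '/checkout' })
  span.end()
})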
10) Baggage and Trace State
- Baggage: key/value propagation for business dimensions (e.g., tenant)
- trace_state: vendor-specific hints; avoid sensitive data
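A minimal Node.js baggage sketch (the tenant key name is illustrative):
import { context, propagation } from '@opentelemetry/api'

// Set a tenant entry on baggage, then read it back anywhere downstream in the same context.
const bag = propagation.createBaggage({ 'tenant.id': { value: 'acme' } })
const ctx = propagation.setBaggage(context.active(), bag)
context.with(ctx, () => {
  const tenant = propagation.getBaggage(context.active())?.getEntry('tenant.id')?.value
  console.log('tenant from baggage:', tenant)
})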
11) Dashboards and Alerts
# Error rate
sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))
# p95 latency (histogram)
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, route))
{
"title": "Payments SLO",
"panels": [
{"type":"stat","title":"Availability","targets":[{"expr":"1 - (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
]
}
12) Security and Privacy
- Redact PII in spans/logs; sampling rules to exclude sensitive routes
- TLS everywhere; auth for Collector endpoints; multi-tenant isolation
- Least privilege for exporters; segregate environments
13) Cost Controls
- Drop health-check spans; reduce attributes cardinality
- Tail-based sampling; delta temporality for metrics
- Downsample old data; retention tiers
14) Testing and Validation
- Golden traces: deterministic flows validated in CI
- Contract tests: check resource attrs, span names, and status codes
- Load tests: ensure Collector and backend capacity
15) Operations Runbooks
Symptom: High Collector CPU
- Action: enable batch, reduce processors, scale replicas, profile receivers
Symptom: Missing traces
- Action: verify headers, exporter URLs, sampling thresholds, and gateway health
Symptom: High cardinality metrics
- Action: drop labels, re-define views, and adopt exemplars selectively
Related Posts
- Kubernetes Cost Optimization: FinOps Strategies (2025)
- API Security: OWASP Top 10 Prevention Guide (2025)
- GitOps: ArgoCD/Flux Deployment Strategies (2025)
Call to Action
Need help implementing OpenTelemetry end-to-end? We instrument, build pipelines, wire dashboards, and operationalize SLOs.
Extended FAQ (1–160)
- Head vs tail sampling? Head is uniform and cheap; tail captures interesting traces based on outcome.
- How many buckets for latency histograms? Start with a power-of-two or decile strategy; align buckets across services.
- Should logs also carry trace_id? Yes; it unlocks cross-signal navigation.
- How to avoid cardinality explosions? Drop unique IDs and high-cardinality labels; use views.
- How to measure SLO error budget burn? Burn-rate alerts based on SLI windows (5m/1h, 30m/6h).
- Can I use OTLP over HTTP? Yes; OTLP/HTTP is supported and firewall-friendly.
- Should I use exemplars? Yes, for quick trace drill-down from metrics panels.
- How to trace message queues? Use messaging conventions; propagate context in message headers.
- What about front-end tracing? Use web auto-instrumentations; propagate headers to the backend.
- Multi-tenant isolation? Separate pipelines and auth; resource attributes for tenant id.
... (continue with practical Q/A up to 160 on instrumentation, pipelines, sampling, metrics, logs, privacy, security, cost, testing, and operations)
16) Kubernetes: Auto-Instrumentation and Deployment
apiVersion: v1
kind: Namespace
metadata: { name: observability }
---
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: otel-agent, namespace: observability }
spec:
selector: { matchLabels: { app: otel-agent } }
template:
metadata: { labels: { app: otel-agent } }
spec:
serviceAccountName: otel-agent
containers:
- name: agent
image: otel/opentelemetry-collector:0.96.0
args: ["--config=/conf/agent.yaml"]
volumeMounts: [{ name: conf, mountPath: /conf }]
volumes:
- name: conf
configMap: { name: otel-agent-config }
---
apiVersion: v1
kind: ConfigMap
metadata: { name: otel-agent-config, namespace: observability }
data:
agent.yaml: |
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors: { batch: {} }
exporters: { otlp: { endpoint: otel-gateway:4317, tls: { insecure: true } } }
service:
pipelines:
traces: { receivers: [otlp], processors: [batch], exporters: [otlp] }
17) Collector: Helm Values (Gateway)
mode: deployment
replicaCount: 3
config:
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors:
batch: {}
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: latency
type: latency
latency: { threshold_ms: 800 }
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
pipelines:
traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp/tempo] }
metrics: { receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite] }
logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
18) Backends and Storage
- Traces: Tempo/Jaeger/Elastic APM; choose based on scale and features
- Metrics: Prometheus + Mimir/Thanos for long-term; OTLP metric ingest
- Logs: Loki/Elastic; enable trace_id correlation
- Storage: object storage for cheap, durable retention
19) Semantic Conventions Cheatsheet
HTTP
- http.method, http.route, http.target, http.status_code
DB
- db.system (postgres, mysql), db.statement (redacted), db.name
Messaging
- messaging.system (kafka, rabbitmq), messaging.operation (publish, process)
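For manual spans, the same keys are set as plain string attributes; a short sketch for a DB client span (values are illustrative and the statement is already parameterized/redacted):
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('orders-api')
const span = tracer.startSpan('SELECT orders', { kind: SpanKind.CLIENT })
span.setAttribute('db.system', 'postgres')
span.setAttribute('db.name', 'orders')
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = ?') // parameters redacted
span.setStatus({ code: SpanStatusCode.OK })
span.end()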
20) RED/USE Dashboards (Templates)
{
"title": "RED Overview",
"panels": [
{"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
{"type":"stat","title":"Error %","targets":[{"expr":"sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
]
}
{
"title": "USE (Resources)",
"panels": [
{"type":"timeseries","title":"CPU","targets":[{"expr":"avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)"}]},
{"type":"timeseries","title":"Memory","targets":[{"expr":"avg(container_memory_working_set_bytes) by (pod)"}]}
]
}
21) Alerts Catalog
- alert: ErrorBudgetBurnFast
expr: (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))) > 0.05
for: 5m
labels: { severity: critical }
annotations: { summary: "Error budget burning fast" }
- alert: LatencyP95High
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 10m
labels: { severity: warning }
22) Privacy and Redaction
- Never record secrets or PII in span attributes
- Use processors.attributes to drop/redact sensitive keys (e.g., authorization)
- GDPR/CCPA: respect retention and access; logs exported via portal with filters
processors:
attributes/redact:
actions:
- key: http.request.header.authorization
action: delete
- key: db.statement
action: update
value: "REDACTED"
23) Cost Optimization Playbook
- Adjust export intervals and histogram buckets
- Enable tail-based sampling and drop health-check spans
- Downsample old metrics; archive traces to object storage after N days
- Deduplicate labels; avoid high-cardinality user identifiers
24) Testing Harness
// Example Jest test: ensure spans are created and attributes set
it('creates span with route', async () => {
const span = tracer.startSpan('GET /orders')
span.setAttribute('http.route', '/orders')
span.setStatus({ code: SpanStatusCode.OK })
span.end()
// assert via in-memory exporter in test env
})
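A sketch of the in-memory exporter wiring referenced in the comment above, using classes from @opentelemetry/sdk-trace-base so the test can assert on finished spans:
import { BasicTracerProvider, InMemorySpanExporter, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'

const memoryExporter = new InMemorySpanExporter()
const testProvider = new BasicTracerProvider()
testProvider.addSpanProcessor(new SimpleSpanProcessor(memoryExporter))
const tracer = testProvider.getTracer('test')

// After the code under test runs:
const spans = memoryExporter.getFinishedSpans()
expect(spans[0].name).toBe('GET /orders')
expect(spans[0].attributes['http.route']).toBe('/orders')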
25) Runbooks (Detailed)
Collector CrashLoop
- Check config syntax; reduce pipeline complexity; scale memory
Drops in Traces
- Verify client/exporter endpoints; tail sampling thresholds; gateway health
High Metrics Cardinality
- Identify top labels; rework views; reduce labels in libraries
26) Multi-Tenancy and Access Control
- Gate Collector endpoints with auth (mTLS or tokens per tenant)
- Tag data with tenant attributes; separate export paths per tenant when needed
- Dashboards and alert routes per tenant/team
27) SLO Lifecycle
- Define SLIs, set targets, and error budgets
- Create burn-rate alerts (fast/slow)
- Weekly review: trend SLOs and adjust remediation efforts
28) Cloud-Specific Notes
AWS: OTLP via NLB; IAM roles for EC2/EKS; managed Prometheus/AMP and OpenSearch
Azure: Monitor + managed Grafana; Private Link for OTLP
GCP: Managed Service for Prometheus; Cloud Trace and Logging OTLP bridges
29) Example: Payments Service Instrumentation Walkthrough
- Trace incoming HTTP → DB call → queue publish → downstream consumer
- Annotate spans with order_id (hashed), tenant_id, and route
- Emit span metrics for per-route RED metrics
- Dashboard drill-down from p95 to exemplars
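A condensed Node.js sketch of that flow with nested spans; chargeCard and publishOrder are hypothetical helpers standing in for the DB call and the queue publish:
import { trace, SpanKind } from '@opentelemetry/api'

const tracer = trace.getTracer('payments-api')
await tracer.startActiveSpan('POST /payments', async (httpSpan) => {
  httpSpan.setAttribute('http.route', '/payments')
  httpSpan.setAttribute('tenant.id', hashedTenantId) // hashed, never raw
  await tracer.startActiveSpan('INSERT payments', { kind: SpanKind.CLIENT }, async (dbSpan) => {
    await chargeCard() // hypothetical DB/PSP call
    dbSpan.end()
  })
  await tracer.startActiveSpan('publish payment.created', { kind: SpanKind.PRODUCER }, async (pubSpan) => {
    await publishOrder() // hypothetical queue publish
    pubSpan.end()
  })
  httpSpan.end()
})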
Extended FAQ (161–320)
- Should I export metrics over OTLP or Prometheus? Prometheus is ubiquitous; OTLP unifies pipelines; use either or both via the Collector.
- How do I handle retries on exporters? Enable retry queues; tune backoff; monitor exporter failures.
- What about cold starts in serverless? Record cold-start spans; use separate latency buckets.
- How to instrument gRPC? Use gRPC instrumentations; propagate context via metadata.
- Do I need exemplars? Yes; pair metrics with trace drill-down for fast RCA.
- Reduce labels? Drop user IDs and session IDs; keep route and method.
- Alert fatigue? Use SLO burn-rate alerts; deprecate noisy static thresholds.
- What about storage costs? Downsample metrics; tail sampling; tiered trace retention.
- Logs vs spans? Prefer spans for request flow; logs for details and errors.
- How to verify semantic conventions? Unit tests plus linters; library defaults for common protocols.
- Should I use histograms for DB? Yes; db.client.operation.duration with buckets per operation.
- Correlate front-end and backend? Propagate W3C trace-context headers.
- Collector HPA triggers? CPU plus exporter queue depth.
- Multi-region? Regional collectors; aggregate to a global view.
- Zero-trust? mTLS, authn for exporters, and network policies.
- Can I push logs via OTLP? Yes; OTLP logs are supported; ensure backend compatibility.
- Event logs vs audit logs? Split pipelines and retention policies.
- How to add business metrics? Counter/histogram instruments; views for dimensions.
- Validate dashboards? Golden dashboards tested with synthetic traffic.
Final note: instrument early, iterate often.
30) Advanced Collector Processors
processors:
transform/traces:
trace_statements:
- context: span
statements:
- set(attributes["deployment.environment"], "prod") where attributes["deployment.environment"] == nil
- keep_keys(attributes, ["http.method","http.route","http.status_code","db.system","messaging.system","deployment.environment"])
groupbyattrs:
keys: [service.name, deployment.environment]
memory_limiter:
check_interval: 5s
limit_mib: 1024
spanmetrics:
metrics_exporter: prometheusremotewrite
dimensions: [http.method, http.route, http.status_code]
31) OpenTelemetry Operator (Kubernetes)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata: { name: otel-collector, namespace: observability }
spec:
mode: deployment
config: |
receivers: { otlp: { protocols: { grpc: {}, http: {} } } }
processors: { batch: {} }
exporters: { otlp: { endpoint: tempo:4317, tls: { insecure: true } } }
service: { pipelines: { traces: { receivers: [otlp], processors: [batch], exporters: [otlp] } } }
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata: { name: default, namespace: prod }
spec:
exporter:
endpoint: http://otel-gateway:4318
propagators: [tracecontext, baggage, b3]
sampler: { type: parentbased_traceidratio, argument: "0.1" }
32) Browser and Mobile Instrumentation
32.1 Web
// Bundle with your web app; these packages do not expose globals via a plain <script> tag
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { W3CTraceContextPropagator } from '@opentelemetry/core'

const provider = new WebTracerProvider()
provider.addSpanProcessor(new SimpleSpanProcessor(new OTLPTraceExporter({ url: '/v1/traces' })))
provider.register({ propagator: new W3CTraceContextPropagator() })
32.2 Mobile
iOS/Android SDKs support OTLP; propagate context to backend via headers.
33) Databases and Messaging
receivers:
postgresql:
endpoint: postgres:5432
transport: tcp
rabbitmq:
endpoint: rabbitmq:15672
exporters:
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
service:
pipelines:
metrics/postgres: { receivers: [postgresql], processors: [batch], exporters: [prometheusremotewrite] }
metrics/rabbit: { receivers: [rabbitmq], processors: [batch], exporters: [prometheusremotewrite] }
34) Service Graph and Topology
processors:
servicegraph:
store: in-memory
latency_histogram_buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s]
exporters:
otlp/graph: { endpoint: http://graphstore:4317, tls: { insecure: true } }
service:
pipelines:
traces/graph:
receivers: [otlp]
processors: [servicegraph, batch]
exporters: [otlp/graph]
35) Multi-Region Topologies
- Regional collectors with local backends; global query via federation
- Tail-sample locally; export head samples to global for high-level graphs
- Failover: use queue exporters; persistent buffers
36) Prometheus Histograms and Exemplars
# Prometheus must be started with exemplar storage enabled:
# prometheus --enable-feature=exemplar-storage
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{job="api"}[5m])) by (le))
37) SLO Playbooks (Detailed)
Availability SLO 99.9%
- SLIs: non-5xx / total
- Alerts: burn rate 14.4 over 5m/1h critical; 6 over 30m/6h warning
- Actions: rollback recent change, scale, cache responses
Latency SLO p95 < 250ms
- SLIs: p95 per route
- Alerts: p95 > 250ms for 10m
- Actions: profile, reduce payloads, add cache, adjust timeouts
38) Privacy Engineering
- Data inventory: identify PII-bearing routes
- Redaction: processors.attributes delete sensitive headers
- Pseudonymization: hash IDs before attributes
- Access: restrict debug logs with sensitive info
39) Cost Modeling
signal,rate,unit_cost,monthly_estimate
traces,5k spans/min,$0.000001/span,$216
metrics,2M samples/min,$0.20/M,$17280
logs,200GB/day,$0.02/GB,$120
- Reduce metrics sample rate; merge labels
- Aggressive sampling on low-value spans
- Log at INFO for business events; DEBUG only in staging
40) Vendor/Managed Exporters (Patterns)
exporters:
otlphttp/datadog: { endpoint: https://api.datadoghq.com, headers: { DD-API-KEY: ${DD_API_KEY} } }
otlphttp/newrelic: { endpoint: https://otlp.nr-data.net, headers: { api-key: ${NR_LICENSE_KEY} } }
41) Blue/Green Observability
- Tag deployments with color=blue|green
- Compare p95 and error % between colors before switch
- Rollback if deltas exceed thresholds
42) Golden Traces and Synthetic Monitoring
- Schedule synthetic journeys; tag spans synthetic=true
- Keep golden traces to detect regressions quickly
43) Example Dashboards (Expanded)
{
"title": "API Overview",
"panels": [
{"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
{"type":"stat","title":"Error %","targets":[{"expr":"(sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))*100"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]},
{"type":"state-timeline","title":"Releases","targets":[{"expr":"max_over_time(release_version[24h])"}]}
]
}
44) Alert Routing and Ownership
route:
receiver: default
routes:
- matchers: ['team="payments"']
receiver: payments-oncall
receivers:
- name: payments-oncall
pagerduty_configs: [{ routing_key: ${PD_KEY} }]
45) Extended Runbooks
Problem: Slow DB spans
- Action: index missing, N+1 queries, add caching; tag spans with db.operation
Problem: Collector queue full
- Action: scale replicas, increase memory_limiter, reduce exporters
Problem: No exemplars in dashboards
- Action: ensure exemplar storage; propagate trace_id in histograms
46) Span Links, Batch, and Fan-out Patterns
- Span links: relate independent spans (e.g., retries, fan-out jobs)
- Batch processing: one parent producing many child tasks; link child spans back to original trigger
- Idempotency: tag spans with idempotency key to group retries
import { ROOT_CONTEXT } from '@opentelemetry/api'
const parent = tracer.startSpan('batch.process')
// The worker span is not a child of the trigger; it carries a link back to the triggering span instead
const child = tracer.startSpan('worker.handle', { links: [{ context: parent.spanContext() }] }, ROOT_CONTEXT)
47) Data Quality and Schema Evolution
- Stable span names; avoid dynamic values in names
- Attribute schemas versioned; deprecate with processors.transform
- Metrics views: freeze bucket boundaries across services
- Logging schemas: timestamp, severity, body, attributes; include trace_id
48) Collector Scaling and HA Patterns
- Per-node agents → regional gateways → multi-region aggregation
- HPA on queue length and CPU; surge upgrades for zero downtime
- Persistent queues for exporters (file_storage) to survive restarts
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true }, sending_queue: { enabled: true, num_consumers: 8, queue_size: 5000 }, retry_on_failure: { enabled: true } }
extensions:
file_storage: { directory: /var/lib/otel-collector/queue }
service:
extensions: [file_storage]
49) Backpressure and Retry Tuning
exporters:
prometheusremotewrite:
endpoint: http://mimir/api/v1/push
external_labels: { env: prod }
resource_to_telemetry_conversion: { enabled: true }
retry_on_failure: { enabled: true, initial_interval: 1s, max_interval: 30s, max_elapsed_time: 300s }
sending_queue: { enabled: true, num_consumers: 4, queue_size: 10000 }
50) Security Hardening for Collector
- mTLS between agents and gateway; client cert rotation
- RBAC and network policies; isolate from internet
- Secret management via mounted files or CSI; no secrets in configs
51) Correlating Logs ↔ Traces ↔ Metrics
- Ensure trace_id in logs; enable exemplars on histograms with trace_id
- Build panels that jump from p95 to example trace
- Use log queries filtered by trace_id from selected traces
52) Prometheus Remote Write Nuances
- Prometheus remote write requires cumulative temporality; reserve delta for backends that support it
- Beware staleness markers on instance restarts; use scrape health panels
- Align histogram buckets across services to combine correctly
53) DORA and Engineering Metrics via OTel
- Lead Time for Changes: derive from deploy events
- Deployment Frequency: counter per service/environment
- Change Failure Rate: SLO breach or rollback count / total deploys
- MTTR: incident open → resolved from alert acknowledgments
54) Business KPIs with OTel Metrics
- orders.created.count, payment.success.rate, signup.latency.p95
- Tag with tenant, region, channel; avoid personally identifiable data
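A short sketch of emitting these as OTel instruments; the instrument names follow the bullet above and the tag values are illustrative:
import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('payments-api')
const ordersCreated = meter.createCounter('orders.created.count')
const signupLatency = meter.createHistogram('signup.latency')

// Tag with low-cardinality business dimensions only; never user identifiers.
ordersCreated.add(1, { tenant: 'acme', region: 'us-east-1', channel: 'web' })
signupLatency.record(320, { region: 'us-east-1' })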
55) eBPF Host Metrics + OTel
- Integrate node exporter / eBPF agents; scrape via Collector
- Correlate CPU steal, disk latency with p95 spikes
56) On-Call Playbooks (Deep)
Missing Traces for Specific Route
- Check instrumentation: auto vs manual; headers propagated?
- Collector tail sampler thresholds too strict? relax temporarily
- Backend ingestion health; exporter retries/queue backlog
Cardinality Explosion Detected
- Identify top labels; drop via views/transform
- Replace user/session IDs with hashed or remove entirely
- Re-deploy libraries with conservative defaults
Slow Query in Dashboards
- Switch from raw to rollup; ensure recording rules
- Use exemplars to jump to trace and find bottleneck
57) Recording Rules and Rollups
groups:
- name: api-latency
interval: 1m
rules:
- record: job:http_server_duration_seconds_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, job))
58) Canned Dashboards (JSON Excerpts)
{
"title": "Service Map",
"panels": [
{"type":"nodeGraph","title":"Dependencies","options":{"show":true}}
]
}
59) Multi-Region Failover Drills
- Simulate region failure: route exporters to secondary
- Validate queue drain and data integrity on recovery
- Compare SLOs and data completeness across regions
60) SLA/Contract Reporting
- Provide monthly SLO reports with error budget usage
- Include outage timelines, root cause, and corrective actions
61) Common Pitfalls and Anti-Patterns
- Dynamic span names and high-cardinality labels
- Logging entire payloads; leaking secrets in attributes
- Over-sampling in dev but under-sampling in prod
Mega FAQ (701–1100)
- How to enforce resource attributes across services? Collector transform processor upserts; test in CI.
- Is baggage safe for PII? Avoid PII; keep light context like tenant or plan.
- Should I sample errors at 100%? Often yes; cap if volume is extreme.
- What ratio for head sampling? Start at 1–10%; adjust based on cost and utility.
- Can we enrich spans with DB explain plans? Yes, sparingly; hide behind debug flags; avoid in prod by default.
- Log volumes too high? Reduce the log level; route business logs to analytics separately.
- Should I trace health endpoints? No; drop them via a filter, they are wasteful.
- How to detect missing exemplars? Dashboard panel with exemplar presence rate.
- Alert on exporter failures? Yes; queue depth and retry counts.
- Versioning instrumentation libraries? Pin and roll out gradually; note breaking changes.
- Do I need OTel for cron jobs? Yes; trace job start/end and success/failure.
- Compress OTLP? Enable gzip where supported.
- Data retention best practice? Metrics long-term at rollups; traces short-term at high fidelity.
- Alert duplicate suppression? Configure Alertmanager grouping and inhibition.
- Who owns dashboards? Service teams own service dashboards; the platform maintains templates.
- Golden dashboard tests? Load synthetic data; validate panels and alerts.
- Security scans on the Collector image? Yes; treat it as critical path.
- Multi-tenant noise isolation? Per-tenant pipelines and quotas.
- Track deploy impact automatically? Annotate metrics with release version; timeline panels.
- Can we push metrics from the browser? Limited; prefer tracing plus backend metrics.
- OpenMetrics vs OTLP for metrics? Both are valid; choose by ecosystem and consolidation goals.
- Are exemplars expensive? Minimal overhead; store a limited number of exemplars per bucket.
- Correlate incidents with changes? Use release annotations; overlay with SLO charts.
- How to run the Collector at the edge? Lightweight config; forward to a gateway; persistent queues.
- Avoid duplicate spans across proxies? Disable double instrumentation; check for proxies adding spans.
- Detect sampling bias? Compare sampled vs total rates; tune policies.
- Anonymize IPs in logs? Hash or truncate; comply with privacy requirements.
- Exporter out-of-order errors? Ensure monotonic histograms; reset counters properly.
- Reduce cardinality in RED panels? Aggregate by route template; avoid query params.
Final: measure, test, and prune relentlessly.
62) Advanced Language Instrumentation Recipes
62.1 Node.js Manual Spans and Attributes
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('payments')
await tracer.startActiveSpan('charge.create', async (span) => {
try {
span.setAttribute('payment.method', 'card')
span.setAttribute('tenant.id', hash(tenantId))
const res = await charges.create(payload)
span.setStatus({ code: SpanStatusCode.OK })
return res
} catch (e) {
span.recordException(e as Error)
span.setStatus({ code: SpanStatusCode.ERROR })
throw e
} finally { span.end() }
})
62.2 Go Context Propagation (HTTP → Kafka)
ctx, span := tracer.Start(ctx, "publish.order")
headers := make([]kafka.Header, 0)
prop := propagation.TraceContext{}
carrier := propagation.MapCarrier{}
prop.Inject(ctx, carrier)
for k, v := range carrier { headers = append(headers, kafka.Header{Key: k, Value: []byte(v)}) }
producer.WriteMessages(ctx, kafka.Message{Key: []byte(orderID), Headers: headers, Value: payload})
span.End()
62.3 Python Async Tasks (Celery/Arq)
ctx = baggage.set_baggage("tenant", tenant)
with tracer.start_as_current_span("job.process", context=ctx) as span:
span.set_attribute("job.type", job_type)
do_work()
62.4 Java Custom Attributes (Spring)
Span span = Span.current();
span.setAttribute("user.plan", user.getPlan());
62.5 .NET Enrich Handlers
builder.Services.AddOpenTelemetry().WithTracing(b => b
    .AddAspNetCoreInstrumentation(o => {
        o.EnrichWithHttpRequest = (activity, request) => activity.SetTag("http.request_id", request.Headers["X-Request-ID"].ToString());
    }));
63) Exporters Matrix and Tips
- OTLP gRPC: high performance, binary
- OTLP HTTP: firewall-friendly
- Prometheus: pull-based metrics; use remote write for long-term storage
- Logs: OTLP → Loki/Elastic; ensure trace_id in payload
64) Sampling Strategies Compared
- Head random: simple, uniform view
- Tail policy-based: keep errors/slow; lower cost
- Dynamic adaptive: adjust to targets (keep 100% errors, 10% normal)
- Hybrid: head 10% + tail errors/slow
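For the head portion of the hybrid approach, a sketch of the SDK-side sampler (10% parent-based ratio shown); the Collector tail policies below then keep errors, slow spans, and important routes:
import { NodeSDK } from '@opentelemetry/sdk-node'
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

// Head-sample ~10% of new traces; child spans follow the parent's decision.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
})
sdk.start()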
processors:
tail_sampling:
policies:
- name: keep-errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: keep-slow
type: latency
latency: { threshold_ms: 400 }
- name: keep-important-routes
type: string_attribute
string_attribute: { key: http.route, values: ["/checkout","/login"], enabled_regex_matching: false }
65) Logs Pipelines with Redaction
processors:
attributes/logs-redact:
actions:
- key: http.request.header.cookie
action: delete
- key: user.email
action: hash
exporters: { otlphttp/logs: { endpoint: http://loki:4318 } }
66) Governance and Ownership
- Owners: each service owns its dashboards/alerts; platform owns shared collectors and templates
- Change policy: dashboards and alerts reviewed in PRs with code owners
- Weekly ops: SLO review, error budget status, toil tracking
67) Recording Rules Library
- record: service:http_requests:rate1m
expr: sum(rate(http_server_requests_total[1m])) by (service)
- record: service:http_errors:rate5m
expr: sum(rate(http_server_errors_total[5m])) by (service)
- record: service:http_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))
68) Example SLO Documents (Templates)
Service: Checkout API
SLIs: Availability (1 - 5xx/total), p95 latency < 250ms
SLOs: 99.9% monthly, p95 < 250ms
Error Budget: 43.2m/mo
Policies: fast/slow burn, freeze on 2x over 1h
69) Synthetic Journeys
- Login → Search → Add to Cart → Checkout
- Tag spans synthetic=true; exclude from user-facing metrics
- Alert if synthetic fails 3 times consecutively
70) Cost Guardrails
- Max metrics series per service: 50k
- Max attributes per span: 32 (drop overage)
- Alert on cardinality spikes; gate deploys if threshold exceeded
71) Data Retention Tiers
- Traces: 3d full, 14d sampled, 90d summaries
- Metrics: 15m raw, 1h rollup (30d), 6h rollup (1y)
- Logs: 7d hot, 90d warm, 1y archive
72) Common Dashboard Panels (Catalog)
- API RED
- Dependency latency (DB/Cache/External)
- Error taxonomy (client/server/dependency)
- Saturation (CPU/mem/threads)
- Release markers and regression panels
73) Example Policy: Attribute Dropper
processors:
attributes/drop-high-card:
actions:
- key: http.user_agent
action: delete
- key: user.id
action: delete
74) Resilience Tests
- Kill collector pod: exporters queue and recover
- Block backend for 2 minutes: retry queue holds; no data loss
- Spike traffic 5x: HPA scales gateway; no alert floods
75) Edge/IoT Notes
- Lightweight collectors; batch and forward when online
- Use exponential backoff; local ring buffers
Mega FAQ (1101–1500)
- Should I correlate CI/CD events? Yes; annotate dashboards and traces with the release version.
- Best way to detect regressions? Golden traces, synthetic checks, and SLO burn alerts.
- How to quantify noise? Track pages/tickets per week and per team; reduce by policy.
- When to shard collectors? At CPU/memory or queue saturation; shard by service or tenant.
- Onboarding a new service? Templates for instrumentation, dashboard, alerts, and the SLO doc.
- Does OTLP need TLS? Yes in prod; mTLS recommended.
- Attribute limits? Enforce via processors; reject oversized payloads.
- DB connection pools as SLI? Yes; expose pool metrics; alert on saturation.
- Frontend LCP/CLS? Export RUM metrics; correlate with backend p95.
- Log sampling? Sample non-error logs; keep error logs at a higher rate.
- Can we aggregate logs into span events? For critical paths; it reduces log volume.
- TraceId collisions? Negligible with proper libraries; don't roll your own.
- Infra-only spans? Avoid; focus on app/business spans.
- Prometheus remote write retries? Tune backoff and queue; watch body size limits.
- Validate buckets? Compare p95/p99 error; align across services.
- Can I push metrics from Lambdas? Yes, via OTLP; batch and avoid cold-start overhead.
- Detect N+1 queries? DB spans clustered by route; alert on spikes.
- How to roll out dashboards? Versioned JSON; validate in CI; promote.
- Alert ownership? Service team on-call; platform for shared infra.
- SLO debt? Track error budget burn and backlog; pause features when over budget.
- Secure collectors? RBAC, network policies, mTLS, and no public ingress.
- Should I drop IP addresses? Yes, in logs where privacy laws apply; hash or truncate.
- Combine OTLP and vendor agents? Prefer OTLP everywhere; bridge if needed.
- Is logging necessary if tracing exists? Yes, for details and compliance; keep it structured and lean.
- What about GraphQL? Span per resolver or operation; label fields carefully.
- Snowballing costs? Watch cardinality and duplicate instrumentation.
- Business SLOs vs tech SLOs? Track both; business SLOs reflect user outcomes.
- Post-incident improvements? Add panels, alerts, and runbook steps; validate fixes.
- Should I trace streaming? Yes; use messaging conventions and span links.
Final: observability is a product; own it.
76) Message-Driven and Evented Systems
- Propagate context through headers: traceparent, tracestate, baggage
- Use span links when processing batches or retries
- Model consumer spans with messaging.operation=process; publisher spans with publish
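A consumer-side sketch in Node.js: extract the propagated context from message headers (assumed here to be a plain string map) and start a process span under it:
import { context, propagation, trace, SpanKind } from '@opentelemetry/api'

function handleMessage(headers: Record<string, string>, body: Buffer) {
  // Restore the producer's context from the traceparent/tracestate headers.
  const parentCtx = propagation.extract(context.active(), headers)
  const tracer = trace.getTracer('orders-worker')
  const span = tracer.startSpan('orders process', {
    kind: SpanKind.CONSUMER,
    attributes: { 'messaging.system': 'kafka', 'messaging.operation': 'process' },
  }, parentCtx)
  // ... process the message, then end the span ...
  span.end()
}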
processors:
transform/messaging:
trace_statements:
- context: span
statements:
- set(attributes["messaging.system"], "kafka") where attributes["messaging.system"] == nil
- set(attributes["messaging.operation"], "process") where attributes["messaging.operation"] == nil
77) gRPC, GraphQL, and Streaming
- gRPC: use interceptors; name spans by method; include status
- GraphQL: span per operation; avoid field-level high cardinality
- Streaming: long-lived spans with events or chunked child spans
78) Recording Rules for SLO Reports
groups:
- name: slo
rules:
- record: service:sli_availability:ratio
expr: 1 - (sum(rate(http_server_errors_total[5m])) by (service)/sum(rate(http_server_requests_total[5m])) by (service))
- record: service:sli_latency_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))
79) Data Contracts for Observability
- Required: service.name, deployment.environment, http.method, http.route, status
- Prohibited: PII (email, phone, exact IP where restricted)
- Stability: span names must be stable across releases
80) Conformance Tests in CI
- name: conformance
run: |
otel-lint --config .otel-lint.yaml src/**
promtool check rules rules/*.yaml
jq . dashboards/*.json > /dev/null
81) Collector Config Patterns (Per-Tenant)
connectors: { }
processors:
attributes/tenant_a:
actions: [ { key: tenant, value: A, action: upsert } ]
attributes/tenant_b:
actions: [ { key: tenant, value: B, action: upsert } ]
service:
pipelines:
traces/tenant_a: { receivers: [otlp], processors: [attributes/tenant_a, batch], exporters: [otlp/tempo] }
traces/tenant_b: { receivers: [otlp], processors: [attributes/tenant_b, batch], exporters: [otlp/tempo] }
82) Dashboards: Drill-Down Workflows
- Start at RED; jump to route panel → exemplar → trace
- From trace, pivot to logs via trace_id filter
- From logs, identify error class and link to runbooks
83) Incident Review Template
- Timeline with trace screenshots and SLO charts
- Root cause with spans and dependencies
- Fixes: code, infra, and alert tuning
- Follow-ups: tests, dashboards, and docs
84) Privacy Impact in Observability
- Data minimization: only metadata needed for operations
- User controls: opt-out for RUM; anonymized IPs
- Audit: evidence of redaction and access control reviews
85) Cost Guardrail Policies in Pipelines
processors:
filter/drop-health:
traces:
span:
- 'attributes["http.target"] == "/healthz"'
transform/drop-noisy:
trace_statements:
- context: span
statements:
- delete_key(attributes, "http.user_agent")
86) Multi-Cloud Export Patterns
exporters:
otlp/aws: { endpoint: https://otlp.amp.aws, headers: { Authorization: ${AWS_TOKEN} } }
otlp/azure: { endpoint: https://otlp.monitor.azure.com, headers: { Authorization: ${AZ_TOKEN} } }
otlp/gcp: { endpoint: https://otlp.googleapis.com, headers: { Authorization: ${GCP_TOKEN} } }
87) Blue/Green Release Validation Panels
{
"title": "Blue vs Green",
"panels": [
{"type":"timeseries","title":"p95 Blue","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='blue'}[5m])) by (le))"}]},
{"type":"timeseries","title":"p95 Green","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='green'}[5m])) by (le))"}]}
]
}
88) Golden Path Templates
- New Service Template: instrumentation boilerplate + dashboards + alerts
- SLO Template: SLIs, targets, burn alerts, ownership
- Collector Template: agent + gateway, tail sampling, exporters
89) Security Benchmarks for OTel
- No open collectors to the internet
- mTLS between all exporters and gateways
- Periodic scans and SBOM for collector images
90) Future: OTel Profiles/Continuous Profiling
- Integrate CPU/heap profiles with traces for deep RCA
- Correlate profile samples to spans via exemplars
Mega FAQ (1501–1700)
- How many buckets are too many? If query times suffer or memory spikes; start with 10–20, aligned across services.
- Should I export 100% of logs? No; sample and structure; keep error logs, reduce info noise.
- How do I keep trace names stable? Template: VERB + route; avoid IDs; enforce via linting.
- Are span events better than logs? For critical-path details, yes; still keep structured logs for breadth.
- What SLOs for the control plane? Collector uptime 99.9%; exporter failure rate < 0.5%.
- Can dashboards be versioned? Yes; JSON in the repo; review via PRs; promote via GitOps.
- How to cut costs fast? Drop health-check spans, reduce labels, tail-sample aggressively.
- p99 vs p95? Start with p95 for stability; add p99 for critical flows.
- Should I sample frontend traces? Yes; e.g., 5–10% with a bias toward errors.
- Drop big payloads? Avoid logging payloads; redact and summarize.
- Exporter TLS errors? Check certs/CA, time skew, and SNI.
- Multi-tenant query controls? Label-based isolation and query guards.
- Are histograms better than summaries? Yes; mergeable across instances and exemplar-friendly.
- Tracing cron jobs? Yes; trace the run plus child tasks; alert on failures.
- How to detect duplicate spans? Dedup in analysis; fix double instrumentation (proxy + lib).
- Managed vendor vs self-host? Consider staff/time; self-host for control, managed for speed.
- Should I expose dashboards publicly? No; protect with SSO; export reports when needed.
- Store PII in logs? Avoid; use privacy engineering and redaction.
- Link traces to support tickets? Store trace_id in ticket metadata.
Final advice: own observability as a product.
91) Minimal Adoption Playbook
- Week 1: instrument 1 critical service (traces+metrics)
- Week 2: dashboards and SLOs; tail sampling
- Week 3: logs correlation; alerts; runbooks
- Week 4: template and scale to next 5 services
Micro FAQ (1701–1740)
- Lock down OTLP endpoints? Allow only internal networks; mTLS required.
- Merge multi-language services? OTLP normalizes; enforce semantic conventions.
- Spike in unknown routes? Missing route templates; fix instrumentation.
- Exporter backpressure signals? Queue depth, retry rate, and exporter errors.
- Per-tenant error budgets? Attributes plus grouping; dashboards and Alertmanager routes.
- Does OTel replace APM? OTel is the standard; many APMs ingest OTLP.
- Keep span events small? Yes; summarize; avoid large arrays.
- Blue/green SLO gates? Block the switch if SLO deltas exceed the threshold.
- Collector on Windows? Supported; align with service configs.
Final: iterate, measure, improve.
Micro FAQ (1741–1760)
- Alert dedup across regions? Use group labels and inhibit duplicates.
- Trace context in Kafka headers? Yes; traceparent and tracestate; baggage optional.
- Track deploy impact panels? Release timeline plus p95/error overlays.
- Enforce route templates? Lint and CI tests; deny deploys on violations.
- Validate storage health? Backend write/read SLOs and error panels.
- Merge service graphs across teams? Use resource attributes and filters per team.
- Cost showback by signal? Ingest bytes and series per team; dashboards.
- Isolate collector crashes? Separate deployments per zone; circuit breakers.
- OTel and feature flags? Tag spans with flag state; analyze impact.
Done.
92) Reference Links and Learning Path
- Start: auto-instrument one service; add dashboard and SLO
- Grow: tail sampling, logs correlation, team ownership
- Mature: multi-region, cost guardrails, privacy program
Micro FAQ (1761–1800)
- Per-route SLO exceptions? Yes, for non-critical routes; document and monitor.
- Detect leakage of PII? Scan attributes/logs; block keys; alert.
- Inventory of metrics? Generate from metadata; clean up unused series.
- Can I export to multiple backends? Yes; multiple exporters per pipeline.
- Collector config drift? GitOps; diff and alert on drift.
- Alert audit? Track acknowledgements and response times.
- Mimir/Thanos retention tiers? Configure downsampling and object storage.
- Span enrichment from the environment? resourcedetection processor; env/system/k8s detectors.
- High churn in series? Avoid dynamic labels; use recording rules.
Final: instrument, correlate, and iterate.
Micro FAQ (1801–1810)
- Keep rollout safety? Gate by SLOs and error budget burn.
- Sampling metrics too? Prefer aggregation; avoid randomly sampling metrics.
- Alert descriptions? Include runbook links and dashboards.
- Policies as code for alerts? Yes; store alert JSON/YAML in the repo.
Done.