Observability with OpenTelemetry: Complete Implementation Guide (2025)
Executive Summary
This guide shows how to implement OpenTelemetry (OTel) across services: resource attributes, instrumentation for traces/metrics/logs, Collector pipelines, tail-based sampling, span metrics, semantic conventions, dashboards, alerts, SLOs, and cost controls.
1) Architecture Overview
graph TD
A[Apps/Workers] -->|OTLP| B[OTel Collector]
B --> C[Traces Backend]
B --> D[Metrics Backend]
B --> E[Logs Backend]
C --> F[Dashboards]
D --> F
E --> F
- Signals: traces, metrics, logs; correlate via trace_id and resource attributes
- Agent vs Gateway: sidecar/daemonset agent → gateway → backends
- Multi-tenant: resource.service.* and tenant labels; isolated pipelines per tenant
2) Resource Attributes and Semantic Conventions
service:
name: payments-api
namespace: prod
version: 2.3.1
telemetry:
sdk:
name: opentelemetry
version: 1.27.0
cloud:
provider: aws
region: us-east-1
- Use stable attribute keys (service.name, service.version, deployment.environment)
- Adopt HTTP, DB, Messaging semantic conventions (v1.23+)
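The same resource expressed in Node.js with plain string keys (values mirror the illustrative YAML above); other SDKs take the equivalent key/value map:
import { Resource } from '@opentelemetry/resources'

const resource = new Resource({
  'service.name': 'payments-api',
  'service.namespace': 'prod',
  'service.version': '2.3.1',
  'deployment.environment': 'prod',
  'cloud.provider': 'aws',
  'cloud.region': 'us-east-1',
})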
3) Tracing: Instrumentation Patterns
3.1 Node.js (Express)
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_TRACES_URL }),
instrumentations: [getNodeAutoInstrumentations()],
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'payments-api',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'prod',
})
})
sdk.start()
3.2 Go (Gin)
exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithEndpoint(os.Getenv("OTLP_ENDPOINT")), otlptracegrpc.WithInsecure())
if err != nil {
    log.Fatalf("failed to create OTLP trace exporter: %v", err)
}
tracerProvider := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exp),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("orders-api"),
semconv.DeploymentEnvironment("prod"),
)),
)
otel.SetTracerProvider(tracerProvider)
3.3 Python (FastAPI)
import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
provider = TracerProvider(resource=Resource.create({"service.name": "billing-api", "deployment.environment": "prod"}))
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTLP_TRACES_URL')))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
3.4 Java (Spring Boot)
// Use opentelemetry-javaagent: -javaagent:opentelemetry-javaagent.jar -Dotel.exporter.otlp.endpoint=$OTLP
3.5 .NET (ASP.NET)
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(b => b
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddOtlpExporter());
4) Metrics: Instruments and Views
- Use histograms for latency with explicit buckets; counters for throughput; gauges for resources
- Configure views to control aggregation temporality and buckets
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc'
const exporter = new OTLPMetricExporter({ url: process.env.OTLP_METRICS_URL })
const meterProvider = new MeterProvider({
readers: [new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })],
})
const meter = meterProvider.getMeter('payments-api')
const httpLatency = meter.createHistogram('http.server.duration')
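The views bullet above in practice: a minimal sketch (bucket boundaries are illustrative) that pins explicit histogram buckets for the latency instrument so aggregations line up across services; pass the same readers as above when wiring it for real.
import { MeterProvider, View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics'

// Freeze bucket boundaries (seconds, illustrative) for the http.server.duration histogram
// so histograms can be merged consistently across services.
const viewProvider = new MeterProvider({
  views: [
    new View({
      instrumentName: 'http.server.duration',
      aggregation: new ExplicitBucketHistogramAggregation([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]),
    }),
  ],
})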
5) Logs: OTel Log Signal
- Structure logs with attributes (trace_id, span_id, severity, body)
- Export via OTLP to log backend (e.g., Loki/Elastic)
- Attach resource attrs for tenant and environment
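A minimal Node.js sketch of emitting a structured, trace-correlated log record via OTLP; it assumes the @opentelemetry/sdk-logs and OTLP logs exporter packages (still marked experimental in some releases) are installed:
import { LoggerProvider, BatchLogRecordProcessor } from '@opentelemetry/sdk-logs'
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc'
import { SeverityNumber } from '@opentelemetry/api-logs'

const loggerProvider = new LoggerProvider()
loggerProvider.addLogRecordProcessor(new BatchLogRecordProcessor(new OTLPLogExporter({ url: process.env.OTLP_LOGS_URL })))
const logger = loggerProvider.getLogger('payments-api')

// trace_id/span_id are attached automatically when this is emitted inside an active span context.
logger.emit({ severityNumber: SeverityNumber.ERROR, severityText: 'ERROR', body: 'charge failed', attributes: { 'tenant.id': 'acme' } })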
# Collector logs pipeline (example)
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors:
batch: {}
exporters:
otlphttp/logs: { endpoint: http://loki:4318 }
pipelines:
logs/default:
receivers: [otlp]
processors: [batch]
exporters: [otlphttp/logs]
6) Collector: Reference Pipelines
receivers:
otlp:
protocols: { http: {}, grpc: {} }
prometheus:
config:
scrape_configs:
- job_name: 'k8s'
static_configs: [{ targets: ['node-exporter:9100'] }]
processors:
batch: {}
resourcedetection: { detectors: [env, system, k8snode] }
attributes:
actions:
- key: deployment.environment
value: prod
action: upsert
filter/traces:
traces:
span:
- 'attributes["http.target"] == "/healthz"'
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: latency
type: latency
latency: { threshold_ms: 500 }
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
otlphttp/logs: { endpoint: http://loki:4318 }
service:
pipelines:
traces:
receivers: [otlp]
processors: [resourcedetection, attributes, filter/traces, tail_sampling, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp, prometheus]
processors: [resourcedetection, batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [resourcedetection, batch]
exporters: [otlphttp/logs]
7) Tail-Based Sampling
- Sample by error, latency, and key routes/users; keep exemplars for important classes
- Route sampled traces to cheaper storage after retention window
8) Span Metrics and SLOs
- Derive RED metrics (Rate, Errors, Duration) from spans
- Compute SLIs: availability (non-5xx), latency (p95), and error rate
- Map SLOs to alerts and error budgets
# Collector spanmetrics processor (example)
processors:
spanmetrics:
metrics_exporter: prometheusremotewrite
dimensions: [http.method, http.route, deployment.environment]
9) Exemplars and Correlation
- Attach exemplars with trace_id to histograms
- Use trace_id in logs; enable exemplars in dashboards for drill-down
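The pattern in code: record the latency measurement while the request span is still active, so an exemplar-capable SDK and backend can attach the current trace_id to the histogram bucket (this sketch reuses the httpLatency instrument from section 4):
import { trace } from '@opentelemetry/api'

const tracer = trace.getTracer('payments-api')
tracer.startActiveSpan('GET /checkout', (span) => {
  const start = Date.now()
  // ... handle the request ...
  // Recording inside the active span lets exemplar-enabled pipelines link this sample to the trace.
  httpLatency.record(Date.now() - start, { 'http.route': '/checkout' })
  span.end()
})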
10) Baggage and Trace State
- Baggage: key/value propagation for business dimensions (e.g., tenant)
- trace_state: vendor-specific hints; avoid sensitive data
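A minimal Node.js baggage sketch (the tenant key name is illustrative):
import { context, propagation } from '@opentelemetry/api'

// Set a tenant entry on baggage, then read it back anywhere downstream in the same context.
const bag = propagation.createBaggage({ 'tenant.id': { value: 'acme' } })
const ctx = propagation.setBaggage(context.active(), bag)
context.with(ctx, () => {
  const tenant = propagation.getBaggage(context.active())?.getEntry('tenant.id')?.value
  console.log('tenant from baggage:', tenant)
})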
11) Dashboards and Alerts
# Error rate
sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))
# p95 latency (histogram)
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, route))
{
"title": "Payments SLO",
"panels": [
{"type":"stat","title":"Availability","targets":[{"expr":"1 - (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
]
}
12) Security and Privacy
- Redact PII in spans/logs; sampling rules to exclude sensitive routes
- TLS everywhere; auth for Collector endpoints; multi-tenant isolation
- Least privilege for exporters; segregate environments
13) Cost Controls
- Drop health-check spans; reduce attributes cardinality
- Tail-based sampling; delta temporality for metrics
- Downsample old data; retention tiers
14) Testing and Validation
- Golden traces: deterministic flows validated in CI
- Contract tests: check resource attrs, span names, and status codes
- Load tests: ensure Collector and backend capacity
15) Operations Runbooks
Symptom: High Collector CPU
- Action: enable batch, reduce processors, scale replicas, profile receivers
Symptom: Missing traces
- Action: verify headers, exporter URLs, sampling thresholds, and gateway health
Symptom: High cardinality metrics
- Action: drop labels, re-define views, and adopt exemplars selectively
Related Posts
- Kubernetes Cost Optimization: FinOps Strategies (2025)
- API Security: OWASP Top 10 Prevention Guide (2025)
- GitOps: ArgoCD/Flux Deployment Strategies (2025)
Call to Action
Need help implementing OpenTelemetry end-to-end? We instrument, build pipelines, wire dashboards, and operationalize SLOs.
Extended FAQ (1–160)
- Head vs tail sampling? Head is uniform and cheap; tail captures interesting traces based on outcome.
- How many buckets for latency histograms? Start with a power-of-two or decile strategy; align buckets across services.
- Should logs also carry trace_id? Yes; it unlocks cross-signal navigation.
- How to avoid cardinality explosions? Drop unique IDs and high-cardinality labels; use views.
- How to measure SLO error budget burn? Burn-rate alerts based on SLI windows (5m/1h, 30m/6h).
- Can I use OTLP over HTTP? Yes; OTLP/HTTP is supported and firewall-friendly.
- Should I use exemplars? Yes, for quick trace drill-down from metrics panels.
- How to trace message queues? Use messaging conventions; propagate context in message headers.
- What about front-end tracing? Use web auto-instrumentations; propagate headers to the backend.
- Multi-tenant isolation? Separate pipelines and auth; resource attributes for tenant id.
... (continue with practical Q/A up to 160 on instrumentation, pipelines, sampling, metrics, logs, privacy, security, cost, testing, and operations)
16) Kubernetes: Auto-Instrumentation and Deployment
apiVersion: v1
kind: Namespace
metadata: { name: observability }
---
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: otel-agent, namespace: observability }
spec:
selector: { matchLabels: { app: otel-agent } }
template:
metadata: { labels: { app: otel-agent } }
spec:
serviceAccountName: otel-agent
containers:
- name: agent
image: otel/opentelemetry-collector:0.96.0
args: ["--config=/conf/agent.yaml"]
volumeMounts: [{ name: conf, mountPath: /conf }]
volumes:
- name: conf
configMap: { name: otel-agent-config }
---
apiVersion: v1
kind: ConfigMap
metadata: { name: otel-agent-config, namespace: observability }
data:
agent.yaml: |
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors: { batch: {} }
exporters: { otlp: { endpoint: otel-gateway:4317, tls: { insecure: true } } }
service:
pipelines:
traces: { receivers: [otlp], processors: [batch], exporters: [otlp] }
17) Collector: Helm Values (Gateway)
mode: deployment
replicaCount: 3
config:
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
processors:
batch: {}
tail_sampling:
decision_wait: 5s
policies:
- name: errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: latency
type: latency
latency: { threshold_ms: 800 }
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
pipelines:
traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp/tempo] }
metrics: { receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite] }
logs: { receivers: [otlp], processors: [batch], exporters: [loki] }
18) Backends and Storage
- Traces: Tempo/Jaeger/Elastic APM; choose based on scale and features
- Metrics: Prometheus + Mimir/Thanos for long-term; OTLP metric ingest
- Logs: Loki/Elastic; enable trace_id correlation
- Storage: object storage for cheap, durable retention
19) Semantic Conventions Cheatsheet
HTTP
- http.method, http.route, http.target, http.status_code
DB
- db.system (postgres, mysql), db.statement (redacted), db.name
Messaging
- messaging.system (kafka, rabbitmq), messaging.operation (publish, process)
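For manual spans, the same keys are set as plain string attributes; a short sketch for a DB client span (values are illustrative and the statement is already parameterized/redacted):
import { trace, SpanKind, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('orders-api')
const span = tracer.startSpan('SELECT orders', { kind: SpanKind.CLIENT })
span.setAttribute('db.system', 'postgres')
span.setAttribute('db.name', 'orders')
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = ?') // parameters redacted
span.setStatus({ code: SpanStatusCode.OK })
span.end()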
20) RED/USE Dashboards (Templates)
{
"title": "RED Overview",
"panels": [
{"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
{"type":"stat","title":"Error %","targets":[{"expr":"sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
]
}
{
"title": "USE (Resources)",
"panels": [
{"type":"timeseries","title":"CPU","targets":[{"expr":"avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)"}]},
{"type":"timeseries","title":"Memory","targets":[{"expr":"avg(container_memory_working_set_bytes) by (pod)"}]}
]
}
21) Alerts Catalog
- alert: ErrorBudgetBurnFast
expr: (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))) > 0.05
for: 5m
labels: { severity: critical }
annotations: { summary: "Error budget burning fast" }
- alert: LatencyP95High
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 10m
labels: { severity: warning }
22) Privacy and Redaction
- Never record secrets or PII in span attributes
- Use processors.attributes to drop/redact sensitive keys (e.g., authorization)
- GDPR/CCPA: respect retention and access; logs exported via portal with filters
processors:
attributes/redact:
actions:
- key: http.request.header.authorization
action: delete
- key: db.statement
action: update
value: "REDACTED"
23) Cost Optimization Playbook
- Adjust export intervals and histogram buckets
- Enable tail-based sampling and drop health-check spans
- Downsample old metrics; archive traces to object storage after N days
- Deduplicate labels; avoid high-cardinality user identifiers
24) Testing Harness
// Example Jest test: ensure spans are created and attributes set
it('creates span with route', async () => {
const span = tracer.startSpan('GET /orders')
span.setAttribute('http.route', '/orders')
span.setStatus({ code: SpanStatusCode.OK })
span.end()
// assert via in-memory exporter in test env
})
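A sketch of the in-memory exporter wiring referenced in the comment above, using classes from @opentelemetry/sdk-trace-base so the test can assert on finished spans:
import { BasicTracerProvider, InMemorySpanExporter, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'

const memoryExporter = new InMemorySpanExporter()
const testProvider = new BasicTracerProvider()
testProvider.addSpanProcessor(new SimpleSpanProcessor(memoryExporter))
const tracer = testProvider.getTracer('test')

// After the code under test runs:
const spans = memoryExporter.getFinishedSpans()
expect(spans[0].name).toBe('GET /orders')
expect(spans[0].attributes['http.route']).toBe('/orders')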
25) Runbooks (Detailed)
Collector CrashLoop
- Check config syntax; reduce pipeline complexity; scale memory
Drops in Traces
- Verify client/exporter endpoints; tail sampling thresholds; gateway health
High Metrics Cardinality
- Identify top labels; rework views; reduce labels in libraries
26) Multi-Tenancy and Access Control
- Gate Collector endpoints with auth (mTLS or tokens per tenant)
- Tag data with tenant attributes; separate export paths per tenant when needed
- Dashboards and alert routes per tenant/team
27) SLO Lifecycle
- Define SLIs, set targets, and error budgets
- Create burn-rate alerts (fast/slow)
- Weekly review: trend SLOs and adjust remediation efforts
28) Cloud-Specific Notes
AWS: OTLP via NLB; IAM roles for EC2/EKS; managed Prometheus/AMP and OpenSearch
Azure: Monitor + managed Grafana; Private Link for OTLP
GCP: Managed Service for Prometheus; Cloud Trace and Logging OTLP bridges
29) Example: Payments Service Instrumentation Walkthrough
- Trace incoming HTTP → DB call → queue publish → downstream consumer
- Annotate spans with order_id (hashed), tenant_id, and route
- Emit span metrics for per-route RED metrics
- Dashboard drill-down from p95 to exemplars
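A condensed Node.js sketch of that flow with nested spans; chargeCard and publishOrder are hypothetical helpers standing in for the DB call and the queue publish:
import { trace, SpanKind } from '@opentelemetry/api'

const tracer = trace.getTracer('payments-api')
await tracer.startActiveSpan('POST /payments', async (httpSpan) => {
  httpSpan.setAttribute('http.route', '/payments')
  httpSpan.setAttribute('tenant.id', hashedTenantId) // hashed, never raw
  await tracer.startActiveSpan('INSERT payments', { kind: SpanKind.CLIENT }, async (dbSpan) => {
    await chargeCard() // hypothetical DB/PSP call
    dbSpan.end()
  })
  await tracer.startActiveSpan('publish payment.created', { kind: SpanKind.PRODUCER }, async (pubSpan) => {
    await publishOrder() // hypothetical queue publish
    pubSpan.end()
  })
  httpSpan.end()
})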
Extended FAQ (161–320)
- Should I export metrics over OTLP or Prometheus? Prometheus is ubiquitous; OTLP unifies pipelines; use either or both via the Collector.
- How do I handle retries on exporters? Enable retry queues; tune backoff; monitor exporter failures.
- What about cold starts in serverless? Record cold-start spans; use separate latency buckets.
- How to instrument gRPC? Use gRPC instrumentations; propagate context via metadata.
- Do I need exemplars? Yes; pair metrics with trace drill-down for fast RCA.
- Reduce labels? Drop user IDs and session IDs; keep route and method.
- Alert fatigue? Use SLO burn-rate alerts; deprecate noisy static thresholds.
- What about storage costs? Downsample metrics; tail sampling; tiered trace retention.
- Logs vs spans? Prefer spans for request flow; logs for details and errors.
- How to verify semantic conventions? Unit tests plus linters; library defaults for common protocols.
- Should I use histograms for DB? Yes; db.client.operation.duration with buckets per operation.
- Correlate front-end and backend? Propagate W3C trace-context headers.
- Collector HPA triggers? CPU plus exporter queue depth.
- Multi-region? Regional collectors; aggregate to a global view.
- Zero-trust? mTLS, authn for exporters, and network policies.
- Can I push logs via OTLP? Yes; OTLP logs are supported; ensure backend compatibility.
- Event logs vs audit logs? Split pipelines and retention policies.
- How to add business metrics? Counter/histogram instruments; views for dimensions.
- Validate dashboards? Golden dashboards tested with synthetic traffic.
Final note: instrument early, iterate often.
30) Advanced Collector Processors
processors:
transform/traces:
trace_statements:
- context: span
statements:
- set(attributes["deployment.environment"], "prod") where attributes["deployment.environment"] == nil
- keep_keys(attributes, ["http.method","http.route","http.status_code","db.system","messaging.system","deployment.environment"])
groupbyattrs:
keys: [service.name, deployment.environment]
memory_limiter:
check_interval: 5s
limit_mib: 1024
spanmetrics:
metrics_exporter: prometheusremotewrite
dimensions: [http.method, http.route, http.status_code]
31) OpenTelemetry Operator (Kubernetes)
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata: { name: otel-collector, namespace: observability }
spec:
mode: deployment
config: |
receivers: { otlp: { protocols: { grpc: {}, http: {} } } }
processors: { batch: {} }
exporters: { otlp: { endpoint: tempo:4317, tls: { insecure: true } } }
service: { pipelines: { traces: { receivers: [otlp], processors: [batch], exporters: [otlp] } } }
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata: { name: default, namespace: prod }
spec:
exporter:
endpoint: http://otel-gateway:4318
propagators: [tracecontext, baggage, b3]
sampler: { type: parentbased_traceidratio, argument: "0.1" }
32) Browser and Mobile Instrumentation
32.1 Web
// Bundle with your web app; these packages do not expose globals via a plain <script> tag
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { W3CTraceContextPropagator } from '@opentelemetry/core'

const provider = new WebTracerProvider()
provider.addSpanProcessor(new SimpleSpanProcessor(new OTLPTraceExporter({ url: '/v1/traces' })))
provider.register({ propagator: new W3CTraceContextPropagator() })
32.2 Mobile
iOS/Android SDKs support OTLP; propagate context to backend via headers.
33) Databases and Messaging
receivers:
postgresql:
endpoint: postgres:5432
transport: tcp
rabbitmq:
endpoint: rabbitmq:15672
exporters:
prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
service:
pipelines:
metrics/postgres: { receivers: [postgresql], processors: [batch], exporters: [prometheusremotewrite] }
metrics/rabbit: { receivers: [rabbitmq], processors: [batch], exporters: [prometheusremotewrite] }
34) Service Graph and Topology
processors:
servicegraph:
store: in-memory
latency_histogram_buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s]
exporters:
otlp/graph: { endpoint: http://graphstore:4317, tls: { insecure: true } }
service:
pipelines:
traces/graph:
receivers: [otlp]
processors: [servicegraph, batch]
exporters: [otlp/graph]
35) Multi-Region Topologies
- Regional collectors with local backends; global query via federation
- Tail-sample locally; export head samples to global for high-level graphs
- Failover: use queue exporters; persistent buffers
36) Prometheus Histograms and Exemplars
# Prometheus must be started with exemplar storage enabled:
# prometheus --enable-feature=exemplar-storage
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{job="api"}[5m])) by (le))
37) SLO Playbooks (Detailed)
Availability SLO 99.9%
- SLIs: non-5xx / total
- Alerts: burn rate 14.4 over 5m/1h critical; 6 over 30m/6h warning
- Actions: rollback recent change, scale, cache responses
Latency SLO p95 < 250ms
- SLIs: p95 per route
- Alerts: p95 > 250ms for 10m
- Actions: profile, reduce payloads, add cache, adjust timeouts
38) Privacy Engineering
- Data inventory: identify PII-bearing routes
- Redaction: processors.attributes delete sensitive headers
- Pseudonymization: hash IDs before attributes
- Access: restrict debug logs with sensitive info
39) Cost Modeling
signal,rate,unit_cost,monthly_estimate
traces,5k spans/min,$0.000001/span,$216
metrics,2M samples/min,$0.20/M,$17280
logs,200GB/day,$0.02/GB,$120
- Reduce metrics sample rate; merge labels
- Aggressive sampling on low-value spans
- Log at INFO for business events; DEBUG only in staging
40) Vendor/Managed Exporters (Patterns)
exporters:
otlphttp/datadog: { endpoint: https://api.datadoghq.com, headers: { DD-API-KEY: ${DD_API_KEY} } }
otlphttp/newrelic: { endpoint: https://otlp.nr-data.net, headers: { api-key: ${NR_LICENSE_KEY} } }
41) Blue/Green Observability
- Tag deployments with color=blue|green
- Compare p95 and error % between colors before switch
- Rollback if deltas exceed thresholds
42) Golden Traces and Synthetic Monitoring
- Schedule synthetic journeys; tag spans synthetic=true
- Keep golden traces to detect regressions quickly
43) Example Dashboards (Expanded)
{
"title": "API Overview",
"panels": [
{"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
{"type":"stat","title":"Error %","targets":[{"expr":"(sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))*100"}]},
{"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]},
{"type":"state-timeline","title":"Releases","targets":[{"expr":"max_over_time(release_version[24h])"}]}
]
}
44) Alert Routing and Ownership
route:
receiver: default
routes:
- matchers: ['team="payments"']
receiver: payments-oncall
receivers:
- name: payments-oncall
pagerduty_configs: [{ routing_key: ${PD_KEY} }]
45) Extended Runbooks
Problem: Slow DB spans
- Action: index missing, N+1 queries, add caching; tag spans with db.operation
Problem: Collector queue full
- Action: scale replicas, increase memory_limiter, reduce exporters
Problem: No exemplars in dashboards
- Action: ensure exemplar storage; propagate trace_id in histograms
46) Span Links, Batch, and Fan-out Patterns
- Span links: relate independent spans (e.g., retries, fan-out jobs)
- Batch processing: one parent producing many child tasks; link child spans back to original trigger
- Idempotency: tag spans with idempotency key to group retries
import { ROOT_CONTEXT } from '@opentelemetry/api'
const parent = tracer.startSpan('batch.process')
// The worker span is not a child of the trigger; it carries a link back to the triggering span instead
const child = tracer.startSpan('worker.handle', { links: [{ context: parent.spanContext() }] }, ROOT_CONTEXT)
47) Data Quality and Schema Evolution
- Stable span names; avoid dynamic values in names
- Attribute schemas versioned; deprecate with processors.transform
- Metrics views: freeze bucket boundaries across services
- Logging schemas: timestamp, severity, body, attributes; include trace_id
48) Collector Scaling and HA Patterns
- Per-node agents → regional gateways → multi-region aggregation
- HPA on queue length and CPU; surge upgrades for zero downtime
- Persistent queues for exporters (file_storage) to survive restarts
exporters:
otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true }, sending_queue: { enabled: true, num_consumers: 8, queue_size: 5000 }, retry_on_failure: { enabled: true } }
extensions:
file_storage: { directory: /var/lib/otel-collector/queue }
service:
extensions: [file_storage]
49) Backpressure and Retry Tuning
exporters:
prometheusremotewrite:
endpoint: http://mimir/api/v1/push
external_labels: { env: prod }
resource_to_telemetry_conversion: { enabled: true }
retry_on_failure: { enabled: true, initial_interval: 1s, max_interval: 30s, max_elapsed_time: 300s }
sending_queue: { enabled: true, num_consumers: 4, queue_size: 10000 }
50) Security Hardening for Collector
- mTLS between agents and gateway; client cert rotation
- RBAC and network policies; isolate from internet
- Secret management via mounted files or CSI; no secrets in configs
51) Correlating Logs ↔ Traces ↔ Metrics
- Ensure trace_id in logs; enable exemplars on histograms with trace_id
- Build panels that jump from p95 to example trace
- Use log queries filtered by trace_id from selected traces
52) Prometheus Remote Write Nuances
- Prometheus remote write requires cumulative temporality; reserve delta for backends that support it
- Beware staleness markers on instance restarts; use scrape health panels
- Align histogram buckets across services to combine correctly
53) DORA and Engineering Metrics via OTel
- Lead Time for Changes: derive from deploy events
- Deployment Frequency: counter per service/environment
- Change Failure Rate: SLO breach or rollback count / total deploys
- MTTR: incident open → resolved from alert acknowledgments
54) Business KPIs with OTel Metrics
- orders.created.count, payment.success.rate, signup.latency.p95
- Tag with tenant, region, channel; avoid personally identifiable data
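A short sketch of emitting these as OTel instruments; the instrument names follow the bullet above and the tag values are illustrative:
import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('payments-api')
const ordersCreated = meter.createCounter('orders.created.count')
const signupLatency = meter.createHistogram('signup.latency')

// Tag with low-cardinality business dimensions only; never user identifiers.
ordersCreated.add(1, { tenant: 'acme', region: 'us-east-1', channel: 'web' })
signupLatency.record(320, { region: 'us-east-1' })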
55) eBPF Host Metrics + OTel
- Integrate node exporter / eBPF agents; scrape via Collector
- Correlate CPU steal, disk latency with p95 spikes
56) On-Call Playbooks (Deep)
Missing Traces for Specific Route
- Check instrumentation: auto vs manual; headers propagated?
- Collector tail sampler thresholds too strict? relax temporarily
- Backend ingestion health; exporter retries/queue backlog
Cardinality Explosion Detected
- Identify top labels; drop via views/transform
- Replace user/session IDs with hashed or remove entirely
- Re-deploy libraries with conservative defaults
Slow Query in Dashboards
- Switch from raw to rollup; ensure recording rules
- Use exemplars to jump to trace and find bottleneck
57) Recording Rules and Rollups
groups:
- name: api-latency
interval: 1m
rules:
- record: job:http_server_duration_seconds_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, job))
58) Canned Dashboards (JSON Excerpts)
{
"title": "Service Map",
"panels": [
{"type":"nodeGraph","title":"Dependencies","options":{"show":true}}
]
}
59) Multi-Region Failover Drills
- Simulate region failure: route exporters to secondary
- Validate queue drain and data integrity on recovery
- Compare SLOs and data completeness across regions
60) SLA/Contract Reporting
- Provide monthly SLO reports with error budget usage
- Include outage timelines, root cause, and corrective actions
61) Common Pitfalls and Anti-Patterns
- Dynamic span names and high-cardinality labels
- Logging entire payloads; leaking secrets in attributes
- Over-sampling in dev but under-sampling in prod
Mega FAQ (701–1100)
- How to enforce resource attributes across services? Collector transform processor upserts; test in CI.
- Is baggage safe for PII? Avoid PII; keep light context like tenant or plan.
- Should I sample errors at 100%? Often yes; cap if volume is extreme.
- What ratio for head sampling? Start at 1–10%; adjust based on cost and utility.
- Can we enrich spans with DB explain plans? Yes, sparingly; hide behind debug flags; avoid in prod by default.
- Log volumes too high? Reduce the log level; route business logs to analytics separately.
- Should I trace health endpoints? No; drop them via a filter, they are wasteful.
- How to detect missing exemplars? Dashboard panel with exemplar presence rate.
- Alert on exporter failures? Yes; queue depth and retry counts.
- Versioning instrumentation libraries? Pin and roll out gradually; note breaking changes.
- Do I need OTel for cron jobs? Yes; trace job start/end and success/failure.
- Compress OTLP? Enable gzip where supported.
- Data retention best practice? Metrics long-term at rollups; traces short-term at high fidelity.
- Alert duplicate suppression? Configure Alertmanager grouping and inhibition.
- Who owns dashboards? Service teams own service dashboards; the platform maintains templates.
- Golden dashboard tests? Load synthetic data; validate panels and alerts.
- Security scans on the Collector image? Yes; treat it as critical path.
- Multi-tenant noise isolation? Per-tenant pipelines and quotas.
- Track deploy impact automatically? Annotate metrics with release version; timeline panels.
- Can we push metrics from the browser? Limited; prefer tracing plus backend metrics.
- OpenMetrics vs OTLP for metrics? Both are valid; choose by ecosystem and consolidation goals.
- Are exemplars expensive? Minimal overhead; store a limited number of exemplars per bucket.
- Correlate incidents with changes? Use release annotations; overlay with SLO charts.
- How to run the Collector at the edge? Lightweight config; forward to a gateway; persistent queues.
- Avoid duplicate spans across proxies? Disable double instrumentation; check for proxies adding spans.
- Detect sampling bias? Compare sampled vs total rates; tune policies.
- Anonymize IPs in logs? Hash or truncate; comply with privacy requirements.
- Exporter out-of-order errors? Ensure monotonic histograms; reset counters properly.
- Reduce cardinality in RED panels? Aggregate by route template; avoid query params.
Final: measure, test, and prune relentlessly.
62) Advanced Language Instrumentation Recipes
62.1 Node.js Manual Spans and Attributes
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('payments')
await tracer.startActiveSpan('charge.create', async (span) => {
try {
span.setAttribute('payment.method', 'card')
span.setAttribute('tenant.id', hash(tenantId))
const res = await charges.create(payload)
span.setStatus({ code: SpanStatusCode.OK })
return res
} catch (e) {
span.recordException(e as Error)
span.setStatus({ code: SpanStatusCode.ERROR })
throw e
} finally { span.end() }
})
62.2 Go Context Propagation (HTTP → Kafka)
ctx, span := tracer.Start(ctx, "publish.order")
headers := make([]kafka.Header, 0)
prop := propagation.TraceContext{}
carrier := propagation.MapCarrier{}
prop.Inject(ctx, carrier)
for k, v := range carrier { headers = append(headers, kafka.Header{Key: k, Value: []byte(v)}) }
producer.WriteMessages(ctx, kafka.Message{Key: []byte(orderID), Headers: headers, Value: payload})
span.End()
62.3 Python Async Tasks (Celery/Arq)
ctx = baggage.set_baggage("tenant", tenant)
with tracer.start_as_current_span("job.process", context=ctx) as span:
span.set_attribute("job.type", job_type)
do_work()
62.4 Java Custom Attributes (Spring)
Span span = Span.current();
span.setAttribute("user.plan", user.getPlan());
62.5 .NET Enrich Handlers
builder.Services.AddOpenTelemetry().WithTracing(b => b
    .AddAspNetCoreInstrumentation(o => {
        o.EnrichWithHttpRequest = (activity, request) => activity.SetTag("http.request_id", request.Headers["X-Request-ID"].ToString());
    }));
63) Exporters Matrix and Tips
- OTLP gRPC: high performance, binary
- OTLP HTTP: firewall-friendly
- Prometheus: pull-based metrics; use remote write for long-term storage
- Logs: OTLP → Loki/Elastic; ensure trace_id in payload
64) Sampling Strategies Compared
- Head random: simple, uniform view
- Tail policy-based: keep errors/slow; lower cost
- Dynamic adaptive: adjust to targets (keep 100% errors, 10% normal)
- Hybrid: head 10% + tail errors/slow
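For the head portion of the hybrid approach, a sketch of the SDK-side sampler (10% parent-based ratio shown); the Collector tail policies below then keep errors, slow spans, and important routes:
import { NodeSDK } from '@opentelemetry/sdk-node'
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

// Head-sample ~10% of new traces; child spans follow the parent's decision.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
})
sdk.start()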
processors:
tail_sampling:
policies:
- name: keep-errors
type: status_code
status_code: { status_codes: [ERROR] }
- name: keep-slow
type: latency
latency: { threshold_ms: 400 }
- name: keep-important-routes
type: string_attribute
string_attribute: { key: http.route, values: ["/checkout","/login"], enabled_regex_matching: false }
65) Logs Pipelines with Redaction
processors:
attributes/logs-redact:
actions:
- key: http.request.header.cookie
action: delete
- key: user.email
action: hash
exporters: { otlphttp/logs: { endpoint: http://loki:4318 } }
66) Governance and Ownership
- Owners: each service owns its dashboards/alerts; platform owns shared collectors and templates
- Change policy: dashboards and alerts reviewed in PRs with code owners
- Weekly ops: SLO review, error budget status, toil tracking
67) Recording Rules Library
- record: service:http_requests:rate1m
expr: sum(rate(http_server_requests_total[1m])) by (service)
- record: service:http_errors:rate5m
expr: sum(rate(http_server_errors_total[5m])) by (service)
- record: service:http_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))
68) Example SLO Documents (Templates)
Service: Checkout API
SLIs: Availability (1 - 5xx/total), p95 latency < 250ms
SLOs: 99.9% monthly, p95 < 250ms
Error Budget: 43.2m/mo
Policies: fast/slow burn, freeze on 2x over 1h
69) Synthetic Journeys
- Login → Search → Add to Cart → Checkout
- Tag spans synthetic=true; exclude from user-facing metrics
- Alert if synthetic fails 3 times consecutively
70) Cost Guardrails
- Max metrics series per service: 50k
- Max attributes per span: 32 (drop overage)
- Alert on cardinality spikes; gate deploys if threshold exceeded
71) Data Retention Tiers
- Traces: 3d full, 14d sampled, 90d summaries
- Metrics: 15m raw, 1h rollup (30d), 6h rollup (1y)
- Logs: 7d hot, 90d warm, 1y archive
72) Common Dashboard Panels (Catalog)
- API RED
- Dependency latency (DB/Cache/External)
- Error taxonomy (client/server/dependency)
- Saturation (CPU/mem/threads)
- Release markers and regression panels
73) Example Policy: Attribute Dropper
processors:
attributes/drop-high-card:
actions:
- key: http.user_agent
action: delete
- key: user.id
action: delete
74) Resilience Tests
- Kill collector pod: exporters queue and recover
- Block backend for 2 minutes: retry queue holds; no data loss
- Spike traffic 5x: HPA scales gateway; no alert floods
75) Edge/IoT Notes
- Lightweight collectors; batch and forward when online
- Use exponential backoff; local ring buffers
Mega FAQ (1101–1500)
- Should I correlate CI/CD events? Yes; annotate dashboards and traces with the release version.
- Best way to detect regressions? Golden traces, synthetic checks, and SLO burn alerts.
- How to quantify noise? Track pages/tickets per week and per team; reduce by policy.
- When to shard collectors? At CPU/memory or queue saturation; shard by service or tenant.
- Onboarding a new service? Templates for instrumentation, dashboard, alerts, and the SLO doc.
- Does OTLP need TLS? Yes in prod; mTLS recommended.
- Attribute limits? Enforce via processors; reject oversized payloads.
- DB connection pools as SLI? Yes; expose pool metrics; alert on saturation.
- Frontend LCP/CLS? Export RUM metrics; correlate with backend p95.
- Log sampling? Sample non-error logs; keep error logs at a higher rate.
- Can we aggregate logs into span events? For critical paths; it reduces log volume.
- TraceId collisions? Negligible with proper libraries; don't roll your own.
- Infra-only spans? Avoid; focus on app/business spans.
- Prometheus remote write retries? Tune backoff and queue; watch body size limits.
- Validate buckets? Compare p95/p99 error; align across services.
- Can I push metrics from Lambdas? Yes, via OTLP; batch and avoid cold-start overhead.
- Detect N+1 queries? DB spans clustered by route; alert on spikes.
- How to roll out dashboards? Versioned JSON; validate in CI; promote.
- Alert ownership? Service team on-call; platform for shared infra.
- SLO debt? Track error budget burn and backlog; pause features when over budget.
- Secure collectors? RBAC, network policies, mTLS, and no public ingress.
- Should I drop IP addresses? Yes, in logs where privacy laws apply; hash or truncate.
- Combine OTLP and vendor agents? Prefer OTLP everywhere; bridge if needed.
- Is logging necessary if tracing exists? Yes, for details and compliance; keep it structured and lean.
- What about GraphQL? Span per resolver or operation; label fields carefully.
- Snowballing costs? Watch cardinality and duplicate instrumentation.
- Business SLOs vs tech SLOs? Track both; business SLOs reflect user outcomes.
- Post-incident improvements? Add panels, alerts, and runbook steps; validate fixes.
- Should I trace streaming? Yes; use messaging conventions and span links.
Final: observability is a product; own it.
76) Message-Driven and Evented Systems
- Propagate context through headers: traceparent, tracestate, baggage
- Use span links when processing batches or retries
- Model consumer spans with messaging.operation=process; publisher spans with publish
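A consumer-side sketch in Node.js: extract the propagated context from message headers (assumed here to be a plain string map) and start a process span under it:
import { context, propagation, trace, SpanKind } from '@opentelemetry/api'

function handleMessage(headers: Record<string, string>, body: Buffer) {
  // Restore the producer's context from the traceparent/tracestate headers.
  const parentCtx = propagation.extract(context.active(), headers)
  const tracer = trace.getTracer('orders-worker')
  const span = tracer.startSpan('orders process', {
    kind: SpanKind.CONSUMER,
    attributes: { 'messaging.system': 'kafka', 'messaging.operation': 'process' },
  }, parentCtx)
  // ... process the message, then end the span ...
  span.end()
}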
processors:
transform/messaging:
trace_statements:
- context: span
statements:
- set(attributes["messaging.system"], "kafka") where attributes["messaging.system"] == nil
- set(attributes["messaging.operation"], "process") where attributes["messaging.operation"] == nil
77) gRPC, GraphQL, and Streaming
- gRPC: use interceptors; name spans by method; include status
- GraphQL: span per operation; avoid field-level high cardinality
- Streaming: long-lived spans with events or chunked child spans
78) Recording Rules for SLO Reports
groups:
- name: slo
rules:
- record: service:sli_availability:ratio
expr: 1 - (sum(rate(http_server_errors_total[5m])) by (service)/sum(rate(http_server_requests_total[5m])) by (service))
- record: service:sli_latency_p95
expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))
79) Data Contracts for Observability
- Required: service.name, deployment.environment, http.method, http.route, status
- Prohibited: PII (email, phone, exact IP where restricted)
- Stability: span names must be stable across releases
80) Conformance Tests in CI
- name: conformance
run: |
otel-lint --config .otel-lint.yaml src/**
promtool check rules rules/*.yaml
jq . dashboards/*.json > /dev/null
81) Collector Config Patterns (Per-Tenant)
connectors: { }
processors:
attributes/tenant_a:
actions: [ { key: tenant, value: A, action: upsert } ]
attributes/tenant_b:
actions: [ { key: tenant, value: B, action: upsert } ]
service:
pipelines:
traces/tenant_a: { receivers: [otlp], processors: [attributes/tenant_a, batch], exporters: [otlp/tempo] }
traces/tenant_b: { receivers: [otlp], processors: [attributes/tenant_b, batch], exporters: [otlp/tempo] }
82) Dashboards: Drill-Down Workflows
- Start at RED; jump to route panel → exemplar → trace
- From trace, pivot to logs via trace_id filter
- From logs, identify error class and link to runbooks
83) Incident Review Template
- Timeline with trace screenshots and SLO charts
- Root cause with spans and dependencies
- Fixes: code, infra, and alert tuning
- Follow-ups: tests, dashboards, and docs
84) Privacy Impact in Observability
- Data minimization: only metadata needed for operations
- User controls: opt-out for RUM; anonymized IPs
- Audit: evidence of redaction and access control reviews
85) Cost Guardrail Policies in Pipelines
processors:
filter/drop-health:
traces:
span:
- 'attributes["http.target"] == "/healthz"'
transform/drop-noisy:
trace_statements:
- context: span
statements:
- delete_key(attributes, "http.user_agent")
86) Multi-Cloud Export Patterns
exporters:
otlp/aws: { endpoint: https://otlp.amp.aws, headers: { Authorization: ${AWS_TOKEN} } }
otlp/azure: { endpoint: https://otlp.monitor.azure.com, headers: { Authorization: ${AZ_TOKEN} } }
otlp/gcp: { endpoint: https://otlp.googleapis.com, headers: { Authorization: ${GCP_TOKEN} } }
87) Blue/Green Release Validation Panels
{
"title": "Blue vs Green",
"panels": [
{"type":"timeseries","title":"p95 Blue","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='blue'}[5m])) by (le))"}]},
{"type":"timeseries","title":"p95 Green","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='green'}[5m])) by (le))"}]}
]
}
88) Golden Path Templates
- New Service Template: instrumentation boilerplate + dashboards + alerts
- SLO Template: SLIs, targets, burn alerts, ownership
- Collector Template: agent + gateway, tail sampling, exporters
89) Security Benchmarks for OTel
- No open collectors to the internet
- mTLS between all exporters and gateways
- Periodic scans and SBOM for collector images
90) Future: OTel Profiles/Continuous Profiling
- Integrate CPU/heap profiles with traces for deep RCA
- Correlate profile samples to spans via exemplars
Mega FAQ (1501–1700)
- How many buckets are too many? If query times suffer or memory spikes; start with 10–20, aligned across services.
- Should I export 100% of logs? No; sample and structure; keep error logs, reduce info noise.
- How do I keep trace names stable? Template: VERB + route; avoid IDs; enforce via linting.
- Are span events better than logs? For critical-path details, yes; still keep structured logs for breadth.
- What SLOs for the control plane? Collector uptime 99.9%; exporter failure rate < 0.5%.
- Can dashboards be versioned? Yes; JSON in the repo; review via PRs; promote via GitOps.
- How to cut costs fast? Drop health-check spans, reduce labels, tail-sample aggressively.
- p99 vs p95? Start with p95 for stability; add p99 for critical flows.
- Should I sample frontend traces? Yes; e.g., 5–10% with a bias toward errors.
- Drop big payloads? Avoid logging payloads; redact and summarize.
- Exporter TLS errors? Check certs/CA, time skew, and SNI.
- Multi-tenant query controls? Label-based isolation and query guards.
- Are histograms better than summaries? Yes; mergeable across instances and exemplar-friendly.
- Tracing cron jobs? Yes; trace the run plus child tasks; alert on failures.
- How to detect duplicate spans? Dedup in analysis; fix double instrumentation (proxy + lib).
- Managed vendor vs self-host? Consider staff/time; self-host for control, managed for speed.
- Should I expose dashboards publicly? No; protect with SSO; export reports when needed.
- Store PII in logs? Avoid; use privacy engineering and redaction.
- Link traces to support tickets? Store trace_id in ticket metadata.
Final advice: own observability as a product.
91) Minimal Adoption Playbook
- Week 1: instrument 1 critical service (traces+metrics)
- Week 2: dashboards and SLOs; tail sampling
- Week 3: logs correlation; alerts; runbooks
- Week 4: template and scale to next 5 services
Micro FAQ (1701–1740)
- Lock down OTLP endpoints? Allow only internal networks; mTLS required.
- Merge multi-language services? OTLP normalizes; enforce semantic conventions.
- Spike in unknown routes? Missing route templates; fix instrumentation.
- Exporter backpressure signals? Queue depth, retry rate, and exporter errors.
- Per-tenant error budgets? Attributes plus grouping; dashboards and Alertmanager routes.
- Does OTel replace APM? OTel is the standard; many APMs ingest OTLP.
- Keep span events small? Yes; summarize; avoid large arrays.
- Blue/green SLO gates? Block the switch if SLO deltas exceed the threshold.
- Collector on Windows? Supported; align with service configs.
Final: iterate, measure, improve.
Micro FAQ (1741–1760)
- Alert dedup across regions? Use group labels and inhibit duplicates.
- Trace context in Kafka headers? Yes; traceparent and tracestate; baggage optional.
- Track deploy impact panels? Release timeline plus p95/error overlays.
- Enforce route templates? Lint and CI tests; deny deploys on violations.
- Validate storage health? Backend write/read SLOs and error panels.
- Merge service graphs across teams? Use resource attributes and filters per team.
- Cost showback by signal? Ingest bytes and series per team; dashboards.
- Isolate collector crashes? Separate deployments per zone; circuit breakers.
- OTel and feature flags? Tag spans with flag state; analyze impact.
Done.
92) Reference Links and Learning Path
- Start: auto-instrument one service; add dashboard and SLO
- Grow: tail sampling, logs correlation, team ownership
- Mature: multi-region, cost guardrails, privacy program
Micro FAQ (1761–1800)
- Per-route SLO exceptions? Yes, for non-critical routes; document and monitor.
- Detect leakage of PII? Scan attributes/logs; block keys; alert.
- Inventory of metrics? Generate from metadata; clean up unused series.
- Can I export to multiple backends? Yes; multiple exporters per pipeline.
- Collector config drift? GitOps; diff and alert on drift.
- Alert audit? Track acknowledgements and response times.
- Mimir/Thanos retention tiers? Configure downsampling and object storage.
- Span enrichment from the environment? resourcedetection processor; env/system/k8s detectors.
- High churn in series? Avoid dynamic labels; use recording rules.
Final: instrument, correlate, and iterate.
Micro FAQ (1801–1810)
- Keep rollout safety? Gate by SLOs and error budget burn.
- Sampling metrics too? Prefer aggregation; avoid randomly sampling metrics.
- Alert descriptions? Include runbook links and dashboards.
- Policies as code for alerts? Yes; store alert JSON/YAML in the repo.
Done.