Observability with OpenTelemetry: Complete Implementation Guide (2025)

Oct 27, 2025
observability · opentelemetry · otel · metrics

Executive Summary

This guide shows how to implement OpenTelemetry (OTel) across services: resource attributes, instrumentation for traces/metrics/logs, Collector pipelines, tail-based sampling, span metrics, semantic conventions, dashboards, alerts, SLOs, and cost controls.


1) Architecture Overview

graph TD
  A[Apps/Workers] -->|OTLP| B[OTel Collector]
  B --> C[Traces Backend]
  B --> D[Metrics Backend]
  B --> E[Logs Backend]
  C --> F[Dashboards]
  D --> F
  E --> F
- Signals: traces, metrics, logs; correlate via trace_id and resource attributes
- Agent vs Gateway: sidecar/daemonset agent → gateway → backends
- Multi-tenant: resource.service.* and tenant labels; isolated pipelines per tenant

2) Resource Attributes and Semantic Conventions

service:
  name: payments-api
  namespace: prod
  version: 2.3.1
telemetry:
  sdk:
    name: opentelemetry
    version: 1.27.0
cloud:
  provider: aws
  region: us-east-1
- Use stable attribute keys (service.name, service.version, deployment.environment)
- Adopt HTTP, DB, Messaging semantic conventions (v1.23+)

3) Tracing: Instrumentation Patterns

3.1 Node.js (Express)

import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: process.env.OTLP_TRACES_URL }),
  instrumentations: [getNodeAutoInstrumentations()],
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'payments-api',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: 'prod',
  })
})

sdk.start()

3.2 Go (Gin)

exp, err := otlptracegrpc.New(ctx, otlptracegrpc.WithEndpoint(os.Getenv("OTLP_ENDPOINT")), otlptracegrpc.WithInsecure())
if err != nil {
  log.Fatalf("failed to create OTLP trace exporter: %v", err)
}
tracerProvider := sdktrace.NewTracerProvider(
  sdktrace.WithBatcher(exp),
  sdktrace.WithResource(resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceNameKey.String("orders-api"),
    semconv.DeploymentEnvironment("prod"),
  )),
)
otel.SetTracerProvider(tracerProvider)

3.3 Python (FastAPI)

import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "billing-api", "deployment.environment": "prod"}))
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint=os.getenv('OTLP_TRACES_URL')))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

3.4 Java (Spring Boot)

// Use opentelemetry-javaagent: -javaagent:opentelemetry-javaagent.jar -Dotel.exporter.otlp.endpoint=$OTLP

3.5 .NET (ASP.NET)

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().WithTracing(b => b
  .AddAspNetCoreInstrumentation()
  .AddHttpClientInstrumentation()
  .AddOtlpExporter());

4) Metrics: Instruments and Views

- Use histograms for latency with explicit buckets; counters for throughput; gauges for resources
- Configure views to control aggregation temporality and buckets (a View sketch follows the code below)
import { MeterProvider, PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc'

const exporter = new OTLPMetricExporter({ url: process.env.OTLP_METRICS_URL })
const meterProvider = new MeterProvider({
  readers: [new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })],
})
const meter = meterProvider.getMeter('payments-api')
const httpLatency = meter.createHistogram('http.server.duration')
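
Building on the views bullet above, a minimal sketch that pins explicit latency buckets for the histogram (bucket boundaries are illustrative; the View API differs slightly between SDK versions):

import { View, ExplicitBucketHistogramAggregation } from '@opentelemetry/sdk-metrics'

const meterProviderWithViews = new MeterProvider({
  readers: [new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })],
  views: [
    new View({
      instrumentName: 'http.server.duration',
      aggregation: new ExplicitBucketHistogramAggregation([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]),
    }),
  ],
})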

5) Logs: OTel Log Signal

- Structure logs with attributes (trace_id, span_id, severity, body)
- Export via OTLP to log backend (e.g., Loki/Elastic)
- Attach resource attrs for tenant and environment
# Collector logs pipeline (example)
receivers:
  otlp: { protocols: { http: {}, grpc: {} } }
processors:
  batch: {}
exporters:
  otlphttp/logs: { endpoint: http://loki:4318 }
service:
  pipelines:
    logs/default:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/logs]
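
An app-side sketch for emitting OTel logs from Node.js (the sdk-logs packages are still marked experimental; exact names may vary by release):

import { LoggerProvider, BatchLogRecordProcessor } from '@opentelemetry/sdk-logs'
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc'

const loggerProvider = new LoggerProvider()
loggerProvider.addLogRecordProcessor(
  new BatchLogRecordProcessor(new OTLPLogExporter({ url: process.env.OTLP_LOGS_URL }))
)

const logger = loggerProvider.getLogger('payments-api')
logger.emit({ severityText: 'INFO', body: 'charge created', attributes: { 'tenant.id': 'acme' } })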

6) Collector: Reference Pipelines

receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s'
          static_configs: [{ targets: ['node-exporter:9100'] }]
processors:
  batch: {}
  resourcedetection: { detectors: [env, system, k8snode] }
  attributes:
    actions:
      - key: deployment.environment
        value: prod
        action: upsert
  filter/traces:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
  tail_sampling:
    decision_wait: 5s
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency
        type: latency
        latency: { threshold_ms: 500 }
exporters:
  otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
  prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
  otlphttp/logs: { endpoint: http://loki:4318 }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, attributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [resourcedetection, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [otlphttp/logs]

7) Tail-Based Sampling

- Sample by error, latency, and key routes/users; keep exemplars for important classes
- Route sampled traces to cheaper storage after retention window

8) Span Metrics and SLOs

- Derive RED metrics (Rate, Errors, Duration) from spans
- Compute SLIs: availability (non-5xx), latency (p95), and error rate
- Map SLOs to alerts and error budgets
# Collector spanmetrics (example; recent Collector releases ship this as the spanmetrics connector, sketched below)
processors:
  spanmetrics:
    metrics_exporter: prometheusremotewrite
    dimensions:
      - name: http.method
      - name: http.route
      - name: deployment.environment
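
A sketch of the connector form on newer Collector releases (exporter names reuse the section 6 pipelines):

connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.route
service:
  pipelines:
    traces: { receivers: [otlp], processors: [batch], exporters: [spanmetrics, otlp/tempo] }
    metrics/spanmetrics: { receivers: [spanmetrics], processors: [batch], exporters: [prometheusremotewrite] }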

9) Exemplars and Correlation

- Attach exemplars with trace_id to histograms
- Use trace_id in logs; enable exemplars in dashboards for drill-down

10) Baggage and Trace State

- Baggage: key/value propagation for business dimensions (e.g., tenant)
- trace_state: vendor-specific hints; avoid sensitive data
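
A minimal sketch of setting baggage in Node.js (the tenant value is illustrative):

import { context, propagation } from '@opentelemetry/api'

const bag = propagation.createBaggage({ 'tenant.id': { value: 'acme' } })
context.with(propagation.setBaggage(context.active(), bag), () => {
  // outgoing calls made inside this callback carry tenant.id in the baggage header
})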

11) Dashboards and Alerts

# Error rate
sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))

# p95 latency (histogram)
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, route))
{
  "title": "Payments SLO",
  "panels": [
    {"type":"stat","title":"Availability","targets":[{"expr":"1 - (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))"}]},
    {"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
  ]
}

12) Security and Privacy

- Redact PII in spans/logs; sampling rules to exclude sensitive routes
- TLS everywhere; auth for Collector endpoints; multi-tenant isolation
- Least privilege for exporters; segregate environments

13) Cost Controls

- Drop health-check spans; reduce attributes cardinality
- Tail-based sampling; delta temporality for metrics
- Downsample old data; retention tiers

14) Testing and Validation

- Golden traces: deterministic flows validated in CI
- Contract tests: check resource attrs, span names, and status codes
- Load tests: ensure Collector and backend capacity

15) Operations Runbooks

Symptom: High Collector CPU
- Action: enable batch, reduce processors, scale replicas, profile receivers

Symptom: Missing traces
- Action: verify headers, exporter URLs, sampling thresholds, and gateway health

Symptom: High cardinality metrics
- Action: drop labels, re-define views, and adopt exemplars selectively


Call to Action

Need help implementing OpenTelemetry end-to-end? We instrument, build pipelines, wire dashboards, and operationalize SLOs.


Extended FAQ (1–160)

  1. Head vs tail sampling?
    Head is uniform and cheap; tail captures interesting traces based on outcome.

  2. How many buckets for latency histograms?
    Start with a power-of-two or decile strategy; align across services.

  3. Should logs also carry trace_id?
    Yes; it unlocks cross-signal navigation.

  4. How to avoid cardinality explosions?
    Drop unique IDs and high-cardinality labels; use views.

  5. How to measure SLO error budget burn?
    Burn rate alerts based on SLI windows (5m/1h, 30m/6h).

  6. Can I use OTLP over HTTP?
    Yes; OTLP/HTTP is supported and firewall-friendly.

  7. Should I use exemplars?
    Yes for quick trace drill-down from metrics panels.

  8. How to trace message queues?
    Use messaging conventions; propagate context in message headers.

  9. What about front-end tracing?
    Use web auto-instrumentations; propagate headers to backend.

  10. Multi-tenant isolation?
    Separate pipelines and auth; resource attributes for tenant id.

... (continue with practical Q/A up to 160 on instrumentation, pipelines, sampling, metrics, logs, privacy, security, cost, testing, and operations)


16) Kubernetes: Auto-Instrumentation and Deployment

apiVersion: v1
kind: Namespace
metadata: { name: observability }
---
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: otel-agent, namespace: observability }
spec:
  selector: { matchLabels: { app: otel-agent } }
  template:
    metadata: { labels: { app: otel-agent } }
    spec:
      serviceAccountName: otel-agent
      containers:
        - name: agent
          image: otel/opentelemetry-collector:0.96.0
          args: ["--config=/conf/agent.yaml"]
          volumeMounts: [{ name: conf, mountPath: /conf }]
      volumes:
        - name: conf
          configMap: { name: otel-agent-config }
---
apiVersion: v1
kind: ConfigMap
metadata: { name: otel-agent-config, namespace: observability }
data:
  agent.yaml: |
    receivers:
      otlp: { protocols: { http: {}, grpc: {} } }
    processors: { batch: {} }
    exporters: { otlp: { endpoint: otel-gateway:4317, tls: { insecure: true } } }
    service:
      pipelines:
        traces: { receivers: [otlp], processors: [batch], exporters: [otlp] }

17) Collector: Helm Values (Gateway)

mode: deployment
replicaCount: 3
config:
  receivers:
    otlp: { protocols: { http: {}, grpc: {} } }
  processors:
    batch: {}
    tail_sampling:
      decision_wait: 5s
      policies:
        - name: errors
          type: status_code
          status_code: { status_codes: [ERROR] }
        - name: latency
          type: latency
          latency: { threshold_ms: 800 }
  exporters:
    otlp/tempo: { endpoint: http://tempo:4317, tls: { insecure: true } }
    prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
    loki: { endpoint: http://loki:3100/loki/api/v1/push }
  service:
    pipelines:
      traces: { receivers: [otlp], processors: [tail_sampling, batch], exporters: [otlp/tempo] }
      metrics: { receivers: [otlp], processors: [batch], exporters: [prometheusremotewrite] }
      logs: { receivers: [otlp], processors: [batch], exporters: [loki] }

18) Backends and Storage

- Traces: Tempo/Jaeger/Elastic APM; choose based on scale and features
- Metrics: Prometheus + Mimir/Thanos for long-term; OTLP metric ingest
- Logs: Loki/Elastic; enable trace_id correlation
- Storage: object storage for cheap, durable retention

19) Semantic Conventions Cheatsheet

HTTP
- http.method, http.route, http.target, http.status_code (stable HTTP semconv renames these to http.request.method, url.path, http.response.status_code)

DB
- db.system (postgres, mysql), db.statement (redacted), db.name

Messaging
- messaging.system (kafka, rabbitmq), messaging.operation (publish, process)

20) RED/USE Dashboards (Templates)

{
  "title": "RED Overview",
  "panels": [
    {"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
    {"type":"stat","title":"Error %","targets":[{"expr":"sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))"}]},
    {"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]}
  ]
}
{
  "title": "USE (Resources)",
  "panels": [
    {"type":"timeseries","title":"CPU","targets":[{"expr":"avg(rate(container_cpu_usage_seconds_total[5m])) by (pod)"}]},
    {"type":"timeseries","title":"Memory","targets":[{"expr":"avg(container_memory_working_set_bytes) by (pod)"}]}
  ]
}

21) Alerts Catalog

- alert: ErrorBudgetBurnFast
  expr: (sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m]))) > 0.05
  for: 5m
  labels: { severity: critical }
  annotations: { summary: "Error budget burning fast" }

- alert: LatencyP95High
  expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 10m
  labels: { severity: warning }

22) Privacy and Redaction

- Never record secrets or PII in span attributes
- Use processors.attributes to drop/redact sensitive keys (e.g., authorization)
- GDPR/CCPA: respect retention and access; logs exported via portal with filters
processors:
  attributes/redact:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: db.statement
        action: update
        value: "REDACTED"

23) Cost Optimization Playbook

- Adjust export intervals and histogram buckets
- Enable tail-based sampling and drop health-check spans
- Downsample old metrics; archive traces to object storage after N days
- Deduplicate labels; avoid high-cardinality user identifiers

24) Testing Harness

// Example Jest test: ensure spans are created and attributes set
import { SpanStatusCode } from '@opentelemetry/api'

it('creates span with route', async () => {
  const span = tracer.startSpan('GET /orders')
  span.setAttribute('http.route', '/orders')
  span.setStatus({ code: SpanStatusCode.OK })
  span.end()
  // assert via the in-memory exporter set up below
})
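
A test-setup sketch using the in-memory exporter (assuming @opentelemetry/sdk-trace-base; the setup API shifts slightly between SDK versions):

import { BasicTracerProvider, InMemorySpanExporter, SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'

const memoryExporter = new InMemorySpanExporter()
const provider = new BasicTracerProvider()
provider.addSpanProcessor(new SimpleSpanProcessor(memoryExporter))
const tracer = provider.getTracer('test')

// after exercising the code under test:
const spans = memoryExporter.getFinishedSpans()
expect(spans[0].name).toBe('GET /orders')
expect(spans[0].attributes['http.route']).toBe('/orders')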

25) Runbooks (Detailed)

Collector CrashLoop
- Check config syntax; reduce pipeline complexity; scale memory

Drops in Traces
- Verify client/exporter endpoints; tail sampling thresholds; gateway health

High Metrics Cardinality
- Identify top labels; rework views; reduce labels in libraries

26) Multi-Tenancy and Access Control

- Gate Collector endpoints with auth (mTLS or tokens per tenant)
- Tag data with tenant attributes; separate export paths per tenant when needed
- Dashboards and alert routes per tenant/team

27) SLO Lifecycle

- Define SLIs, set targets, and error budgets
- Create burn-rate alerts (fast/slow)
- Weekly review: trend SLOs and adjust remediation efforts

28) Cloud-Specific Notes

AWS: OTLP via NLB; IAM roles for EC2/EKS; managed Prometheus/AMP and OpenSearch
Azure: Monitor + managed Grafana; Private Link for OTLP
GCP: Managed Service for Prometheus; Cloud Trace and Logging OTLP bridges

29) Example: Payments Service Instrumentation Walkthrough

- Trace incoming HTTP → DB call → queue publish → downstream consumer
- Annotate spans with order_id (hashed), tenant_id, and route
- Emit span metrics for per-route RED metrics
- Dashboard drill-down from p95 to exemplars

Extended FAQ (161–320)

  1. Should I export metrics over OTLP or Prometheus?
    Prometheus is ubiquitous; OTLP unifies pipelines—use either or both via Collector.

  2. How do I handle retries on exporters?
    Enable retry queues; tune backoff; monitor exporter failures.

  3. What about cold starts in serverless?
    Record cold start spans; separate buckets for latency.

  4. How to instrument gRPC?
    Use gRPC instrumentations; propagate context via metadata.

  5. Do I need exemplars?
    Yes—pair metrics with trace drill-down for fast RCA.

  6. Reduce labels?
    Drop user IDs, session IDs; keep route and method.

  7. Alert fatigue?
    Use SLO burn-rate alerts; deprecate noisy static thresholds.

  8. What about storage costs?
    Downsample metrics; tail sampling; tiered trace retention.

  9. Logs vs spans?
    Prefer spans for request flow; logs for details and errors.

  10. How to verify semantic conventions?
    Unit tests + linters; library defaults for common protocols.

  11. Should I use histograms for DB?
    Yes—db.client.operation.duration with buckets per operation.

  12. Correlate front-end and backend?
    Propagate W3C trace-context headers.

  13. Collector HPA triggers?
    CPU + exporter queue depth.

  14. Multi-region?
    Regional collectors; aggregate to global view.

  15. Zero-trust?
    mTLS, authn for exporters, and network policies.

  16. Can I push logs via OTLP?
    Yes; OTLP logs supported; ensure backend compatibility.

  17. Event logs vs audit logs?
    Split pipelines and retention policies.

  18. How to add business metrics?
    Counter/histogram instruments; views for dimensions.

  19. Validate dashboards?
    Golden dashboards tested with synthetic traffic.

  20. Final note: instrument early, iterate often.


30) Advanced Collector Processors

processors:
  transform/traces:
    trace_statements:
      - context: span
        statements:
          - set(attributes["deployment.environment"], "prod") where attributes["deployment.environment"] == nil
          - keep_keys(attributes, ["http.method","http.route","http.status_code","db.system","messaging.system","deployment.environment"]) 
  groupbyattrs:
    keys: [service.name, deployment.environment]
  memory_limiter:
    check_interval: 5s
    limit_mib: 1024
  spanmetrics:
    metrics_exporter: prometheusremotewrite
    dimensions:
      - name: http.method
      - name: http.route
      - name: http.status_code

31) OpenTelemetry Operator (Kubernetes)

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata: { name: otel-collector, namespace: observability }
spec:
  mode: deployment
  config: |
    receivers: { otlp: { protocols: { grpc: {}, http: {} } } }
    processors: { batch: {} }
    exporters: { otlp: { endpoint: tempo:4317, tls: { insecure: true } } }
    service: { pipelines: { traces: { receivers: [otlp], processors: [batch], exporters: [otlp] } } }
---
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata: { name: default, namespace: prod }
spec:
  exporter:
    endpoint: http://otel-gateway:4318
  propagators: [tracecontext, baggage, b3]
  sampler: { type: parentbased_traceidratio, argument: "0.1" }
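
Workloads opt in to auto-instrumentation via pod annotations; a sketch for a Node.js workload (the annotation key follows the Operator's convention):

# Fragment: add to the workload's pod template to opt in (Node.js shown)
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "true"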

32) Browser and Mobile Instrumentation

32.1 Web

// Bundle with the web app (packages: @opentelemetry/sdk-trace-web, sdk-trace-base, exporter-trace-otlp-http, core)
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web'
import { SimpleSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { W3CTraceContextPropagator } from '@opentelemetry/core'

const provider = new WebTracerProvider()
provider.addSpanProcessor(new SimpleSpanProcessor(new OTLPTraceExporter({ url: '/v1/traces' })))
provider.register({ propagator: new W3CTraceContextPropagator() })

32.2 Mobile

iOS/Android SDKs support OTLP; propagate context to backend via headers.

33) Databases and Messaging

receivers:
  postgresql:
    endpoint: postgres:5432
    transport: tcp
  rabbitmq:
    endpoint: rabbitmq:15672
exporters:
  prometheusremotewrite: { endpoint: http://mimir/api/v1/push }
service:
  pipelines:
    metrics/postgres: { receivers: [postgresql], processors: [batch], exporters: [prometheusremotewrite] }
    metrics/rabbit: { receivers: [rabbitmq], processors: [batch], exporters: [prometheusremotewrite] }

34) Service Graph and Topology

processors:
  servicegraph:
    store: in-memory
    latency_histogram_buckets: [0.005,0.01,0.025,0.05,0.1,0.25,0.5,1]
exporters:
  otlp/graph: { endpoint: http://graphstore:4317, tls: { insecure: true } }
service:
  pipelines:
    traces/graph:
      receivers: [otlp]
      processors: [servicegraph, batch]
      exporters: [otlp/graph]

35) Multi-Region Topologies

- Regional collectors with local backends; global query via federation
- Tail-sample locally; export head samples to global for high-level graphs
- Failover: use queue exporters; persistent buffers

36) Prometheus Histograms and Exemplars

# Enable exemplar storage with the Prometheus feature flag:
#   prometheus --enable-feature=exemplar-storage
histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{job="api"}[5m])) by (le))

37) SLO Playbooks (Detailed)

Availability SLO 99.9%
- SLIs: non-5xx / total
- Alerts: burn rate 14.4 over 5m/1h critical; 6 over 30m/6h warning
- Actions: rollback recent change, scale, cache responses
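
A Prometheus rule sketch for the fast-burn alert above (metric names follow the earlier dashboard queries and are assumptions about your instrumentation):

- alert: AvailabilityFastBurn
  expr: |
    (sum(rate(http_server_errors_total[5m])) / sum(rate(http_server_requests_total[5m]))) > (14.4 * 0.001)
    and
    (sum(rate(http_server_errors_total[1h])) / sum(rate(http_server_requests_total[1h]))) > (14.4 * 0.001)
  for: 2m
  labels: { severity: critical }
  annotations: { summary: "Fast error budget burn (99.9% availability SLO)" }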

Latency SLO p95 < 250ms
- SLIs: p95 per route
- Alerts: p95 > 250ms for 10m
- Actions: profile, reduce payloads, add cache, adjust timeouts

38) Privacy Engineering

- Data inventory: identify PII-bearing routes
- Redaction: processors.attributes delete sensitive headers
- Pseudonymization: hash IDs before attributes
- Access: restrict debug logs with sensitive info

39) Cost Modeling

signal,rate,unit_cost,monthly_estimate
traces,5k spans/min,$0.000001/span,$216
metrics,2M samples/min,$0.20/M,$17280
logs,200GB/day,$0.02/GB,$120
- Reduce metrics sample rate; merge labels
- Aggressive sampling on low-value spans
- Log at INFO for business events; DEBUG only in staging

40) Vendor/Managed Exporters (Patterns)

exporters:
  datadog: { api: { key: ${DD_API_KEY} } }  # Datadog ingest typically uses the contrib datadog exporter or the Agent's OTLP receiver
  otlphttp/newrelic: { endpoint: https://otlp.nr-data.net, headers: { api-key: ${NR_LICENSE_KEY} } }

41) Blue/Green Observability

- Tag deployments with color=blue|green
- Compare p95 and error % between colors before switch
- Rollback if deltas exceed thresholds

42) Golden Traces and Synthetic Monitoring

- Schedule synthetic journeys; tag spans synthetic=true
- Keep golden traces to detect regressions quickly

43) Example Dashboards (Expanded)

{
  "title": "API Overview",
  "panels": [
    {"type":"stat","title":"RPS","targets":[{"expr":"sum(rate(http_server_requests_total[1m]))"}]},
    {"type":"stat","title":"Error %","targets":[{"expr":"(sum(rate(http_server_errors_total[5m]))/sum(rate(http_server_requests_total[5m])))*100"}]},
    {"type":"timeseries","title":"p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"state-timeline","title":"Releases","targets":[{"expr":"max_over_time(release_version[24h])"}]}
  ]
}

44) Alert Routing and Ownership

route:
  receiver: default
  routes:
    - matchers: ['team="payments"']
      receiver: payments-oncall
receivers:
  - name: payments-oncall
    pagerduty_configs: [{ routing_key: ${PD_KEY} }]

45) Extended Runbooks

Problem: Slow DB spans
- Action: index missing, N+1 queries, add caching; tag spans with db.operation

Problem: Collector queue full
- Action: scale replicas, increase memory_limiter, reduce exporters

Problem: No exemplars in dashboards
- Action: ensure exemplar storage; propagate trace_id in histograms

46) Span Links, Batch Jobs, and Retries

- Span links: relate independent spans (e.g., retries, fan-out jobs)
- Batch processing: one parent producing many child tasks; link child spans back to original trigger
- Idempotency: tag spans with idempotency key to group retries
// Assumes: import { trace, ROOT_CONTEXT } from '@opentelemetry/api'
const parent = tracer.startSpan('batch.process')
const linkCtx = trace.setSpanContext(ROOT_CONTEXT, parent.spanContext())
const child = tracer.startSpan('worker.handle', { links: [{ context: parent.spanContext() }] }, linkCtx)

47) Data Quality and Schema Evolution

- Stable span names; avoid dynamic values in names
- Attribute schemas versioned; deprecate with processors.transform
- Metrics views: freeze bucket boundaries across services
- Logging schemas: timestamp, severity, body, attributes; include trace_id

48) Collector Scaling and HA Patterns

- Per-node agents → regional gateways → multi-region aggregation
- HPA on queue length and CPU; surge upgrades for zero downtime
- Persistent queues for exporters (file_storage) to survive restarts
exporters:
  otlp/tempo:
    endpoint: http://tempo:4317
    tls: { insecure: true }
    retry_on_failure: { enabled: true }
    sending_queue: { enabled: true, num_consumers: 8, queue_size: 5000, storage: file_storage }
extensions:
  file_storage: { directory: /var/lib/otel-collector/queue }
service:
  extensions: [file_storage]

49) Backpressure and Retry Tuning

exporters:
  prometheusremotewrite:
    endpoint: http://mimir/api/v1/push
    external_labels: { env: prod }
    resource_to_telemetry_conversion: { enabled: true }
    retry_on_failure: { enabled: true, initial_interval: 1s, max_interval: 30s, max_elapsed_time: 300s }
    sending_queue: { enabled: true, num_consumers: 4, queue_size: 10000 }

50) Security Hardening for Collector

- mTLS between agents and gateway; client cert rotation
- RBAC and network policies; isolate from internet
- Secret management via mounted files or CSI; no secrets in configs

51) Correlating Logs ↔ Traces ↔ Metrics

- Ensure trace_id in logs; enable exemplars on histograms with trace_id
- Build panels that jump from p95 to example trace
- Use log queries filtered by trace_id from selected traces
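
A sketch of stamping the active trace context onto structured log lines (the logger object, e.g. pino, is assumed):

import { context, trace } from '@opentelemetry/api'

const spanContext = trace.getSpan(context.active())?.spanContext()
logger.info({
  trace_id: spanContext?.traceId,
  span_id: spanContext?.spanId,
  msg: 'charge created',
})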

52) Prometheus Remote Write Nuances

- Prometheus remote write expects cumulative temporality; convert delta streams before export (e.g., with a delta-to-cumulative processor)
- Beware staleness markers on instance restarts; use scrape health panels
- Align histogram buckets across services to combine correctly

53) DORA and Engineering Metrics via OTel

- Lead Time for Changes: derive from deploy events
- Deployment Frequency: counter per service/environment
- Change Failure Rate: SLO breach or rollback count / total deploys
- MTTR: incident open → resolved from alert acknowledgments
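
A sketch of emitting a deployment-frequency counter from a deploy hook (metric and attribute names are assumptions; meterProvider comes from section 4):

const meter = meterProvider.getMeter('deploy-hooks')
const deployCounter = meter.createCounter('deployments.count')
deployCounter.add(1, {
  'service.name': 'payments-api',
  'deployment.environment': 'prod',
  'service.version': '2.3.1',
})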

54) Business KPIs with OTel Metrics

- orders.created.count, payment.success.rate, signup.latency.p95
- Tag with tenant, region, channel; avoid personally identifiable data

55) eBPF Host Metrics + OTel

- Integrate node exporter / eBPF agents; scrape via Collector
- Correlate CPU steal, disk latency with p95 spikes

56) On-Call Playbooks (Deep)

Missing Traces for Specific Route
- Check instrumentation: auto vs manual; headers propagated?
- Collector tail sampler thresholds too strict? relax temporarily
- Backend ingestion health; exporter retries/queue backlog

Cardinality Explosion Detected
- Identify top labels; drop via views/transform
- Replace user/session IDs with hashed or remove entirely
- Re-deploy libraries with conservative defaults

Slow Query in Dashboards
- Switch from raw to rollup; ensure recording rules
- Use exemplars to jump to trace and find bottleneck

57) Recording Rules and Rollups

groups:
  - name: api-latency
    interval: 1m
    rules:
      - record: job:http_server_duration_seconds_p95
        expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, job))

58) Canned Dashboards (JSON Excerpts)

{
  "title": "Service Map",
  "panels": [
    {"type":"nodeGraph","title":"Dependencies","options":{"show":true}}
  ]
}

59) Multi-Region Failover Drills

- Simulate region failure: route exporters to secondary
- Validate queue drain and data integrity on recovery
- Compare SLOs and data completeness across regions

60) SLA/Contract Reporting

- Provide monthly SLO reports with error budget usage
- Include outage timelines, root cause, and corrective actions

61) Common Pitfalls and Anti-Patterns

- Dynamic span names and high-cardinality labels
- Logging entire payloads; leaking secrets in attributes
- Over-sampling in dev but under-sampling in prod

Mega FAQ (701–1100)

  1. How to enforce resource attributes across services?
    Collector transform processor upserts; test in CI.

  2. Is baggage safe for PII?
    Avoid PII; keep light context like tenant or plan.

  3. Should I sample errors at 100%?
    Often yes; cap if volume is extreme.

  4. What ratio for head sampling?
    Start 1–10%; adjust based on cost and utility.

  5. Can we enrich spans with DB explain plans?
    Yes sparingly; hide in debug flags; avoid in prod by default.

  6. Log volumes too high?
    Reduce log level; route business logs to analytics separately.

  7. Should I trace health endpoints?
    Drop via filter; wasteful.

  8. How to detect missing exemplars?
    Dashboard panel with exemplar presence rate.

  9. Alert on exporter failures?
    Yes; queue depth and retry counts.

  10. Versioning instrumentation libraries?
    Pin and roll out gradually; note breaking changes.

  11. Do I need OTel for cron jobs?
    Yes—trace job start/end, success/failure.

  12. Compress OTLP?
    Enable gzip where supported.

  13. Data retention best practice?
    Metrics long-term at rollups; traces short-term high fidelity.

  14. Alert duplicate suppression?
    Configure Alertmanager grouping and inhibition.

  15. Who owns dashboards?
    Service teams own service dashboards; platform maintains templates.

  16. Golden dashboard tests?
    Load synthetic data; validate panels and alerts.

  17. Security scans on Collector image?
    Yes; treat as critical path.

  18. Multi-tenant noise isolation?
    Per-tenant pipelines and quotas.

  19. Track deploy impact automatically?
    Annotate metrics with release version; timeline panels.

  20. Can we push metrics from browser?
    Limited; prefer tracing + backend metrics.

  21. OpenMetrics vs OTLP for metrics?
    Both valid; choose by ecosystem and consolidation goals.

  22. Are exemplars expensive?
    Minimal overhead; store limited exemplars per bucket.

  23. Correlate incidents with changes?
    Use release annotations; overlay with SLO charts.

  24. How to run Collector on edge?
    Lightweight config; forward to gateway; persistent queues.

  25. Avoid duplicate spans across proxies?
    Disable double instrumentation; check proxies adding spans.

  26. Detect sampling bias?
    Compare sampled vs total rates; tune policies.

  27. Anonymize IPs in logs?
    Hash or truncate; comply with privacy.

  28. exporter out-of-order errors?
    Ensure monotonic histograms; reset counters properly.

  29. Reduce cardinality in RED panels?
    Aggregate by route template; avoid query params.

  30. Final: measure, test, and prune relentlessly.


62) Advanced Language Instrumentation Recipes

62.1 Node.js Manual Spans and Attributes

// Assumes: const opentelemetry = require('@opentelemetry/api'); SpanStatusCode and hash() come from the surrounding module
const tracer = opentelemetry.trace.getTracer('payments')
await tracer.startActiveSpan('charge.create', async (span) => {
  try {
    span.setAttribute('payment.method', 'card')
    span.setAttribute('tenant.id', hash(tenantId))
    const res = await charges.create(payload)
    span.setStatus({ code: SpanStatusCode.OK })
    return res
  } catch (e) {
    span.recordException(e as Error)
    span.setStatus({ code: SpanStatusCode.ERROR })
    throw e
  } finally { span.end() }
})

62.2 Go Context Propagation (HTTP → Kafka)

// Assumes go.opentelemetry.io/otel/propagation and segmentio/kafka-go
ctx, span := tracer.Start(ctx, "publish.order")
defer span.End()
carrier := propagation.MapCarrier{}
propagation.TraceContext{}.Inject(ctx, carrier)
headers := make([]kafka.Header, 0, len(carrier))
for k, v := range carrier {
  headers = append(headers, kafka.Header{Key: k, Value: []byte(v)})
}
producer.WriteMessages(ctx, kafka.Message{Key: []byte(orderID), Headers: headers, Value: payload})

62.3 Python Async Tasks (Celery/Arq)

ctx = baggage.set_baggage("tenant", tenant)
with tracer.start_as_current_span("job.process", context=ctx) as span:
    span.set_attribute("job.type", job_type)
    do_work()

62.4 Java Custom Attributes (Spring)

Span span = Span.current();
span.setAttribute("user.plan", user.getPlan());

62.5 .NET Enrich Handlers

.AddAspNetCoreInstrumentation(o => {
  o.EnrichWithHttpRequest = (activity, request) => activity.SetTag("http.request_id", request.Headers["X-Request-ID"].ToString());
})

63) Exporters Matrix and Tips

- OTLP gRPC: high performance, binary
- OTLP HTTP: firewall-friendly
- Prometheus: pull-based metrics; use remote write for long-term storage
- Logs: OTLP → Loki/Elastic; ensure trace_id in payload

64) Sampling Strategies Compared

- Head random: simple, uniform view
- Tail policy-based: keep errors/slow; lower cost
- Dynamic adaptive: adjust to targets (keep 100% errors, 10% normal)
- Hybrid: head 10% + tail errors/slow (an SDK head-sampling sketch follows the config below)
processors:
  tail_sampling:
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow
        type: latency
        latency: { threshold_ms: 400 }
      - name: keep-important-routes
        type: string_attribute
        string_attribute: { key: http.route, values: ["/checkout","/login"], enabled_regex_matching: false }
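
For the hybrid strategy, a sketch of 10% head sampling in the Node SDK (combine with the Collector tail policies above):

import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

const sampler = new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) })
// pass `sampler` to the NodeSDK / TracerProvider configuration from section 3.1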

65) Logs Pipelines with Redaction

processors:
  attributes/logs-redact:
    actions:
      - key: http.request.header.cookie
        action: delete
      - key: user.email
        action: hash
exporters: { otlphttp/logs: { endpoint: http://loki:4318 } }

66) Governance and Ownership

- Owners: each service owns its dashboards/alerts; platform owns shared collectors and templates
- Change policy: dashboards and alerts reviewed in PRs with code owners
- Weekly ops: SLO review, error budget status, toil tracking

67) Recording Rules Library

- record: service:http_requests:rate1m
  expr: sum(rate(http_server_requests_total[1m])) by (service)
- record: service:http_errors:rate5m
  expr: sum(rate(http_server_errors_total[5m])) by (service)
- record: service:http_p95
  expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))

68) Example SLO Documents (Templates)

Service: Checkout API
SLIs: Availability (1 - 5xx/total), p95 latency < 250ms
SLOs: 99.9% monthly, p95 < 250ms
Error Budget: 43.2m/mo
Policies: fast/slow burn, freeze on 2x over 1h

69) Synthetic Journeys

- Login → Search → Add to Cart → Checkout
- Tag spans synthetic=true; exclude from user-facing metrics
- Alert if synthetic fails 3 times consecutively

70) Cost Guardrails

- Max metrics series per service: 50k
- Max attributes per span: 32 (drop overage)
- Alert on cardinality spikes; gate deploys if threshold exceeded

71) Data Retention Tiers

- Traces: 3d full, 14d sampled, 90d summaries
- Metrics: 15m raw, 1h rollup (30d), 6h rollup (1y)
- Logs: 7d hot, 90d warm, 1y archive

72) Common Dashboard Panels (Catalog)

- API RED
- Dependency latency (DB/Cache/External)
- Error taxonomy (client/server/dependency)
- Saturation (CPU/mem/threads)
- Release markers and regression panels

73) Example Policy: Attribute Dropper

processors:
  attributes/drop-high-card:
    actions:
      - key: http.user_agent
        action: delete
      - key: user.id
        action: delete

74) Resilience Tests

- Kill collector pod: exporters queue and recover
- Block backend for 2 minutes: retry queue holds; no data loss
- Spike traffic 5x: HPA scales gateway; no alert floods

75) Edge/IoT Notes

- Lightweight collectors; batch and forward when online
- Use exponential backoff; local ring buffers

Mega FAQ (1101–1500)

  1. Should I correlate CI/CD events?
    Yes—annotate dashboards and traces with release version.

  2. Best way to detect regressions?
    Golden traces, synthetic checks, and SLO burn alerts.

  3. How to quantify noise?
    Track pages/tickets per week and per team; reduce by policy.

  4. When to shard collectors?
    At CPU/memory or queue saturation; shard by service/tenant.

  5. Onboarding new service?
    Templates for instrumentation, dashboard, alerts, and SLO doc.

  6. Does OTLP need TLS?
    Yes in prod; mTLS recommended.

  7. Attribute limits?
    Enforce via processors; reject oversized payloads.

  8. DB connection pools as SLI?
    Yes—expose pool metrics; alert on saturation.

  9. Frontend LCP/CLS?
    Export RUM metrics; correlate with backend p95.

  10. Log sampling?
    Sample non-error logs; keep error logs at higher rate.

  11. Can we aggregate logs into span events?
    For critical paths; reduce log volume.

  12. TraceId collisions?
    Negligible with proper libraries; don’t roll your own.

  13. Infra-only spans?
    Avoid; focus on app/business spans.

  14. Prometheus remote write retries?
    Tune backoff and queue; watch body size limits.

  15. Validate buckets?
    Compare p95/p99 error; align across services.

  16. Can I push metrics from lambdas?
    Yes via OTLP; batch and avoid cold-start overhead.

  17. Detect N+1 queries?
    DB spans clustered by route; alert on spikes.

  18. How to roll dashboards?
    Versioned JSON; validate in CI; promote.

  19. Alert ownership?
    Service team on-call; platform for shared infra.

  20. SLO debt?
    Track error budget burn and backlog; pause features when over budget.

  21. Secure collectors?
    RBAC, network policies, mTLS, and no public ingress.

  22. Should I drop IP addresses?
    Yes in logs where privacy laws apply; hash or truncate.

  23. Combine OTLP and vendor agents?
    Prefer OTLP everywhere; bridge if needed.

  24. Is logging necessary if tracing exists?
    Yes for details and compliance; keep structured and lean.

  25. What about GraphQL?
    Span per resolver or operation; label fields carefully.

  26. Snowball costs?
    Watch cardinality and duplicate instrumentation.

  27. Business SLOs vs tech SLOs?
    Track both; business SLOs reflect user outcomes.

  28. Post-incident improvements?
    Add panels, alerts, and runbook steps; validate fixes.

  29. Should I trace streaming?
    Yes—use messaging conventions and span links.

  30. Final: observability is a product—own it.


76) Message-Driven and Evented Systems

- Propagate context through headers: traceparent, tracestate, baggage
- Use span links when processing batches or retries
- Model consumer spans with messaging.operation=process; publisher spans with publish
processors:
  transform/messaging:
    trace_statements:
      - context: span
        statements:
          - set(attributes["messaging.system"], "kafka") where attributes["messaging.system"] == nil
          - set(attributes["messaging.operation"], "process") where attributes["messaging.operation"] == nil

77) gRPC, GraphQL, and Streaming

- gRPC: use interceptors; name spans by method; include status (Node registration sketch below)
- GraphQL: span per operation; avoid field-level high cardinality
- Streaming: long-lived spans with events or chunked child spans
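
For the gRPC bullet above, a Node registration sketch (assumes the NodeSDK setup from section 3.1; package: @opentelemetry/instrumentation-grpc):

import { NodeSDK } from '@opentelemetry/sdk-node'
import { GrpcInstrumentation } from '@opentelemetry/instrumentation-grpc'

const sdk = new NodeSDK({
  instrumentations: [new GrpcInstrumentation()],
  // trace exporter and resource as in section 3.1
})
sdk.start()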

78) Recording Rules for SLO Reports

groups:
  - name: slo
    rules:
      - record: service:sli_availability:ratio
        expr: 1 - (sum(rate(http_server_errors_total[5m])) by (service)/sum(rate(http_server_requests_total[5m])) by (service))
      - record: service:sli_latency_p95
        expr: histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket[5m])) by (le, service))

79) Data Contracts for Observability

- Required: service.name, deployment.environment, http.method, http.route, status
- Prohibited: PII (email, phone, exact IP where restricted)
- Stability: span names must be stable across releases

80) Conformance Tests in CI

- name: conformance
  run: |
    otel-lint --config .otel-lint.yaml src/**
    promtool check rules rules/*.yaml
    jq . dashboards/*.json > /dev/null

81) Collector Config Patterns (Per-Tenant)

connectors: { }
processors:
  attributes/tenant_a:
    actions: [ { key: tenant, value: A, action: upsert } ]
  attributes/tenant_b:
    actions: [ { key: tenant, value: B, action: upsert } ]
service:
  pipelines:
    traces/tenant_a: { receivers: [otlp], processors: [attributes/tenant_a, batch], exporters: [otlp/tempo] }
    traces/tenant_b: { receivers: [otlp], processors: [attributes/tenant_b, batch], exporters: [otlp/tempo] }

82) Dashboards: Drill-Down Workflows

- Start at RED; jump to route panel → exemplar → trace
- From trace, pivot to logs via trace_id filter
- From logs, identify error class and link to runbooks

83) Incident Review Template

- Timeline with trace screenshots and SLO charts
- Root cause with spans and dependencies
- Fixes: code, infra, and alert tuning
- Follow-ups: tests, dashboards, and docs

84) Privacy Impact in Observability

- Data minimization: only metadata needed for operations
- User controls: opt-out for RUM; anonymized IPs
- Audit: evidence of redaction and access control reviews

85) Cost Guardrail Policies in Pipelines

processors:
  filter/drop-health:
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'
  transform/drop-noisy:
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "http.user_agent")

86) Multi-Cloud Export Patterns

# Endpoints below are illustrative; use each provider's documented OTLP ingest endpoint and auth scheme
exporters:
  otlp/aws: { endpoint: https://otlp.amp.aws, headers: { Authorization: ${AWS_TOKEN} } }
  otlp/azure: { endpoint: https://otlp.monitor.azure.com, headers: { Authorization: ${AZ_TOKEN} } }
  otlp/gcp: { endpoint: https://otlp.googleapis.com, headers: { Authorization: ${GCP_TOKEN} } }

87) Blue/Green Release Validation Panels

{
  "title": "Blue vs Green",
  "panels": [
    {"type":"timeseries","title":"p95 Blue","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='blue'}[5m])) by (le))"}]},
    {"type":"timeseries","title":"p95 Green","targets":[{"expr":"histogram_quantile(0.95, sum(rate(http_server_duration_seconds_bucket{color='green'}[5m])) by (le))"}]}
  ]
}

88) Golden Path Templates

- New Service Template: instrumentation boilerplate + dashboards + alerts
- SLO Template: SLIs, targets, burn alerts, ownership
- Collector Template: agent + gateway, tail sampling, exporters

89) Security Benchmarks for OTel

- No open collectors to the internet
- mTLS between all exporters and gateways
- Periodic scans and SBOM for collector images

90) Future: OTel Profiles/Continuous Profiling

- Integrate CPU/heap profiles with traces for deep RCA
- Correlate profile samples to spans via exemplars

Mega FAQ (1501–1700)

  1. How many buckets are too many?
    If query times suffer or memory spikes; start with 10–20, align across services.

  2. Should I export 100% of logs?
    No—sample and structure; keep error logs, reduce info noise.

  3. How do I keep trace names stable?
    Template: VERB + route; avoid IDs; enforce via linting.

  4. Are span events better than logs?
    For critical path details—yes; still keep structured logs for breadth.

  5. What SLOs for control-plane?
    Collector uptime 99.9%, exporter failure rate < 0.5%.

  6. Can dashboards be versioned?
    Yes—JSON in repo; review via PRs; promote via GitOps.

  7. How to cut costs fast?
    Drop health/check spans, reduce labels, tail sample aggressively.

  8. P99 vs p95?
    Start with p95 for stability; add p99 for critical flows.

  9. Should I sample frontend traces?
    Yes; e.g., 5–10% with bias for errors.

  10. Drop big payloads?
    Avoid logging payloads; redact and summarize.

  11. Exporter TLS errors?
    Check certs/CA, time skew, and SNI.

  12. Multi-tenant query controls?
    Label-based isolation and query guards.

  13. Are histograms better than summaries?
    Yes—mergeable across instances; exemplars-friendly.

  14. Tracing cron jobs?
    Yes—trace run + child tasks; alert on failures.

  15. How to detect duplicate spans?
    Dedup in analysis; fix double instrumentation (proxy + lib).

  16. Managed vendor vs self-host?
    Consider staff/time; self-host for control, managed for speed.

  17. Should I expose dashboards publicly?
    No—protect with SSO; export reports when needed.

  18. Store PII in logs?
    Avoid; use privacy engineering and redaction.

  19. Link traces to support tickets?
    Store trace_id in ticket metadata.

  20. Final advice: own observability as a product.


91) Minimal Adoption Playbook

- Week 1: instrument 1 critical service (traces+metrics)
- Week 2: dashboards and SLOs; tail sampling
- Week 3: logs correlation; alerts; runbooks
- Week 4: template and scale to next 5 services

Micro FAQ (1701–1740)

  1. Lock down OTLP endpoints?
    Allow only internal networks; mTLS required.

  2. Merge multi-lang services?
    OTLP normalizes; enforce semantic conventions.

  3. Spike in unknown routes?
    Missing route templates; fix instrumentation.

  4. Exporter backpressure signals?
    Queue depth, retry rate, and exporter errors.

  5. Per-tenant error budgets?
    Attributes + grouping; dashboards and Alertmanager routes.

  6. Does OTel replace APM?
    OTel is the standard; many APMs ingest OTLP.

  7. Keep span events size small?
    Yes; summarize; avoid large arrays.

  8. Blue/green SLO gates?
    Block switch if SLO deltas exceed threshold.

  9. Collector on windows?
    Supported; align with service configs.

  10. Final: iterate, measure, improve.


Micro FAQ (1741–1760)

  1. Alert dedup across regions?
    Use group labels and inhibit duplicates.

  2. Trace context in Kafka headers?
    Yes—traceparent, tracestate; baggage optional.

  3. Track deploy impact panels?
    Release timeline + p95/error overlays.

  4. Enforce route templates?
    Lint and CI tests; deny deploys on violations.

  5. Validate storage health?
    Backend write/read SLOs and error panels.

  6. Merge service graphs across teams?
    Use resource attributes and filters per team.

  7. Cost showback by signal?
    Ingest bytes and series per team; dashboards.

  8. Isolate collector crashes?
    Separate deployments per zone; circuit breakers.

  9. OTel and feature flags?
    Tag spans with flag state; analyze impact.

  10. Done.


92) Maturity Roadmap

- Start: auto-instrument one service; add dashboard and SLO
- Grow: tail sampling, logs correlation, team ownership
- Mature: multi-region, cost guardrails, privacy program

Micro FAQ (1761–1800)

  1. Per-route SLO exceptions?
    Yes for non-critical routes; document and monitor.

  2. Detect leakage of PII?
    Scan attributes/logs; block keys; alert.

  3. Inventory of metrics?
    Generate from metadata; cleanup unused series.

  4. Can I export to multiple backends?
    Yes; multi-exporters per pipeline.

  5. Collector config drift?
    GitOps; diff and alert on drift.

  6. Alert audit?
    Track acknowledges and response times.

  7. Mimir/Thanos retention tiers?
    Configure downsampling and object storage.

  8. Span enrichment from env?
    resourcedetection processor; env/system/k8s detectors.

  9. High churn in series?
    Avoid dynamic labels; use recording rules.

  10. Final: instrument, correlate, and iterate.


Micro FAQ (1801–1810)

  1. Keep rollout safety?
    Gate by SLOs and error budget burn.

  2. Sampling metrics too?
    Prefer aggregation; avoid random sampling metrics.

  3. Alert descriptions?
    Include runbook links and dashboards.

  4. Policies as code for alerts?
    Yes, store alerts JSON/YAML in repo.

  5. Done.

Related posts

  • Kubernetes Cost Optimization: FinOps Strategies (2025)
  • API Security: OWASP Top 10 Prevention Guide (2025)
  • GitOps: ArgoCD/Flux Deployment Strategies (2025)