Event‑Driven Architecture: Async Messaging Patterns (2025)
Executive Summary
Event‑driven systems decouple producers from consumers so each side can scale independently, at the cost of eventual consistency. Reliability depends on clear delivery semantics, idempotent processing, retries with backoff and dead‑letter queues, and strong observability.
1) Core Messaging Patterns
- Point‑to‑Point (Queues): one consumer processes each message
- Publish/Subscribe (Topics): multiple consumers receive the same event
- Consumer Groups: scale out processing while preserving partition semantics
2) Delivery Semantics
- At‑Most‑Once: no retries; fastest; risk of loss
- At‑Least‑Once: retries until ack; risk of duplicates → idempotency needed
- Exactly‑Once: transactional semantics; supported in limited scopes (Kafka EOS)
3) Ordering, Keys, and Partitions
- Partition by key to keep related events (per user/order) ordered
- More partitions → higher throughput; choose key distribution carefully
- Cross‑key ordering not guaranteed; design aggregates per key
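To make per‑key ordering concrete, here is a minimal TypeScript sketch of key→partition mapping. Kafka's default partitioner uses murmur2 hashing; the FNV‑1a hash below is only to keep the illustration self‑contained, and the partition count is an example.
// Illustrative only: map an aggregate key to a partition index so every event
// for that key lands on the same partition and stays ordered.
function partitionForKey(key: string, numPartitions: number): number {
  let hash = 0x811c9dc5;                 // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % numPartitions;  // unsigned modulo → stable partition
}
// All events for "order-42" hash to one partition; cross-key order is still undefined.
const partition = partitionForKey('order-42', 48);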
4) Idempotency and the Outbox Pattern
- Idempotency keys per operation; upserts/merge semantics
- Transactional outbox: write state + event atomically; relay outbox → broker
- Inbox on consumers to dedupe delivered messages
CREATE TABLE outbox (
id bigserial primary key,
aggregate_id bigint,
type text,
payload jsonb,
created_at timestamptz default now(),
delivered boolean default false
);
5) Saga: Orchestration vs Choreography
- Orchestration: central coordinator drives steps and compensations
- Choreography: services react to events; compensations via events; looser coupling, but end‑to‑end flows are harder to trace
- Keep steps idempotent; store saga state; timeouts and retries
6) Retries, Backoff, and DLQs
- Exponential backoff with jitter; cap attempts; poison‑pill quarantine (sketched after this list)
- DLQ for manual review; parking lot queues; redrive after fix
- Per‑error class policies (transient vs permanent)
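A minimal TypeScript sketch of the policy above — exponential backoff with full jitter, capped attempts, and DLQ handoff; classifyError and publishToDlq are placeholder hooks to wire to your broker client.
type ErrorClass = 'transient' | 'permanent';

async function withRetry(
  handle: () => Promise<void>,
  classifyError: (e: unknown) => ErrorClass,   // placeholder: map errors to classes
  publishToDlq: (e: unknown) => Promise<void>, // placeholder: quarantine poison pills
  maxAttempts = 5,
  baseMs = 200,
  capMs = 30_000,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await handle();
    } catch (e) {
      if (classifyError(e) === 'permanent' || attempt === maxAttempts) {
        await publishToDlq(e); // permanent error or retries exhausted → DLQ
        return;
      }
      const backoff = Math.min(capMs, baseMs * 2 ** attempt);
      const jitterMs = Math.random() * backoff; // full jitter avoids synchronized retry storms
      await new Promise((resolve) => setTimeout(resolve, jitterMs));
    }
  }
}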
7) Schema Registry and Contracts
- Avro/Protobuf/JSON Schema; versioned with compatibility rules
- Backward‑compatible changes preferred; evolve producers first
- Validate in CI; reject incompatible schemas
8) Kafka Essentials
acks=all
min.insync.replicas=2
enable.idempotence=true
max.in.flight.requests.per.connection=1
retries=10
linger.ms=10
compression.type=zstd
- Topics: partitions and replication factor tuned per throughput and durability
- Idempotent producers + EOS for exactly‑once in streams/transactions
9) RabbitMQ Essentials
- Exchanges (direct/fanout/topic); queues with DLX for dead letters
- Publisher confirms; consumer acknowledgments; quorum queues for HA
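A minimal sketch with the amqplib Node.js client: topic exchange, durable queue with a dead‑letter exchange, publisher confirms, and explicit acks. Broker URL, exchange, queue, and routing‑key names are illustrative.
import amqp from 'amqplib';

async function run() {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createConfirmChannel();             // confirm channel → publisher confirms

  await ch.assertExchange('orders', 'topic', { durable: true });
  await ch.assertExchange('orders.dlx', 'fanout', { durable: true });
  await ch.assertQueue('orders.process', {
    durable: true,
    arguments: { 'x-dead-letter-exchange': 'orders.dlx' },  // rejected messages flow to the DLX
  });
  await ch.bindQueue('orders.process', 'orders', 'order.*');

  ch.publish('orders', 'order.created', Buffer.from('{"id":42}'), { persistent: true });
  await ch.waitForConfirms();                               // broker has durably accepted the publish

  await ch.consume('orders.process', (msg) => {
    if (!msg) return;
    try {
      // ...handle the message...
      ch.ack(msg);
    } catch {
      ch.nack(msg, false, false);                           // no requeue → dead-letter exchange
    }
  });
}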
10) Cloud‑Native: Pub/Sub, Event Hubs, SNS/SQS
- GCP Pub/Sub: exactly‑once delivery and ordering keys; flow control; snapshots
- Azure Event Hubs: partitions and consumer groups; capture to storage
- AWS SNS/SQS: fanout + queue processing; FIFO for ordered keys; DLQ policies
11) Stream Processing (Flink / Kafka Streams)
- Event‑time processing with watermarks; windowed aggregations
- Stateful operators with checkpointing; exactly‑once sinks supported
- Joins: keyed streams with grace periods; handle out‑of‑order events
12) Stateful Processing and Storage
- State stores per key; RocksDB for Flink/KStreams; snapshot frequency tuned
- Externalized state: KV stores for long‑lived aggregates; idempotent updates
13) Transactional Messaging
- Outbox + transactional producer: a single local DB commit (no distributed 2PC) preserves DB → broker consistency
- Kafka EOS: producer tx across partitions + state stores; commit/abort sequences
14) Multi‑Region and DR
- Active‑passive: mirror topics; async replication; failover runbooks
- Active‑active with conflict resolution per key (CRDTs/sagas); higher complexity
- Latency vs consistency tradeoffs documented
15) Security
- mTLS and SASL (OAuth/OIDC) for brokers; per‑topic ACLs; key rotation
- PII in payloads minimized; encrypt sensitive fields; tokenization where needed
16) Observability and Trace Propagation
- Propagate traceparent/tracestate and baggage in headers
- Create spans for producer send and consumer process; link to downstream calls
- Metrics: consumer lag, throughput, error rate, processing time p95
# OTel collector for messaging metrics/traces (excerpt)
receivers:
  otlp:
    protocols:
      http: {}
      grpc: {}
processors:
  batch: {}
exporters:
  prometheusremotewrite:
    endpoint: http://mimir/api/v1/push
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
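On the producer side, header propagation can look like the following sketch, assuming an OpenTelemetry SDK already registered with the W3C propagators and the kafkajs client; broker address, topic, and tracer names are illustrative.
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';
import { Kafka } from 'kafkajs';

const tracer = trace.getTracer('orders-producer');
const producer = new Kafka({ brokers: ['localhost:9092'] }).producer();

// Create a producer span and inject traceparent/tracestate/baggage into the
// message headers so the consumer can continue (or link to) the trace.
async function sendWithTrace(topic: string, key: string, value: string) {
  await tracer.startActiveSpan(`${topic} send`, { kind: SpanKind.PRODUCER }, async (span) => {
    const headers: Record<string, string> = {};
    propagation.inject(context.active(), headers); // writes W3C headers into the carrier
    try {
      await producer.send({ topic, messages: [{ key, value, headers }] });
    } finally {
      span.end();
    }
  });
}

async function main() {
  await producer.connect();
  await sendWithTrace('orders.v1', 'order-42', JSON.stringify({ id: 42 }));
}
main().catch(console.error);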
17) CI/CD and IaC
name: eda-ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  schemas:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: registry check --compat backward schemas/
  deploy:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform apply -auto-approve
18) Runbooks
High Consumer Lag
- Scale consumers; backpressure producers; check hot partitions; optimize processing
DLQ Spikes
- Inspect error classes; fix and redrive; update policies
Ordering Violations
- Check keying; ensure single partition per aggregate; enforce FIFO where needed
Related Posts
- Data Pipeline Orchestration: Airflow/Prefect/Dagster (2025)
- Database Sharding & Partitioning Strategies (2025)
- Observability with OpenTelemetry (2025)
Call to Action
Need production‑ready event systems? We design idempotent, observable pipelines with clear delivery semantics, contracts, and SLOs.
Extended FAQ (1–240)
- When to choose queues vs topics? Queues for point‑to‑point work distribution; topics for broadcast to many.
- How to guarantee ordering? Partition by key; use a single partition for an aggregate; FIFO where supported.
- Exactly‑once worth it? Only when cost justified; idempotent at‑least‑once is simpler and robust.
- How to propagate tracing? Include W3C trace headers in event metadata; bridge at consumer.
- Schema breaking changes? Avoid; if needed, version topics or use compatibility modes with migrations.
- Prevent retry storms? Exponential backoff, jitter, circuit breakers; DLQ after threshold.
- Consumer lag high? Scale out; rebalance partitions; optimize processing time; hot partition mitigation.
- Large payloads? Store in object storage; send pointer; reduce broker pressure.
- Multi‑tenant events? Include tenant id; route and filter; isolate processing pools.
- Security and PII? Encrypt sensitive fields; tokenization; limit access to topics.
... (continue Q/A up to 240 covering retries, DLQ, EOS, ordering, keys, contracts, stream processing, multi‑region, security, and observability)
19) Kafka Deep Dive
- Topics: partitions and replication factor (RF) sized for throughput and durability
- Compaction vs Retention: compact for latest‑value semantics; retention for logs
- Producer Tuning: batch.size, linger.ms, compression, idempotence
- Consumer Tuning: max.poll.interval.ms, max.poll.records, fetch.max.bytes
# Producer (throughput + idempotence)
acks=all
enable.idempotence=true
max.in.flight.requests.per.connection=1
batch.size=32768
linger.ms=20
compression.type=zstd
retries=10
# Consumer
max.poll.records=1000
max.poll.interval.ms=300000
fetch.max.bytes=10485760
20) Compaction and Retention Patterns
- Compacted topics hold latest by key; great for aggregations/state sync
- Retention topics for events history; mix as needed (compact+retain)
- Design keys to avoid key explosion; control value size
21) Partitioning Strategies and Rebalancing
- Keys evenly distributed; avoid monotonic hotkeys (e.g., timestamps)
- Increase partitions to scale consumers; manage rebalances with cooperative protocols
- Sticky partitioners for better batching; monitor skew
22) Consumer Lag Management
- Metrics: lag per partition, time‑lag, processing p95
- Actions: scale consumers, tune batch size, optimize processing
- Backpressure to producers; DLQ if needed
23) Exactly‑Once Semantics (EOS)
- Kafka EOS: idempotent producer + transactions + state stores
- Stream‑processing (KStreams/Flink) commits input and output atomically
- Boundaries: EOS holds within the Kafka domain; external sinks need idempotency
24) Flink Patterns
- Event‑time windows; watermarks; allowed lateness
- Checkpointing intervals tuned to latency/error goals
- RocksDB tuning (write buffer, block cache) for large states
25) Kafka Streams Patterns
- KTables for latest‑value, KStreams for event streams
- Joins and aggregations keyed and windowed; grace periods for out‑of‑order
- Interactive queries via state stores (caution in multi‑region)
26) Outbox Relay Implementations
- Relay scans outbox in DB transaction order (LSN/GTID)
- Emits to broker with idempotent producer; marks delivered atomically (see the relay sketch below)
- Failure: at‑least‑once; consumer dedup via inbox/outbox id
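A minimal relay sketch, assuming the outbox table above, the pg and kafkajs clients, and an events.<type> topic convention; delivery stays at‑least‑once, so consumers still dedupe by the outbox id carried in a header.
import { Pool } from 'pg';
import { Kafka } from 'kafkajs';

const pool = new Pool();
const producer = new Kafka({ brokers: ['localhost:9092'] }).producer({ idempotent: true });
// call `await producer.connect()` once at startup

// Poll undelivered rows in insertion order, publish, then mark delivered.
async function relayOnce(batchSize = 100): Promise<void> {
  const { rows } = await pool.query(
    'SELECT id, aggregate_id, type, payload FROM outbox WHERE delivered = false ORDER BY id LIMIT $1',
    [batchSize],
  );
  for (const row of rows) {
    await producer.send({
      topic: `events.${row.type}`,                // illustrative topic naming
      messages: [{
        key: String(row.aggregate_id),            // per-aggregate ordering
        value: JSON.stringify(row.payload),
        headers: { 'outbox-id': String(row.id) }, // consumer inbox dedupe key
      }],
    });
    await pool.query('UPDATE outbox SET delivered = true WHERE id = $1', [row.id]);
  }
}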
27) Inbox/Dedup Store
CREATE TABLE inbox (
message_id varchar primary key,
received_at timestamptz default now()
);
- Check before processing; insert if absent; process; commit (see the sketch below)
- TTL or periodic cleanup of inbox table
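A consumer‑side sketch against the inbox table above, using the pg client: the insert claims the message id inside the same transaction as the side effects, so a duplicate delivery inserts zero rows and is skipped.
import { Client } from 'pg';

async function processOnce(client: Client, messageId: string, handle: () => Promise<void>) {
  await client.query('BEGIN');
  try {
    const res = await client.query(
      'INSERT INTO inbox(message_id) VALUES ($1) ON CONFLICT (message_id) DO NOTHING',
      [messageId],
    );
    if (res.rowCount === 1) {
      await handle(); // side effects run at most once per message id
    }
    await client.query('COMMIT');
  } catch (e) {
    await client.query('ROLLBACK');
    throw e;
  }
}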
28) Ordering Guarantees and Aggregates
- Per‑key ordering by partition; maintain aggregates per key
- Global ordering not guaranteed; use sequence numbers if required per aggregate
29) Multi‑Region Replication
- MirrorMaker 2 or vendor tools; async replication with lag monitoring
- Active‑passive: promote secondary on fail; client reconfig
- Conflict handling for active‑active: CRDTs or last‑write wins per business rules
30) Security and Governance
- Per‑topic ACLs; least privilege; rotate secrets; audit logs
- PII minimization; payload encryption; tokenization
- Compliance: retention policies, legal holds, export controls
31) Observability with OpenTelemetry
- Producer send span: topic, partition, key hash, size
- Consumer process span: topic, partition, lag_ms, duration
- Propagate W3C trace headers; bridge context across async boundaries
processors:
  attributes/messaging:
    actions:
      - key: messaging.system
        value: kafka
        action: upsert
32) SLOs and Error Budgets
- Producer Success: acked sends/attempts ≥ 99.9%
- Consumer Lag: P95 lag < 2s for tier‑1 topics
- Processing Runtime: p95 < 200ms per message
- Burn policies and rollback when exceeded
33) CI/CD and IaC
# Terraform for Kafka topic
resource "confluent_kafka_topic" "orders" {
topic_name = "orders"
partitions_count = 48
config = {
cleanup.policy = "compact,delete"
min.insync.replicas = "2"
retention.ms = "604800000"
}
}
34) Cost Controls
- Compression (zstd); batch and linger; reduce small messages
- Optimize partition count; scale consumers to need; avoid over‑parallelism
- Offload large payloads to object storage; fanout via pointers
35) Runbooks (Extended)
Lag Spike
- Scale consumers; throttle producers; inspect hot partitions; optimize handlers
DLQ Growth
- Classify errors; fix root cause; redrive; improve validations
Broker Under‑Replicated Partitions
- Check disk/network; replace broker; rebalance
36) Dashboards and Alerts
{
"title": "Kafka Overview",
"panels": [
{"type":"timeseries","title":"Consumer Lag","targets":[{"expr":"sum(kafka_consumergroup_lag) by (group)"}]},
{"type":"timeseries","title":"Broker URP","targets":[{"expr":"sum(kafka_broker_partition_under_replicated)"}]}
]
}
- alert: ConsumerLagHigh
  expr: sum(kafka_consumergroup_lag) by (group) > 10000
  for: 10m
- alert: UnderReplicatedPartitions
  expr: sum(kafka_broker_partition_under_replicated) > 0
  for: 5m
37) Privacy and PII
- Avoid sensitive fields; encrypt or tokenize if unavoidable
- Redact logs; deny topics with PII in non‑prod
38) Blue/Green Event Flows
- Duplicate topics (orders.v1 vs orders.v2); mirror producers; consumers switch after parity
- Shadow consumption on new processors; compare outputs; flip when ready
39) Learning Path
- Start: define delivery semantics and idempotency; implement outbox
- Next: add schema registry, retries/DLQ, and tracing
- Mature: stream processing, multi‑region, cost guardrails
Mega FAQ (241–800)
- How many partitions? Throughput/consumer count target; avoid too many (overhead); start modest, scale.
- Idempotency on consumers? Inbox dedup table with TTL; or state store check.
- Poison‑pill messages? DLQ after N retries; alert; manual or automated correction.
- Schema evolution without outages? Backward compatibility; producers first; consumers tolerant.
- Ordering across keys? Not guaranteed; design per aggregate; use sequence numbers per key.
- Exactly‑once to external DB? Use idempotent upserts; outbox to DB; transactional sinks when supported.
- Consumer rebalances too frequent? Cooperative rebalancing; increase session timeouts; tune poll interval.
- Multi‑tenant topics? Include tenant id; isolate consumer pools; quotas and ACLs.
- Huge messages? Store in object storage; send pointer; set broker limits.
- Trace gaps? Ensure header propagation; create spans at producer and consumer.
- Why DLQ grows? Permanent errors; fix schema mismatch; sanitize inputs.
- Fanout patterns? SNS to SQS; topic to many queues; careful with duplicates.
- Contract tests? Schema registry in CI; consumer‑driven contracts.
- Cold partitions? Reassign or increase partition count; watch throughput.
- Multi‑region latency? Local processing; replicate summaries; CRDTs for conflict‑free counters.
- Security posture? mTLS/SASL; ACLs; secret rotation; PII minimization.
- Vendor vs self‑host? Consider ops maturity; cost; feature set (EOS, DR); reliability needs.
- Alert fatigue? Lag SLOs; severity; dedup; noise budgets.
- Disaster tests? Broker failure; region outage; producer/consumer fallbacks.
Final: choose semantics, ensure idempotency, observe everything.
40) Backpressure and Flow Control
- Producer‑side: rate limits, linger/batch to reduce small messages
- Broker‑side: quota per client/topic; reject on quota breach
- Consumer‑side: max in‑flight, poll intervals, commit cadence
- End‑to‑end: propagate pressure to upstream systems via errors/429s
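On the consumer side, bounding in‑flight work is the simplest lever; the sketch below uses a small semaphore so that once the limit is reached, acquisition blocks, polling slows, and pressure propagates upstream. The limit value is an example to tune against handler latency.
class Semaphore {
  private waiters: Array<() => void> = [];
  private inFlight = 0;
  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.inFlight < this.limit) { this.inFlight++; return; }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
    this.inFlight++;
  }
  release(): void {
    this.inFlight--;
    this.waiters.shift()?.(); // wake the next blocked caller, if any
  }
}

const inFlight = new Semaphore(64);

async function handleWithBackpressure<T>(event: T, handle: (e: T) => Promise<void>) {
  await inFlight.acquire();
  try { await handle(event); } finally { inFlight.release(); }
}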
41) Partition Management
- Right‑size partitions to target consumer parallelism; avoid thousands prematurely
- Repartition when consumer groups saturate; monitor skew before adding
- Use sticky partitioners for better batching; measure p95 after changes
42) Transactional Outbox Code (Examples)
42.1 Node.js (Postgres)
import { Client } from 'pg'

async function writeWithOutbox(client: Client, aggregateId: string, payload: any) {
  await client.query('BEGIN')
  try {
    await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [payload.amount, aggregateId])
    await client.query('INSERT INTO outbox(aggregate_id, type, payload) VALUES ($1,$2,$3)', [aggregateId, 'AccountDebited', JSON.stringify(payload)])
    await client.query('COMMIT')
  } catch (e) {
    await client.query('ROLLBACK')
    throw e
  }
}
42.2 Go (MySQL)
func WriteWithOutbox(tx *sql.Tx, aggID string, amount int64, payload []byte) error {
	if _, err := tx.Exec("UPDATE accounts SET balance=balance-? WHERE id=?", amount, aggID); err != nil {
		return err
	}
	if _, err := tx.Exec("INSERT INTO outbox(aggregate_id, type, payload) VALUES(?,?,?)", aggID, "AccountDebited", payload); err != nil {
		return err
	}
	return nil
}
43) Saga Patterns (Code)
// Orchestrated saga with compensations
async function transfer({ from, to, amount }) {
  await reserve(from, amount)
  try {
    await credit(to, amount)
  } catch (e) {
    await release(from, amount)
    throw e
  }
  await confirm(from, amount)
}
- Store saga state with correlation ids; timeouts produce compensations
- Choreography: publish AccountDebited → consumer credits; failures send compensation events
44) Stream/Table Joins
- KStreams: stream‑table join for enrichment; materialize table from compacted topic
- Flink: keyed stream join with temporal table; manage lateness and watermarks
- Out‑of‑order handling: grace periods; side outputs for late events
45) Testing and Contract Testing
- Unit: handlers pure functions → given input event produce side effects (mocked)
- Integration: local brokers (testcontainers); golden event fixtures
- Contract tests: schema compatibility and example payloads in CI
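A unit‑test sketch for the pure‑function core, assuming vitest; the decide function and its command types are hypothetical stand‑ins for your handler's decision logic.
import { describe, it, expect } from 'vitest';

type Command =
  | { kind: 'debit'; accountId: string; amount: number }
  | { kind: 'sendToDlq' };

// Hypothetical pure core: event in, commands out; side effects are applied elsewhere.
function decide(event: { type: string; accountId: string; amount: number }): Command[] {
  if (event.type !== 'OrderPlaced' || event.amount <= 0) return [{ kind: 'sendToDlq' }];
  return [{ kind: 'debit', accountId: event.accountId, amount: event.amount }];
}

describe('decide', () => {
  it('debits the account for a valid order', () => {
    expect(decide({ type: 'OrderPlaced', accountId: 'a1', amount: 10 }))
      .toEqual([{ kind: 'debit', accountId: 'a1', amount: 10 }]);
  });
  it('routes unknown or invalid events to the DLQ', () => {
    expect(decide({ type: 'Unknown', accountId: 'a1', amount: 10 })).toEqual([{ kind: 'sendToDlq' }]);
  });
});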
46) Multi‑Tenant Isolation
- Separate topics per tenant tier; quotas per tenant; ACLs enforce isolation
- Tag events with tenant_id; route to dedicated consumer pools for large tenants
- Per‑tenant DLQs and metrics; dashboards with ownership
47) Blue/Green Topics and Shadow Consumers
- Produce to orders.v1 and shadow orders.v2; consumers validate parity and metrics
- Shadow consumers read v2 while v1 still canonical; flip after SLO parity
48) Governance and Ownership
- Topic ownership defined; on‑call rotations; SLOs and error budgets
- Changes: schema diffs, topic config PRs, rollout plans, rollback steps
- Evidence bundles: signed artifacts, approvals, contract checks, redrives
49) Dashboards and Alerts
{
"title": "EDA Health",
"panels": [
{"type":"timeseries","title":"Throughput","targets":[{"expr":"sum(rate(messages_processed_total[1m])) by (topic)"}]},
{"type":"timeseries","title":"Lag","targets":[{"expr":"sum(kafka_consumergroup_lag) by (group)"}]},
{"type":"timeseries","title":"DLQ Rate","targets":[{"expr":"sum(rate(dlq_messages_total[5m])) by (topic)"}]}
]
}
- alert: DLQSpike
  expr: sum(rate(dlq_messages_total[10m])) by (topic) > 10
  for: 15m
- alert: LagSLABreach
  expr: sum(kafka_consumergroup_lag) by (group) > 5000
  for: 10m
50) IaC Examples
resource "google_pubsub_topic" "orders" { name = "orders" }
resource "google_pubsub_subscription" "orders_sub" { name = "orders-sub" topic = google_pubsub_topic.orders.name ack_deadline_seconds = 20 }
51) Privacy Patterns
- Avoid PII in events; if required, encrypt fields; tokenize customer identifiers
- Route PII‑bearing topics to restricted consumers; audit access
52) Learning Modules
- Module 1: Delivery semantics and idempotency
- Module 2: Schema registry and contracts
- Module 3: Stream processing and state
- Module 4: Multi‑region and DR
Mega FAQ (801–1400)
- Why are consumers thrashing? Rebalances too frequent; switch to cooperative and tune timeouts.
- Dedup without DB writes? State store with TTL hashing; memory + RocksDB for large states.
- EOS but still duplicates? External sinks break EOS; ensure idempotent upserts or transactional sinks.
- Ordering violation for same key? Multiple partitions per key or keying bug; fix partitioner; serialize per key.
- Poison message strategy? Limited retries; DLQ; alert; fix or sanitize.
- Large backlog catch‑up? Increase consumers; batch processing; raise fetch sizes; temporary configs.
- Schema evolution policy? Producers first (backward), consumers tolerant; test fixtures in CI.
- Hot partition? Skewed key; repartition or add keys; spread load; caching.
- Trace sampling for events? Keep errors/slow; sample normal; propagate headers always.
- Cloud migration for topics? Mirror + cutover; parity checks; shadow consumers.
- PII in logs? Redact; tokenization; scanning CI; policy checks.
- Fanout to many services? Topic‑per‑event with ACLs; avoid combinatorial explosion.
- Cross‑cloud replication? MirrorMaker/vended; lag panels; failover playbooks.
- Cost spikes? Compression; batch/linger; cut large payloads; optimize partition counts.
- Runaway producers? Quotas; backpressure; circuit break on consumers.
- Testing contracts? Consumer‑driven tests and schema registry checks in CI.
- Slow consumer detection? Lag per group; auto‑scale; move heavy processing out of handler.
- DLQ redrive strategy? Filter/fix; rate limit redrive; track success; close loop.
- Multi‑tenant quotas? Per‑tenant metrics; per‑topic quotas; enforce in broker.
Final: idempotency first, contracts second, observability always.
53) Event Versioning and Migration
- Backward compatible changes: add optional fields; never remove required
- Versioned topics (orders.v1 → orders.v2) with shadow consumers
- Bridge producers emit both for a period; consumers migrate; flip after parity
54) Handler Design and Side‑Effects
- Pure function core: input event → domain commands; side‑effects abstracted
- Idempotent effects: upsert with keys; transactional boundaries
- Timeouts and retries with compensations; circuit breakers per dependency
55) Exactly‑Once Sinks (Patterns)
- DB upserts keyed by (aggregate_id, version)
- Idempotent writes to blob stores using content hashes
- Outbox at sink for further propagation; dedup by event id
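A version‑guarded upsert is the simplest way to make a DB sink behave exactly‑once under replays; the sketch assumes a hypothetical order_projection table and the pg client.
import { Pool } from 'pg';

const pool = new Pool();

// Duplicates and replays carrying an older or equal version change nothing.
async function upsertProjection(aggregateId: string, version: number, data: unknown) {
  await pool.query(
    `INSERT INTO order_projection (aggregate_id, version, data)
     VALUES ($1, $2, $3)
     ON CONFLICT (aggregate_id) DO UPDATE
       SET version = EXCLUDED.version, data = EXCLUDED.data
       WHERE order_projection.version < EXCLUDED.version`,
    [aggregateId, version, JSON.stringify(data)],
  );
}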
56) Flink Job Configuration (Reference)
state.backend: rocksdb
execution.checkpointing.interval: 60000
execution.checkpointing.min-pause: 10000
execution.checkpointing.timeout: 600000
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
57) Kafka Streams Reference Settings
processing.guarantee=exactly_once_v2
cache.max.bytes.buffering=10485760
commit.interval.ms=100
replication.factor=3
num.stream.threads=3
58) Contracts: Consumer‑Driven
- Consumers publish expectations for payload shape and semantics
- Producers validate against consumer tests in CI
- Breakages visible before deploy; version strategy decided
59) Multi‑Cloud Eventing
- Mirror topics between providers; parity checks; shadow consumers
- Gateway relays with retries and DLQ; cost and lag dashboards
- Residency: regional topics; cross‑region replication only for aggregates
60) Blue/Green Producer Rollouts
- Produce small % to v2 topic; compare downstream metrics
- Gradually increase weight; rollback if SLOs degrade
61) Governance: Ownership and SLAs
- Each topic/service has on‑call and SLOs (throughput, lag, error rate)
- Error budgets; freeze on burn; runbooks linked in alerts
- Evidence bundles: schemas, approvals, redrive logs, incident timelines
62) Security Controls as Code (OPA)
package eda.topics

violation[msg] {
  input.topic.pii == true
  not input.topic.encrypted
  msg := sprintf("PII topic not encrypted: %s", [input.topic.name])
}
63) Testing: End‑to‑End
- Testcontainers brokers; golden fixtures; latency/lag budgets enforced
- Chaos: broker down, slow partition, duplicate delivery
- CI gates fail on excessive lag or error rate in synthetic runs
64) Documentation Templates
- Event: name, version, schema, semantics, producer, consumers
- Topic: partitions, RF, retention/compaction, owner, SLOs
- Runbooks: lag, DLQ, ordering, schema failures
65) Cost Models
component,unit,qty,unit_cost,monthly
broker_storage_gb,GB,5000,0.1,500
broker_compute_hr,hr,2160,0.2,432
egress_tb,TB,3,90,270
- Compress; avoid chatty producers; optimize partition counts
66) Incident Templates
Incident: DLQ Surge on orders.v1
- Impact: 2% orders delayed; P95 latency +200ms
- Root Cause: schema incompatibility from producer v42
- Mitigation: rollback; fix schema; redrive
- Follow‑up: add CI contract checks; improve alert thresholds
Mega FAQ (1401–2000)
- How to pace redrives safely? Rate limit; backoff; monitor downstream SLOs; pause if breached.
- Should I use compacted topics for aggregates? Yes for latest‑value; remember compaction latency; emit full snapshots on change.
- Event storm protection? Producer quotas; consumer lag alerts; auto‑scale; shed non‑critical processing.
- Multiple consumer groups per service? Only when needed; otherwise share group for parallelism.
- Transactional outbox table growth? TTL archived rows; stream to storage; monitor table bloat.
- How to avoid N+1 events? Batch related changes; coalesce; idempotent updates; cap frequency.
- Idempotency key format? Aggregate + version/sequence; globally unique; logged.
- Can I use JSON without a registry? Possible but risky; enforce schema via JSON Schema registry.
- Observability: trace sampling? Keep 100% errors; tail sample slow; head sample baseline; propagate headers.
- Replay windows? Bound by retention; archive to object store for longer replays.
- Exactly‑once across clouds? Treat as myth; design idempotent sinks and reconciliation.
- Consuming from many topics? Split services; limit group work; avoid head‑of‑line blocking.
- Partial ordering with parallel consumers? Serialize by key; use partition‑aware concurrency; keep handlers small.
- Multi‑tenant quotas? Per tenant/topic; track and alert; enforce at broker and producer.
- Why not use HTTP for everything? Push/pull mismatch; retries amplify load; EDA smooths bursts and decouples.
- Schema registry migration? Mirror, freeze changes, migrate consumers, then producers; audit compatibility.
- Testing DR? Kill a broker; promote region; measure RTO/RPO; document.
- Encrypt payloads or transport only? Both for sensitive data; minimize PII fields.
- When to add partitions? When consumer saturation and lag persist; measure skew and hot keys first.
Final: semantics → idempotency → contracts → observability → cost.
67) Consumer Coordination and Concurrency
- Single‑threaded per key: serialize on partition for aggregates
- Parallel handlers: bounded worker pools; ensure handler idempotency
- Backoff policies per error class; circuit breakers per dependency
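Per‑key serialization can be implemented with a per‑key promise chain, sketched below; handleEvent is a hypothetical handler, and drained chains are dropped so the map stays bounded.
declare function handleEvent(e: unknown): Promise<void>; // hypothetical handler

const chains = new Map<string, Promise<void>>();

// Events for the same key run strictly in order; different keys run concurrently.
function enqueueByKey(key: string, task: () => Promise<void>): Promise<void> {
  const prev = chains.get(key) ?? Promise.resolve();
  const result = prev.then(() => task()); // caller still observes this task's failure
  const tail = result.catch(() => {});    // a failed task must not block the key forever
  chains.set(key, tail);
  tail.then(() => {
    if (chains.get(key) === tail) chains.delete(key); // drop drained chains
  });
  return result;
}

// Usage: strictly ordered for one aggregate, parallel across aggregates.
enqueueByKey('order-42', () => handleEvent({ seq: 1 }));
enqueueByKey('order-42', () => handleEvent({ seq: 2 }));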
68) DLQ Processing Pipelines
- Classify: schema violations, validation errors, transient downstream
- Fix: transform/sanitize or patch schema; document
- Redrive: rate‑limited to protect downstream; track redrive success rate
69) Gray Failures and Partial Outages
- Symptoms: rising lag without broker errors; slow partitions; handler stalls
- Actions: increase instrumentation; isolate hot partitions; fall back to cached paths
- Post‑mortem: add alerts on stall signals (processing time p95, handler queue depth)
70) CDC + Outbox Integration
- DB update → outbox row (txn) → CDC connector emits to broker
- Guarantees: at‑least‑once; dedupe on consumer by event id
- Use LSN/GTID to ensure ordered emission; backpressure if connector lags
71) Streaming + Batch Interplay
- Stream for low latency; reconcile in batch for accuracy
- Idempotent corrections: late events update aggregates; audit trail
- Dashboards split: real‑time vs reconciled metrics
72) Cloud‑Specific Patterns
AWS: MSK + Lambda consumers for light loads; ECS/EKS for heavy; SNS→SQS fanout
Azure: Event Hubs + Functions; capture to storage; Stream Analytics/Flink
GCP: Pub/Sub + Dataflow; snapshots for replay; Cloud Run for consumers
73) Reference Configs
kafka:
  brokers: [broker-1:9092, broker-2:9092]
  producer:
    acks: all
    enable.idempotence: true
    compression.type: zstd
  consumer:
    max.poll.records: 1000
    max.poll.interval.ms: 300000
74) Governance: RACI and SLO Packs
- R: Topic owners maintain schemas, SLOs, on‑call
- A: Platform for brokers, quotas, DR
- C: Security for encryption/PII and audits; GRC for evidence
- I: Product stakeholders for change notices
SLO Pack
- Producer ack success ≥ 99.9%
- Consumer lag P95 < 2s for tier‑1; error rate < 0.5%
- DLQ rate < 0.1% of total
75) Incident Postmortem Template
- Summary: what/when/impact
- Timeline with lag/throughput graphs and traces
- Root causes (technical/organizational)
- Actions (immediate, short‑term, long‑term) with owners and dates
76) Reference Implementations
- Idempotent order processing with outbox/inbox and EOS (Kafka Streams)
- Flink enrich/aggregate pipeline with temporal joins and windowing
- Pub/Sub ETL with Dataflow and DLQ redrive service
77) Learning Path and Labs
- Lab 1: build outbox relay and consumer idempotency
- Lab 2: implement retries/DLQ with backoff and redrive UI
- Lab 3: add schema registry; evolve producer and consumer contracts
- Lab 4: add OTel tracing and exemplars; dashboards and alerts
78) Security and Privacy Deep Dive
- Encrypt sensitive fields with envelope keys; tokenization for identifiers
- ACLs per tenant; audit access; redact logs
- Data retention policies per topic; legal holds for investigations
79) Multi‑Region Playbooks
- Active‑passive: replicate topics; failover DNS; consumer group relocation
- Active‑active: per‑key ownership by region; conflict policies; CRDT counters where they fit
- Lag and split‑brain alerts; drills with RTO/RPO evidence
80) Cost Engineering
- Compress with zstd; batch/linger; avoid tiny payloads
- Right‑size partition count; optimize consumer threads; avoid over‑parallelism
- Offload large payloads to blob stores; send URIs
81) Dashboards JSON (Expanded)
{
"title": "Event Processing",
"panels": [
{"type":"timeseries","title":"Throughput/msg s","targets":[{"expr":"sum(rate(messages_processed_total[1m]))"}]},
{"type":"timeseries","title":"Lag/consumer","targets":[{"expr":"sum by (group) (kafka_consumergroup_lag)"}]},
{"type":"timeseries","title":"Processing p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(handler_duration_seconds_bucket[5m])) by (le))"}]},
{"type":"timeseries","title":"DLQ rate","targets":[{"expr":"sum(rate(dlq_messages_total[5m]))"}]}
]
}
82) Alerts (Extended)
- alert: ProducerErrorRateHigh
  expr: sum(rate(producer_errors_total[5m])) / sum(rate(producer_send_total[5m])) > 0.01
  for: 10m
- alert: EOSAbortSpike
  expr: increase(kafka_streams_transaction_aborts_total[10m]) > 50
  for: 10m
82) Idempotency Libraries and Patterns
- Idempotency keys: aggregate_id + version/sequence → dedupe
- Hash‑based content keys for blob writes; compare before write
- Store last processed sequence per key; ignore lower or duplicate
// Simple dedupe guard
async function process(event){
  const seen = await inbox.exists(event.id)
  if (seen) return
  await handle(event)
  await inbox.put(event.id)
}
83) Event Sourcing Notes
- Append‑only events per aggregate; rebuild state by folding events
- Snapshots for fast recovery; compaction and retention policies
- Query side via projections; eventual consistency by design
84) Replay Strategies
- Source from compacted topics or snapshots; bounded replay windows
- Sanitize legacy events; map versions; preserve ordering per key
- Idempotent projection updates; metric panels for replay progress
85) Consumer Patterns
- At‑least‑once with dedupe; small handler code; long‑running work offloaded
- Bulk consumers (batch process) for throughput; controlled commit cadence
- Side outputs for poison messages; gain visibility and control
86) Monitoring SLO Packs (Examples)
- Producer ack success ≥ 99.9% (5xx/acks)
- Consumer lag P95 < 2s (tier‑1); < 10s (tier‑2)
- Processing time p95 < 200ms
- DLQ rate < 0.1%; redrive success ≥ 95% within 24h
87) Governance: Change Windows
- Restrict schema changes before peak events (sales/launch)
- Blue/green and shadow only during freeze; rollback in runbooks
- Evidence: change approvals, schema diffs, backfills/redrives
88) Examples: Order Flow
- order.created → payment.authorized/payment.failed → order.confirmed/cancelled
- Saga ensures compensations; idempotent commands; events validated
- Observability: trace across producer→broker→consumer→DB
89) Examples: Notification Fan‑Out
- user.notified via email/SMS/push; retries per channel; DLQ escalation
- Idempotent sends with outbox; provider responses logged
90) Examples: Stream Enrichment
- clicks stream joined with user profile table (compacted)
- Output: enriched.clicks to analytics/storage; late events handled via grace
91) Blueprints Summary
- Reliable: idempotency, contracts, retries, DLQ, tracing
- Performant: partitions sized right, batching, compression
- Governed: owners, SLOs, change windows, evidence bundles
Mega FAQ (2001–2400)
- Time vs event ordering? Use event time for windows; enforce per‑key order via partitioning.
- Failure transparency? Emit failure events and DLQ; operators need visibility.
- Consumer idempotency without DB? State stores with TTL (KStreams/Flink); content hashes; inbox dedupe.
- EOS with Debezium? Treat as at‑least‑once; idempotent sinks; handle duplicates.
- Split topics vs single mixed? Split by domain; avoid huge fan‑outs; ACLs simpler per topic.
- Trace lost across async? Ensure headers in metadata; set span links when processing.
- Reroute hot partition? Repartition by composite key; adjust partitioner; scale consumers.
- Validate handler purity? Unit tests for pure core; side‑effect interfaces mocked.
- Stuck consumers? Alert on stagnant offsets; restart policy; increase poll records.
- Key rotation? Versioned keys; update producer and consumers; transition window.
- Multi‑tenant noisy neighbor? Quotas; separate consumer pools; per‑tenant DLQs.
- Schema handshake? Consumers verify schema on startup; fail fast if incompatible.
- Contract drift detection? Registry CI and production sampling; alert on unknown fields.
- Testing EOS? Fault injection and replay; verify state changes exactly once.
- Disaster comms? Templates; impact, ETA, mitigations; next update time.
- PII minimization? Avoid unnecessary fields; tokenize; encrypt sensitive payload parts.
- Broker upgrades? Rolling; partition migration schedules; thorough testing.
- Hybrid cloud? Mirror topics; edge processing; minimize cross‑cloud chat.
- Alert flapping? Hysteresis; burn‑rate alerts; dedup and inhibit rules.
Final: simplicity + idempotency + observability.
92) End‑to‑End Latency Budgeting
- Producer → Broker Ingress: batching + compression (aim < 10ms average)
- Broker → Consumer Fetch: poll cadence + fetch size (aim < 50ms average)
- Consumer Handling: handler p95 < 200ms; offload long work asynchronously
- Downstream Calls: circuit breakers; retries with jitter; timeouts aligned to SLO
- Total Budget: 300–500ms for tier‑1 where feasible; dashboards per segment
93) Contract Lifecycle and Registry Operations
- Propose change → validate compat in CI → stage producers/consumers → progressive rollout → deprecate old
- Registry policies: backward compatibility by default; breaking changes require CAB and version bump
- Evidence: schema diffs, approvals, and rollout windows
94) Blue/Green + Shadow Strategies (Detailed)
- Shadow Consumers: read new topic/version; write to shadow stores; parity compare (counts, metrics)
- Blue/Green Producers: emit to v1 & v2; compare downstream KPIs; cut over when within tolerance
- Rollback: stop v2 produce; continue processing v1; clear shadow deltas
95) Handler Reliability Patterns
- At‑least‑once + idempotent sinks as default; avoid long DB txns
- Split large jobs: chunk processing; commit progress frequently
- Side outputs to DLQ with context (schema version, error class)
96) Observability Playbooks
- Lag ↑, Throughput ↓: scale consumers; inspect hot partitions; reduce handler latency
- DLQ ↑: classify; fix; redrive; increase validation; tighten contracts
- Processing p95 ↑: profile handlers; cache; batch writes; limit downstream concurrency
97) Multi‑Region Consistency Patterns
- Single‑writer per aggregate across regions: route by key → region (ownership map)
- Conflict resolution: CRDT counters/sets; last‑write‑wins for non‑critical
- Replication health dashboards: lag, error rate, replay progress
98) Security Posture and PII
- Inventory topics for PII; encrypt fields; tokenize identifiers
- ACLs per team/tenant; least privilege; short‑lived credentials
- Legal holds and retention overrides; audit exports with evidence
99) Cost Guardrails (Advanced)
- Producer batch/linger tuned; zstd compression
- Partition counts aligned to consumer cores; avoid waste
- Offload payloads; pointer events; archive cold topics to storage
100) DR/BCP for Event Systems
- Broker cluster failover: promote standby; redirect producers/consumers; preserve offsets or resubscribe
- Region evacuation: DNS and client configs; rate‑limited replay; measure RTO/RPO
- Evidence bundles per drill: timelines, metrics, outcomes
101) Example Ownership and RACI
- Topic Owner: team‑payments
- On‑Call: payments‑oncall
- SLOs: producer 99.9% ack success; consumer lag P95 < 2s; DLQ < 0.1%
- Platform: broker operations; quotas; replication; DR
- Security: ACL reviews; PII audits; encryption policies
102) CI Gates and Synthetic Tests
- name: contract-tests
  run: registry check --compat backward schemas/
- name: synthetic-lag-test
  run: ./scripts/run_synthetic.sh --duration 60s --lag-threshold-ms 2000
103) Example End‑to‑End Trace
- Span 1 (producer.send orders.v1): topic, partition, key_hash
- Span 2 (broker.ack): acks=all; millis
- Span 3 (consumer.process orders.v1): lag_ms, handler ms
- Span 4 (db.upsert order): rows=1; duration
104) Platform Backlog
- Add per‑tenant quotas and dashboards; automate DLQ classification
- Schema diff bot in PRs; contract approval workflow
- Multi‑region replication lag panels and drills quarterly
105) Learning Resources
- Kafka/Streams/Connect docs; Flink docs; Debezium; OpenTelemetry messaging semantics
- "Designing Data‑Intensive Applications" (Kleppmann)
Mega FAQ (2601–3200)
- Producer retries vs duplicates? Retries create duplicates; rely on idempotent sinks and keys; cap retries.
- Sequence gaps per key? Out‑of‑order or loss; restore via replay; validate in projections.
- Grace period size? Based on lateness distribution; start P95*2; tune with metrics.
- EOS performance overhead? Higher latency/CPU; weigh vs idempotent sinks; benchmark.
- Consumer scaling: threads vs partitions? Bound threads ≤ partitions per instance; avoid context thrash.
- Should handlers call many downstream services? Keep handlers small; offload fan‑out; queue further work.
- Duplicate suppression window? TTL matched to max replay lag; clean inbox periodically.
- Topic per event or per aggregate? Per domain/event type common; avoid overly granular topics.
- Compaction lag side effects? Projections see stale state temporarily; design for eventual.
- Redrive safety? Rate limit; canary redrive; monitor p95 and error rate.
- Global consumer groups across regions? Prefer region‑local groups; mirror; reconcile.
- Event size limits? Set enforceable max; reject oversized; pointer to storage.
- Broker quota design? Per client/tenant/topic; alert on breaches; backpressure.
- Should I rely on HTTP for occasional events? Queues buffer and smooth; use EDA for bursty/offline resilience.
- Monitoring exemplars? Enable exemplars in histograms; drill to traces; fix hotspots fast.
- Data contracts vs schemas? Schema = structure; contract = semantics + SLAs + ownership.
- Archive strategy? Export to storage; index metadata; allow replays.
- Observability cost? Sample traces; aggregate metrics; avoid high‑cardinality labels.
- Who approves breaking schema? CAB + topic owner + consumers; version bump; migration plan.
Final: pick semantics, enforce idempotency, observe relentlessly.
106) Anti‑Patterns and How to Fix Them
- Synchronous chains disguised as events: fix by durable queues and decoupled handlers
- No idempotency: add keys and inbox/outbox; enforce upserts
- Too many tiny topics: consolidate by domain; use routing keys
- Ignoring contracts: enforce schema registry in CI; consumer‑driven tests
- Blind retries: backoff + jitter; classify errors; DLQ on threshold
107) Producer/Consumer Blueprints
Producer
- Validate schema; enrich metadata (trace/baggage); batch + compress; handle acks
Consumer
- Deserialize and validate; dedupe; handle; commit; metrics + traces
108) Synthetic Journeys
- Inject synthetic events with tags; verify end‑to‑end latency and error rates
- Panels for synthetic success and p95; alert when outside bounds
109) Change Management and CAB
- High‑risk: breaking schemas, topic retention/compaction changes, partition changes
- Require RFCs, roll plans, parity metrics, rollback steps, owners
110) Example CI/CD Templates (Extended)
name: eda-pipeline
on:
  pull_request:
  push:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pnpm i && pnpm lint
  contracts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: registry check --compat backward schemas/
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/test_broker.sh
  deploy:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform apply -auto-approve
111) Dashboard Library
{
"title": "EDA Ownership",
"panels": [
{"type":"table","title":"Topics → Owners","targets":[{"expr":"eda_topic_owner_info"}]},
{"type":"stat","title":"DLQ %","targets":[{"expr":"sum(rate(dlq_messages_total[5m]))/sum(rate(messages_processed_total[5m]))"}]}
]
}
112) Alert Library
- alert: ProducerAckDrop
  expr: (sum(rate(producer_errors_total[5m])))/(sum(rate(producer_send_total[5m]))+1e-6) > 0.01
  for: 10m
- alert: HandlerP95High
  expr: histogram_quantile(0.95, sum(rate(handler_duration_seconds_bucket[5m])) by (le)) > 0.2
  for: 10m
113) DR Evidence Pack
{
"drill": "region-failover-2025-10-27",
"rto_ms": 54000,
"rpo_ms": 12000,
"commands": ["promote", "redirect", "redrive"],
"outcomes": {"success": true}
}
114) Team Backlog Examples
- Add per‑tenant quotas; implement shadow consumers for all tier‑1 topics
- Automate DLQ classification and redrive; add cost dashboards
- Contract diff bot + CAB workflow for high‑risk schema changes
115) Knowledge Base Index
- Schemas & Contracts
- Outbox/Inbox Guides
- Retry/DLQ Policies
- Tracing/OTel Setup
- DR Playbooks and Evidence
116) Ownership, On‑Call, and SLOs
- Each topic/service has a named owner and on‑call rotation
- SLOs tracked: producer ack %, consumer lag P95, processing p95, DLQ %
- Error budget policy: freeze high‑risk changes when burn > 2×
117) DR Drills and Evidence
- Quarterly broker failover and region evacuation drills
- Scripts to promote/redirect and redrive; capture timelines and success
- Evidence bundles: artifacts, approvals, metrics before/after, outcomes
118) Expanded Examples
- Payments: auth → debit/credit → ledger with saga and idempotency
- Notifications: fan‑out via SNS→SQS/Topic→Queues; per‑channel retries
- Analytics: stream enrich → aggregate → store; nightly reconcile
119) Team Backlog and Roadmap
- Contract diff bot; mandatory registry checks in CI
- DLQ classification/redrive automation; per‑tenant quotas and dashboards
- Multi‑region replication lag panels; quarterly DR drills
120) References
- Kafka/Streams, Flink, Debezium, OpenTelemetry Messaging Semantics
- Designing Data‑Intensive Applications (Kleppmann)
- Confluent, RabbitMQ, Pub/Sub, Event Hubs field guides
Closing Notes
Event‑driven systems thrive on simplicity—clear semantics, idempotency everywhere, enforceable contracts, and relentless observability. Treat topics and handlers as products with owners, SLOs, and runbooks.
Mega FAQ (3601–4000)
- How to enforce contract use in all repos? CI templates + required checks; fail builds without registry validation.
- Can I share handler code across services? Only small libraries; avoid tight coupling; keep handlers small and pure.
- Stream joins exploding state? Tune windows; use time‑bounded joins; prune keys; externalize rarely.
- Producer retries too aggressive? Cap attempts; use backoff/jitter; alert on spikes; protect downstream.
- Graceful downgrade? Drop non‑critical processing first; protect tier‑1 consumers and topics.
- Multi‑tenancy visibility? Per‑tenant metrics, quotas, DLQs, and dashboards; owners accountable.
- Why shadow consumers? To validate parity before switching versions; avoid blind cutovers.
- EOS + outbox together? Fine; but outbox/idempotent sinks already robust; benchmark overheads.
- Can I rely on cloud auto‑scaling only? Add lag‑based autoscaling; prevent thrash; custom metrics help.
Final: semantics → idempotency → contracts → observability → drills.
121) Broker Tuning Cheat‑Sheet (Kafka, RabbitMQ, Pub/Sub)
Kafka
- acks=all, min.insync.replicas>=2, linger.ms 5–20ms, batch.size 64–256KB
- compression: zstd or lz4; max.in.flight.requests.per.connection=1 for EOS
- fetch.max.bytes, message.max.bytes tuned to payload sizes; replication.factor=3
RabbitMQ
- quorum queues for durability; prefetch per consumer; lazy queues for bulk
- publisher confirms enabled; delivery_mode=2; heartbeat configured; TLS only
Pub/Sub
- enable ordering keys if strict order per key; tune ack deadlines
- dead letter topics with maxDeliveryAttempts; message retention policies
122) Schema Evolution Deep Dive
- Strategy: forward+backward compatibility for N versions; no required field removals
- Additive changes: new optional fields with sensible defaults; enums append‑only
- Prohibit: field type changes, semantic overloads, reusing field IDs (Avro/proto)
- Process: proposal → diff review → contract tests → registry publish → phased rollout
123) Delivery Semantics in Practice
- At‑least‑once: default; needs idempotent handlers + dedupe keys
- At‑most‑once: acceptable for telemetry; never for money/state change
- Exactly‑once: combine outbox + idempotent sink or transactional writers; heavier
- Decision: pick per event type; document rationale and tests
124) Idempotency Storage Patterns
SQL
- table idempotency_keys(id TEXT PK, created_at TIMESTAMP, ttl)
- upsert on first process, ignore on duplicates; prune with TTL job
Redis
- SET key NX EX ttl; proceed only if set; store status for observability (sketched after this list)
S3
- write‑once object per key; HEAD before PUT; treat presence as processed
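The Redis variant, sketched with the ioredis client; key prefix and TTL are illustrative.
import Redis from 'ioredis';

const redis = new Redis();

// SET ... NX EX claims the key atomically: only the first caller gets 'OK',
// duplicates see null and skip processing. TTL bounds the dedupe window.
async function claimIdempotencyKey(key: string, ttlSeconds = 86_400): Promise<boolean> {
  const res = await redis.set(`idem:${key}`, 'processed', 'EX', ttlSeconds, 'NX');
  return res === 'OK';
}

async function handleOnce(eventId: string, process: () => Promise<void>) {
  if (await claimIdempotencyKey(eventId)) await process();
}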
125) Test Matrix and Golden Cases
- Contract: schema fixtures (valid, invalid, edge); fuzz required/optional paths
- Handler: unit tests for pure logic; property‑based tests for invariants
- Integration: dockerized broker; replay fixtures; verify offsets and side‑effects
- Chaos: drop, duplicate, delay; ensure bounded retries, DLQ thresholds
126) Local Stack (Compose Snippet)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
127) Blue‑Green for Consumers
- Deploy new consumer as shadow (no commits) → compare outputs to baseline
- Switch to committing group; observe parity dashboards and DLQ drift
- Rollback plan ready; feature flags gating new paths
128) Multi‑Region Replication Topologies
- Active/Active: bidirectional with conflict keys; higher ops cost; fast RTO
- Active/Passive: async mirror; simpler; potential RPO minutes
- Broker‑level mirroring vs app‑level replication; choose per semantics
129) Security Hardening Checklist (Extended)
- Network: private endpoints, mTLS everywhere, broker ACLs per principal
- Data: topic‑level classification; field encryption for PII; tokenization options
- Access: least privilege for producers/consumers; short‑lived creds; just‑in‑time
- Supply chain: signed client libs; SBOM; SLSA provenance for handlers
- Runtime: seccomp/AppArmor; read‑only FS; secret rotation runbooks
130) Cost Controls and Budgets
- Partition counts sized to throughput; rightsize broker tiers; compression on
- Batch aggressively; tune linger; eliminate chatty events; aggregate where possible
- DLQ sinks lifecycle policies; archive cold storage; sampling for low‑value topics
131) Capacity Planning Guide
- Measure: avg/peak bytes/s, msgs/s, key cardinality, p95 payload size
- Model: target utilization 50–60%; headroom for rebalances and failover
- Load test with stepped increases; record saturation points and error modes
132) Data Governance and PII
- Tag events with data classification; block PII on public or analytics topics
- Pseudonymize customer IDs; store mapping in protected vault with access audit
- Right to erasure: delete or tombstone per key; compaction friendly design
133) Compliance Footprints
- SOX: evidence of approvals for contract changes; DR drill evidence; access reviews
- HIPAA: BAAs for brokers; PHI encryption; minimum necessary disclosures
- GDPR: data minimization; retention policies; subject access workflows
134) SLOs and PromQL Queries
- Producer ack SLO: 99.9% in 5m windows → (1 - errors/sends) > 0.999
- Consumer lag P95 < 5s → histogram_quantile(0.95, lag_seconds)
- Handler success ratio > 99.5% → 1 - (errors/processed)
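Concrete PromQL forms of the three SLOs above, reusing metric names from the alert examples in this post; consumer_lag_seconds_bucket and handler_errors_total are assumed exporter names.
# Producer ack SLO over 5m windows
1 - (sum(rate(producer_errors_total[5m])) / sum(rate(producer_send_total[5m]))) > 0.999
# Consumer time-lag P95 (assumes a lag-in-seconds histogram is exported)
histogram_quantile(0.95, sum(rate(consumer_lag_seconds_bucket[5m])) by (le)) < 5
# Handler success ratio
1 - (sum(rate(handler_errors_total[5m])) / sum(rate(messages_processed_total[5m]))) > 0.995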
135) Benchmark Harness Outline
k6 run load.js # publish bursty and steady waves
vegeta attack | tee res # consumer HTTP side‑effects
promtool check rules # validate alerts
136) Backfill and Replay Runbook
- Create isolated replay consumer group; throttle consumption; pause on pressure
- Mark reprocessed keys to prevent duplicate side‑effects; write audit trail
- After replay, reconcile aggregates; update checkpoints; close incident doc
137) Archival, Retention, and Legal Hold
- Tiered storage for cold topics; lifecycle policies; encryption at rest
- Legal holds: freeze deletion; export snapshots; track chain of custody
138) Troubleshooting Heuristics
- Rising lag + normal CPU: key hotspots; consider repartitioning or key hashing
- High DLQ % after deploy: rollback first; inspect contract diffs; reprocess plan
- Producer acks falling: network or ISR shrink; check broker health
139) End‑to‑End Examples (Narrative)
Order Journey
1) order.created → payment.authorized → inventory.reserved → shipment.requested
2) saga compensations on failure; customer.notified at terminal states
3) analytics.updated for funnel; marketing.event for downstream batch
140) Extended References
- Kafka: Exactly‑Once Semantics docs; MirrorMaker 2; Transactions
- RabbitMQ: Quorum queues; Streams plugin; Federation
- GCP/AWS/Azure messaging best practices; OTel semantic conventions for messaging
Mega FAQ (4001–4100)
- How to roll contract changes with 0 downtime? Dual‑write/dual‑read; canary consumers; registry gating; shadow validation.
- What if I must change a field type? Emit new field/subject; deprecate old; migrate consumers; later remove.
- How to detect hot keys? Cardinality + skew panels; top‑N by partition utilization.
- When to split a topic? Divergent consumers, retention/SLOs differ, schema growth causing bloat.
- Can I share partitions across tenants? Yes with quotas and isolation policies; monitor noisy neighbor effects.
- Handling GDPR deletion with compaction? Tombstone per key and recompact; ensure downstream caches purge.
- Why are rebalances frequent? Flapping readiness, uneven partitions, too many consumers, missing cooperative mode.
- Is DLQ always required? For tier‑1; for low‑value telemetry, drop with sampling.
- What’s a good error budget policy? Freeze risky deploys when burn rate >2×; prioritize reliability backlog.
Final mantra: design for failure; observe relentlessly; evolve with contracts.