Database Sharding & Partitioning Strategies for Scale (2025)
Executive Summary
Sharding and partitioning enable horizontal scale and availability—but they add routing, consistency, and operations complexity. This guide provides patterns, checklists, and runbooks to design robust sharded systems.
1) Core Concepts
- Partitioning: split data within one logical database (by range/hash/list)
- Sharding: split data across multiple independent databases/servers
- Goals: throughput, storage scale, isolation, and fault containment
- Costs: cross‑shard queries/transactions, rebalancing, operational burden
2) Sharding Patterns
2.1 Range Sharding
- Key ranges assigned to shards; ordered scans are efficient
- Skew risk: hotspots at range heads/tails; mitigate by sub‑ranging
2.2 Hash Sharding
- Hash(key) → shard; uniform distribution; scans need fan‑out
- Spreads write load well; with well-chosen keys it also keeps hot tenants from concentrating on one shard
2.3 Directory/Lookup Sharding
- Central mapping service tenant→shard; supports per‑tenant placement/moves
- Single dependency; cache and replicate mapping for availability
2.4 Composite/Hybrid
- Example: tenant hash → shard, table partition by time inside shard
- Combines write distribution with time‑based data management
3) Choosing a Shard Key
- High cardinality, stable, naturally present in access paths (tenant_id, user_id)
- Avoid monotonic keys (auto-increment, UUID v1/v7) as range shard keys; they create hot ranges unless hashed or prefixed (see the sketch after this list)
- Support common queries without excessive fan‑out
- Document rationale and expected distribution
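A minimal sketch of the hash-prefix idea above, assuming Node's crypto; routingKey and the 4-character prefix length are illustrative choices, not a prescribed scheme:
import { createHash } from 'crypto'

// Illustrative: derive a routing key whose prefix is a short hash of tenant_id,
// so monotonic or low-entropy IDs still spread across key ranges.
export function routingKey(tenantId: string): string {
  const prefix = createHash('sha256').update(tenantId).digest('hex').slice(0, 4)
  return `${prefix}:${tenantId}` // e.g. '9f2c:tenant-42'
}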
4) Relational Partitioning (Postgres/MySQL)
4.1 Postgres Native Partitioning
CREATE TABLE orders (
  order_id   BIGINT,
  tenant_id  BIGINT,
  created_at TIMESTAMPTZ,
  amount     NUMERIC,
  PRIMARY KEY (order_id, created_at)
) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2025_10 PARTITION OF orders
  FOR VALUES FROM ('2025-10-01') TO ('2025-11-01');
-- Indexes per partition as needed
- Partition pruning boosts range queries; attach/detach partitions for archival
- Combine with hash partitioning where skew exists (sub‑partitioning)
4.2 MySQL/InnoDB Partitioning
CREATE TABLE events (
  id         BIGINT AUTO_INCREMENT,
  tenant_id  BIGINT,
  created_at DATETIME NOT NULL,
  data       JSON,
  PRIMARY KEY (id, created_at)  -- MySQL requires the partitioning column in every unique key
) PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p202510 VALUES LESS THAN (TO_DAYS('2025-11-01'))
);
- For very large tables, partition management and prune‑aware queries are key
5) Sharded Relational: Vitess, Citus
- Vitess: sharded MySQL with VSchema, routing, resharding, gtid‑aware moves
- Citus: sharded Postgres with distributed tables, reference tables, colocation
-- Citus example (distribution functions are invoked with SELECT)
SELECT create_distributed_table('orders', 'tenant_id');
SELECT create_distributed_table('events', 'tenant_id', colocate_with => 'orders');
-- colocation keeps joins on tenant_id local to each shard
-- Vitess VSchema (excerpt)
{
  "tables": {
    "orders": { "column_vindexes": [{ "column": "tenant_id", "name": "hash" }] }
  },
  "vindexes": { "hash": { "type": "hash" } }
}
6) NoSQL Sharding: MongoDB, Cassandra
6.1 MongoDB
sh.enableSharding("app");
sh.shardCollection("app.orders", { tenant_id: 1, _id: 1 });
- Choose shard keys with even distribution; consider hashed shard keys
- Zone sharding for residency and placement control
6.2 Cassandra
CREATE TABLE orders (
  tenant_id  bigint,
  order_id   timeuuid,
  created_at timestamp,
  amount     decimal,
  PRIMARY KEY ((tenant_id), created_at, order_id)
) WITH CLUSTERING ORDER BY (created_at DESC, order_id DESC);
- Partition key defines shard; tune RF/consistency (LOCAL_QUORUM)
7) Global & Secondary Indexes
- In sharded systems, global secondary indexes are expensive; prefer local indexes
- For lookup by alternate key, maintain a mapping (email→tenant_id) in a separate table/store
- Keep mapping updates idempotent and atomic with base writes (upsert)
8) Cross‑Shard Queries
- Fan‑out/fan‑in via router; limit result sets; paginate
- Pre‑aggregate per shard and combine; use analytics systems for heavy scans
- Colocate related data by design to minimize cross‑shard joins
9) Cross‑Shard Transactions
- Prefer Sagas (compensating actions) over 2PC for availability and simplicity
- If 2PC is required, constrain scope and frequency; monitor coordinator health
- Idempotency keys and message dedupe to achieve eventual consistency
10) Hot Key Mitigation
- Hash prefix: h(tenant_id)||tenant_id to spread writes
- Write sharding: multiple buckets per hot key; read with fan‑in
- Cache and queue writes; backpressure; increase partition count
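A minimal write-sharding sketch for the bucket idea above, assuming a fixed bucket count and Node's crypto; writeKey and readKeys are hypothetical helpers, and reads fan in over every bucket for the key:
import { createHash } from 'crypto'

const BUCKETS = 8 // assumed bucket count per hot key

// Spread writes for a hot key across several physical keys.
export function writeKey(hotKey: string, requestId: string): string {
  const bucket = createHash('sha1').update(requestId).digest().readUInt32BE(0) % BUCKETS
  return `${hotKey}#${bucket}`
}

// Reads fan in: enumerate every bucket and merge the results at the caller.
export function readKeys(hotKey: string): string[] {
  return Array.from({ length: BUCKETS }, (_, b) => `${hotKey}#${b}`)
}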
11) Multi‑Tenant Design
- Tenant‑centric shard key; per‑tenant rate limits and quotas
- Residency: pin tenants to regions/shards; directory maps tenant→shard/region
- Noisy neighbor controls: isolation pools and hot‑tenant throttling
12) Resharding & Rebalancing
- Consistent hashing to reduce movement on shard count changes
- Directory moves: move tenant N to shard K; dual‑write then cutover
- Online move protocols: copy → dual‑write → catchup → switch DNS/router → validate → retire
graph TD
Copy[Initial Copy] --> Dual[Dual‑Write]
Dual --> Catchup[CDC Catch‑up]
Catchup --> Switch[Routing Switch]
Switch --> Validate
Validate --> Retire
13) Online Migrations
- Dual‑write (old+new) with idempotency; capture + replay (CDC) for drift
- Shadow reads to validate parity; cutover with fast rollback
- Versioned schemas with backfills and guarded feature flags
14) Observability
- RED metrics per shard: throughput, error rate, p95 latency
- Hot key dashboards: top tenants/keys by QPS and latency
- Router metrics: fan‑out counts, slow shards, retry rates
# p95 by shard
histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, shard))
15) Backups, DR, and Consistency
- Per‑shard backups with tested restores; point‑in‑time where supported
- Region‑level DR: async replication; RPO/RTO documented; drills
- Consistency levels per operation; clarify client expectations
16) CI/CD and IaC
name: sharding-ci
on:
  pull_request:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest -q
  migrate:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: alembic upgrade head
# Terraform snippets for shard instances and routers
17) Runbooks
Hot Shard
- Identify offending keys; apply write sharding; throttle; plan reshard
Router Latency Spike
- Check slow shard health; enable partial responses; cache reads
Resharding Plan
- Simulate distribution; schedule copy; dual‑write and cutover; validate; retire
Related Posts
- ClickHouse Analytics Database: Performance Guide (2025)
- Data Pipeline Orchestration: Airflow, Prefect, Dagster (2025)
- Event‑Driven Architecture: Async Messaging (2025)
Call to Action
Designing a shard strategy or planning a reshard? We build keys, routers, and online migration playbooks with SLOs and observability.
Extended FAQ (1–220)
- When do I shard vs partition? Partition first within one DB; shard when single-node limits or isolation demands require it.
- How many shards to start with? A power of two (e.g., 8 or 16) for easier splits; simulate the distribution first.
- Can I change shard keys later? Plan for directory mapping plus an online move; it is costly, so choose wisely now.
- Cross-shard joins? Avoid them in hot paths; pre-aggregate; perform heavy analytics elsewhere.
- Read replicas in sharded systems? Yes, per shard for reads; route read-only traffic to replicas.
- How to detect hot keys? Top-K queries per key, dashboards, sampling, and autoscaling alerts.
- Resharding without downtime? Dual-write, CDC catch-up, then cutover with a quick rollback path.
- Multi-region + sharding? Combine residency with shard assignment; use zone/region-aware routing.
- Transaction semantics? Strong per shard; cross-shard via sagas; document the guarantees.
- Testing? Shadow traffic, chaos on shard nodes, and router failover simulations.
18) Router and Service Architecture
- Stateless routers in front of shard pools; config or directory-driven
- Client SDKs with hashing and retry policies; circuit breakers per shard
- Health checks and slow-shard detection; adaptive routing
// Example: hash-modulo router (not a true consistent-hash ring; see Section 23 for vnodes)
import { createHash } from 'crypto'

interface Shard { id: string; dsn: string }

const shards: Shard[] = [
  { id: 's1', dsn: 'postgres://s1' },
  { id: 's2', dsn: 'postgres://s2' },
  // ...
]

function ringHash(key: string): number {
  const h = createHash('sha1').update(key).digest()
  return h.readUInt32BE(0) // first 4 bytes as an unsigned 32-bit hash
}

function pickShard(key: string): Shard {
  // Modulo placement remaps most keys when the shard count changes; the vnode ring below avoids that.
  const idx = ringHash(key) % shards.length
  return shards[idx]
}
19) Directory Service Patterns
- tenant→shard mapping stored in HA KV (e.g., Postgres, etcd, Redis) with cache
- Moves: set tenant state=migrating, dual-write map; atomic switch; TTL caches
- Audit trails for moves; background repair for stragglers
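A sketch of the cached directory client described above; fetchMapping (the authoritative lookup against Postgres/etcd/Redis) and the 30-second TTL are assumptions:
interface Placement { shardId: string; version: number }

// Minimal directory client: a TTL cache in front of an authoritative lookup.
export function directoryClient(
  fetchMapping: (tenantId: string) => Promise<Placement>,
  ttlMs = 30_000,
) {
  const cache = new Map<string, { placement: Placement; expires: number }>()
  return {
    async resolve(tenantId: string): Promise<Placement> {
      const hit = cache.get(tenantId)
      if (hit && hit.expires > Date.now()) return hit.placement
      const placement = await fetchMapping(tenantId) // authoritative read
      cache.set(tenantId, { placement, expires: Date.now() + ttlMs })
      return placement
    },
    // Call on move events so cutovers take effect before the TTL expires.
    invalidate(tenantId: string): void {
      cache.delete(tenantId)
    },
  }
}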
20) Write Path and Idempotency
- Idempotency keys per business op; upsert patterns; unique constraints
- Retry-safe writes; de-dup queues; transactional outbox for events
-- Postgres upsert
INSERT INTO orders (tenant_id, order_id, ...)
VALUES ($1, $2, ...)
ON CONFLICT (tenant_id, order_id) DO UPDATE SET ...;
21) Read Path and Fan-Out
- Query planner chooses: targeted (single shard) vs scatter-gather
- Pagination and partial results for scatter-gather; timeouts per shard
- Pre-aggregates per shard; union/merge at router layer
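A sketch of scatter-gather with per-shard timeouts and partial results, assuming queryShard wraps a per-shard connection pool; the 500 ms default timeout is illustrative:
// Scatter-gather with a per-shard timeout; returns partial results plus the
// shards that failed so the caller can decide whether partial data is acceptable.
export async function scatterGather<T>(
  shardIds: string[],
  queryShard: (shardId: string) => Promise<T[]>,
  timeoutMs = 500,
): Promise<{ rows: T[]; failedShards: string[] }> {
  const withTimeout = (p: Promise<T[]>): Promise<T[]> =>
    Promise.race([
      p,
      new Promise<T[]>((_, reject) =>
        setTimeout(() => reject(new Error('shard timeout')), timeoutMs)),
    ])
  const settled = await Promise.allSettled(shardIds.map(id => withTimeout(queryShard(id))))
  const rows: T[] = []
  const failedShards: string[] = []
  settled.forEach((r, i) => {
    if (r.status === 'fulfilled') rows.push(...r.value)
    else failedShards.push(shardIds[i])
  })
  return { rows, failedShards }
}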
22) Global Views and Analytics
- ETL/ELT from shards to analytics warehouse/lakehouse
- Materialized views across shards; eventual consistency acceptable
- Use ClickHouse/Druid for global aggregations over events
23) Consistent Hashing Rings
- Virtual nodes to smooth distribution; replicate keys to N neighbors for HA
- On shard add/remove: minimal remapping; background warmup
// Pseudocode for vnodes
const vnodeFactor = 128
const ring: Array<{ hash: number; shard: string }> = []
for (const s of shards) {
  for (let i = 0; i < vnodeFactor; i++) {
    const h = ringHash(`${s.id}-${i}`)
    ring.push({ hash: h, shard: s.id })
  }
}
ring.sort((a, b) => a.hash - b.hash)
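Continuing the vnode snippet above, a lookup picks the first vnode clockwise from the key's hash; lookupShard is a name chosen here, and the binary search is a minimal sketch:
// Lookup: first vnode whose hash is >= the key's hash; wraps to ring[0] past the end.
function lookupShard(key: string): string {
  const h = ringHash(key)
  let lo = 0
  let hi = ring.length - 1
  let ans = 0 // default wraps around to the first vnode
  while (lo <= hi) {
    const mid = (lo + hi) >> 1
    if (ring[mid].hash >= h) { ans = mid; hi = mid - 1 } else { lo = mid + 1 }
  }
  return ring[ans].shard
}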
24) Saga Patterns (Cross‑Shard)
- Orchestrated: coordinator drives steps and compensations
- Choreographed: events drive next steps; outbox per shard; compensations on failure
- Keep steps idempotent; persistent saga state
// Orchestrated saga sketch: steps are idempotent; failures trigger compensations
async function createOrderSaga(ctx) {
  await reserveInventory(ctx)
  try {
    await chargePayment(ctx)
    await recordLedger(ctx)
  } catch (e) {
    await releaseInventory(ctx) // compensate; a full orchestrator would also refund a successful charge
    throw e
  }
}
25) Two‑Phase Commit (When You Must)
- Limit to low‑frequency admin operations; robust coordinator HA
- Monitor phase1 prepare timeouts and orphaned locks
26) Vitess Operations
- Resharding workflow: VReplication-based Reshard (successor to vtworker SplitClone) → filtered replication → SwitchReads/SwitchWrites (SwitchTraffic in newer releases)
- vttablet autoscaling and throttling; vtgate routing metrics
- Online schema changes via gh-ost/pt-osc integrated paths
27) Citus Operations
- Rebalance shards: rebalance_table_shards(); citus_move_shard_placement() for targeted moves
- Reference tables for small dims; colocate distributed tables for joins
- Coordinator and worker monitoring; pg_dist* catalogs
28) MongoDB Chunk Management
- Split and move chunks to balance; autosplit tuning
- Zone sharding for residency; balancer windows for off‑peak moves
29) Cassandra Considerations
- RF=3 typical; LOCAL_QUORUM for low latency and consistency
- Tombstone management; choose a compaction strategy (STCS/LCS/TWCS for time-series data)
30) Hotspot Playbook
- Identify hot tenants/keys via top‑K; hash‑prefix bucketization
- Queue writes; apply backpressure; cache reads aggressively
- Plan permanent reshard/placement changes
31) Query Examples
-- Single‑shard targeted query
SELECT * FROM orders WHERE tenant_id = $1 AND order_id = $2;
-- Time range in partitioned table
SELECT * FROM orders WHERE created_at >= $1 AND created_at < $2;
32) Cross‑Shard Aggregation Example
// Fan-out count across shards; tolerates shard failures (result undercounts if any shard fails)
async function countOrders(start: Date, end: Date): Promise<number> {
  const tasks = shards.map(s => client(s).count({ start, end })) // client() = per-shard DB client (assumed)
  const results = await Promise.allSettled(tasks)
  return results
    .filter((r): r is PromiseFulfilledResult<number> => r.status === 'fulfilled')
    .map(r => r.value)
    .reduce((a, b) => a + b, 0)
}
33) Online Move Protocol (Detailed)
1) Copy bulk snapshot from source shard to target
2) Enable CDC/replication from source to target
3) Mark tenant as migrating; begin dual‑write (source+target)
4) Catch up replication lag; pause briefly if needed
5) Switch router directory to target; invalidate caches
6) Monitor errors/latency; rollback to source if SLOs degrade
7) Retire source copy after TTL; stop dual‑write
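A condensed orchestration sketch of the seven steps above; every primitive in MoveOps is a hypothetical, idempotent helper, and the lag threshold and polling interval are placeholders:
// Hypothetical move primitives; each is assumed idempotent and resumable.
interface MoveOps {
  bulkCopy(tenant: string, from: string, to: string): Promise<void>
  startCdc(tenant: string, from: string, to: string): Promise<void>
  setDualWrite(tenant: string, enabled: boolean): Promise<void>
  replicationLagMs(tenant: string): Promise<number>
  switchDirectory(tenant: string, to: string): Promise<void>
  sloHealthy(tenant: string): Promise<boolean>
}

export async function moveTenant(ops: MoveOps, tenant: string, from: string, to: string) {
  await ops.bulkCopy(tenant, from, to)                 // 1) snapshot copy
  await ops.startCdc(tenant, from, to)                 // 2) stream changes from source to target
  await ops.setDualWrite(tenant, true)                 // 3) dual-write source + target
  while ((await ops.replicationLagMs(tenant)) > 1000) {
    await new Promise(r => setTimeout(r, 5_000))       // 4) wait for catch-up
  }
  await ops.switchDirectory(tenant, to)                // 5) route to target; invalidate caches
  if (!(await ops.sloHealthy(tenant))) {
    await ops.switchDirectory(tenant, from)            // 6) rollback if SLOs degrade
    return
  }
  await ops.setDualWrite(tenant, false)                // 7) stop dual-write; retire source after TTL
}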
34) Observability Dashboards
{
"title": "Shard Health",
"panels": [
{"type":"timeseries","title":"p95 per shard","targets":[{"expr":"histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, shard))"}]},
{"type":"table","title":"Top Hot Tenants","targets":[{"expr":"topk(20, rate(db_queries_total{dimension='tenant'}[5m]))"}]}
]
}
35) Alert Catalog
- alert: ShardHotKey
  expr: topk(1, rate(db_queries_total{dimension='tenant'}[1m])) by (shard) > 1000
  for: 5m
- alert: RouterFanoutSpike
  expr: rate(router_scatter_gather_total[5m]) > 100
  for: 10m
36) Backup and Restore
- Per‑shard PITR or snapshot + binlog/wal; test quarterly
- Restore drills: pick random shard; restore to sandbox; validate checksums
37) DR Strategy
- Async cross‑region replication per shard; DNS/monitors for failover
- RPO/RTO per tier; regular exercises with times recorded
38) Cost Modeling
component,unit,qty,unit_cost,monthly
primary_instance,hr,720,0.40,288
replica_instance,hr,720,0.30,216
storage_gb,GB,2000,0.10,200
egress_tb,TB,5,90,450
- Reduce fan‑out queries; cache hot tenants; rightsize instances; compress
39) Security and Compliance
- Row/tenant isolation; authz mid‑tier; encrypted at rest/in transit
- Audit log per shard; WORM storage for compliance
- Residency zones enforced via routing and zone sharding
40) CI/CD and Schema Changes
- Backward‑compatible changes; avoid locking operations; online DDL tools
- Release trains with feature flags for data access patterns
41) Anti‑Patterns
- UUID v4 as a range key (random values scatter writes and break scan locality); over-normalization across shards
- Global transactions in hot path; hidden cross‑shard joins via ORM
- One‑off shard special cases; snowflake routing rules
42) Case Studies (Condensed)
- SaaS CRM: directory sharding by tenant; hashed buckets for hot tenants; SLO p95 ‑35%
- Gaming: range sharding by region+time; global analytics in ClickHouse; cost ‑22%
43) Playbooks
- Hot tenant mitigation; reshard checklist; online move rollback plan
- Router outage drill; shard node failure recovery; replica promotion
44) Learning Path
- Start: partitioning; then directory or hash sharding pilots
- Mature: router layer, online moves, observability; DR drills
Mega FAQ (221–700)
- How many virtual nodes? 128–1024 per shard is typical; balance memory use against distribution smoothness.
- Router state storage? A read-only cache plus an authoritative directory; invalidate on change.
- Directory consistency? Use transactional updates, versioned entries, and TTL caches.
- ORM support? Custom plugins for shard hints and fan-out; avoid magic joins.
- Colocated tables? Yes: colocate by shard key for local joins; replicate reference tables.
- Fan-out pagination? Page per shard and merge in the router; beware skew.
- Secondary key lookups? Mapping tables or a search index; strong consistency where needed.
- Read-your-writes? Route to the primary, or use session stickiness with LSN/GTID checks.
- Per-tenant throttling? Token buckets at the router; backpressure to clients.
- Cross-region writes? Prefer per-region primaries; reconcile later; avoid global writes.
- Why not 2PC? Availability and coordinator-failure risks; sagas are simpler.
- Detect hot keys? Top-K real-time metrics, histograms, and alerts on z-score spikes.
- Partition count? Enough to parallelize and compact; too many hurts operations.
- Schema changes on large tables? Online DDL, chunked migrations, dual-write and backfill.
- Tenant move duration? Depends on size and write rate (minutes to hours); schedule off-peak.
- Backup size explosion? Compression, PITR windows, and excluding replica logs.
- Disaster failback? Reverse replication, data validation, and a staged cutback.
- Multi-cloud sharding? Zone sharding; a directory aware of providers; egress budgets.
- Can I shard later? Design routes and keys early; pilot a directory; avoid hard rewrites.
Final: design for growth, test moves, observe always.
45) Key Selection Deep Dive
- Access patterns first: most queries should hit a single shard
- Stability: key should not change over entity lifetime
- Cardinality: millions+ unique values to distribute evenly
- Composability: composite keys (tenant_id, entity_id) for colocation
- Hotset analysis: guard against a few tenants dominating
46) Skew Detection and Mitigation
- Metrics: p95 by shard, top‑K tenants/keys, queue depths
- Automatic skew detectors: z‑score on per‑key QPS vs mean
- Mitigate: hash prefixes, split hot tenants, localized caches, backpressure
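A minimal z-score detector matching the skew heuristic above, assuming the caller supplies per-key QPS for the current window; the threshold of 3 standard deviations is an illustrative default:
// Flag keys whose QPS is more than `threshold` standard deviations above the window mean.
export function hotKeys(qpsByKey: Map<string, number>, threshold = 3): string[] {
  const values = [...qpsByKey.values()]
  if (values.length < 2) return []
  const mean = values.reduce((a, b) => a + b, 0) / values.length
  const variance = values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length
  const std = Math.sqrt(variance) || 1 // avoid divide-by-zero on perfectly uniform load
  return [...qpsByKey.entries()]
    .filter(([, qps]) => (qps - mean) / std > threshold)
    .map(([key]) => key)
}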
47) Router Retry and Circuit Breaking
// Sketch: per-shard client with error metrics and transient-error retry.
// pool, metrics, isTransient, and retry are assumed helpers; see the breaker sketch after this section.
class ShardClient {
  async query(shardId: string, sql: string, args: any[]) {
    try {
      return await this.pool[shardId].query(sql, args)
    } catch (e) {
      this.metrics.inc('errors', { shard: shardId, error: e.code })
      if (isTransient(e)) return await this.retry(shardId, sql, args) // retry with backoff + jitter
      throw e
    }
  }
}
- Retries with jitter for transient errors; circuit open on sustained failures
- Prefer partial results over global failures for fan‑out queries
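The retry and circuit-open behavior referenced above can be sketched as a per-shard breaker with full-jitter backoff; the failure threshold, cooldown, and retry counts are assumptions:
// Simplified per-shard circuit breaker with full-jitter retry backoff.
class ShardBreaker {
  private failures = 0
  private openUntil = 0
  constructor(private threshold = 5, private cooldownMs = 10_000) {}

  async run<T>(fn: () => Promise<T>, maxRetries = 2): Promise<T> {
    if (Date.now() < this.openUntil) throw new Error('circuit open')
    for (let attempt = 0; ; attempt++) {
      try {
        const result = await fn()
        this.failures = 0 // success closes the breaker
        return result
      } catch (err) {
        if (++this.failures >= this.threshold) this.openUntil = Date.now() + this.cooldownMs
        if (attempt >= maxRetries || Date.now() < this.openUntil) throw err
        const backoff = Math.random() * Math.min(1000, 100 * 2 ** attempt) // full jitter
        await new Promise(r => setTimeout(r, backoff))
      }
    }
  }
}
A router would keep one breaker per shard and wrap calls like ShardClient.query in breaker.run.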
48) Schema Design for Shards
- Colocate tables by shard key; avoid cross‑shard foreign keys
- Reference tables replicated to all shards; version them
- Use surrogate keys with shard key prefix (tenant_id:order_id)
49) Search and Alternate Lookups
- Maintain email→tenant mapping with strong consistency
- Use search indices (Elasticsearch/OpenSearch) for full‑text, then fetch by primary key
- Keep write path idempotent across base + mapping updates
50) Testing and Shadowing
- Replay production traffic into shadow shard/router; compare results
- Golden queries for latency; regression budget
- Online move dry‑runs on staging data with CDC simulation
51) Performance Benchmarks
- Read QPS per shard vs latency percentiles
- Write throughput under hash vs range keys
- Fan‑out overhead for typical scatter‑gather queries
52) Toolkit and Libraries
- Router SDKs: hash rings, directory clients, health/metrics
- Migration tools: CDC connectors, copy+verify, consistency checkers
- Observability: shard labels everywhere; exemplars with shard_id
53) Practical Limits and Planning
- Size shards to fit failure domains and backup windows
- Start with more shards than needed; keep headroom for hotspots
- Plan rebalancing cadence; automation and runbooks in place
54) Governance and Ownership
- Each shard set has an on‑call rotation and SLOs
- Access approvals and audits per shard; emergency access logged
- Change windows for resharding and online schema changes
55) Security Patterns
- AuthN at edge; AuthZ at mid‑tier; signed tokens carry tenant info
- Encrypt at rest with per‑shard keys if required; rotate regularly
- Database users scoped per shard; no cross‑shard superusers in app path
56) Residency and Sovereignty
- Zone shards by region/country; directory enforces placement
- Data export controls; lawful basis and consent tracking external to storage
- Residency test suites in CI for new features
57) Cost Controls
- Instance rightsizing; scale read replicas by traffic
- Cache hot tenants; reduce scatter‑gather; page results
- Storage compaction and index tuning; periodic reviews
58) Runbooks (Expanded)
Shard Node Failure
- Promote replica; reconfigure router; rebuild replica
Directory Inconsistency
- Freeze moves; reconcile mapping; invalidate caches
Write Amplification Detected
- Evaluate idempotency keys; dedupe queue; consolidate updates
59) Case Study: Multi‑Tenant SaaS
- Directory sharding by tenant; per‑tenant rate limits; noisy neighbor isolation
- Hot tenants split with hash prefixes; read replica per shard for BI
- Result: p95 latency ‑28%, cost ‑15% with cache and reduced fan‑out
60) Example Dashboards JSON
{
"title": "Router Overview",
"panels": [
{"type":"timeseries","title":"Requests/s","targets":[{"expr":"sum(rate(router_requests_total[1m]))"}]},
{"type":"timeseries","title":"Fanout","targets":[{"expr":"rate(router_scatter_gather_total[5m])"}]},
{"type":"table","title":"Top Hot Keys","targets":[{"expr":"topk(20, rate(db_queries_total{dimension='tenant'}[5m]))"}]}
]
}
61) Example Alerts
- alert: HotShardLatency
  expr: histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, shard)) > 0.2
  for: 10m
- alert: DirectoryMismatch
  expr: increase(router_directory_mismatch_total[10m]) > 0
  for: 1m
62) Migration Playbook Checklists
- Pre: capacity checks; throttle plan; rollback plan; comms ready
- During: copy throughput, lag charts, error monitors, directory switch
- Post: parity checks; perf compare; cost review; cleanup
63) Learning Path
- Prototype on staging with directory + router; simulate hot keys
- Add observability; draft reshard runbooks; perform one online move
- Adopt in production for a low‑risk tenant cohort first
Mega FAQ (701–1200)
- When does directory beat hash? When per-tenant control and moves matter more than uniform distribution.
- How to choose shard count? Estimate peak QPS/storage; pick 2–4x headroom; prefer a power of two.
- Why is my range shard hot? Monotonic keys; add a hash prefix or sub-ranges; buffer writes.
- Can I colocate BI replicas? Yes: per-shard replicas; route BI there; cap heavy queries.
- Router in app vs sidecar? An SDK in the app is common; use a sidecar/mesh if language diversity is high.
- Global monotonic IDs? Use time-ordered IDs per shard (Snowflake-style) and include a shard prefix.
- Bulk imports? Throttle per shard; precompute routes; disable secondary indexes during load.
- What if a shard outgrows storage? Split; move tenants; archive cold partitions; add replicas.
- How to avoid thundering herds on fan-out? Cache partials; rate-limit; paginate and prefetch.
- ORM N+1 across shards? Disable lazy cross-shard loads; fetch explicitly by shard; aggregate in the router.
- Dual-write drift risks? Idempotent writes; CDC reconciliation; audit diffs; timebox the dual-write.
- Can I use a single analytics DB instead? Yes for analytics; not for OLTP shard hot paths.
- Read-after-write on replicas? Use LSN/GTID fencing; stick to the primary until the replica catches up.
- Tenant deletion? Directory tombstones; wipe data per shard; audit with evidence.
- Blue/green router? Run canary router instances; compare metrics; switch gradually.
- Query routing hints? Add shard-key constraints; deny scatter-gather in hot paths.
- On-prem + cloud hybrid? Zone shards; the directory knows placement; avoid cross-DC hot paths.
- Tracing best practice? Span attributes include shard_id, tenant_id (hashed), and router_op.
- Backpressure signals? Queue depths, 5xx by shard, circuit-open counts; trigger client throttling.
Final: shard for scale, operate with discipline.
64) Cross‑Shard Caching Strategies
- Per‑tenant cache keys to absorb hot reads; invalidation via outbox events
- Read‑through cache for reference tables replicated across shards
- Partial aggregation cache for scatter‑gather queries
65) Paging and Sorting Across Shards
- Keyset pagination per shard; merge K sorted lists at router
- Avoid OFFSET/LIMIT across shards; use (timestamp,id) cursors
- Return next‑cursor per shard; client requests subsequent pages
66) Consistency Models and Client Expectations
- Per‑shard strong consistency; cross‑shard eventual
- Read‑your‑writes via primary routing and LSN fencing
- Document guarantees per API; provide idempotency endpoints
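A sketch of the LSN fencing mentioned above; the two position callbacks are assumptions (for Postgres, think pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on a replica, compared as monotonically increasing numbers):
interface Session { lastWriteLsn: bigint }

// After a write, remember the primary's commit position for this session.
export async function recordWrite(
  session: Session,
  primaryCommitLsn: () => Promise<bigint>,
): Promise<void> {
  session.lastWriteLsn = await primaryCommitLsn()
}

// Use a replica only if it has replayed past the session's last write.
export async function readTarget(
  session: Session,
  replicaReplayLsn: () => Promise<bigint>,
): Promise<'replica' | 'primary'> {
  return (await replicaReplayLsn()) >= session.lastWriteLsn ? 'replica' : 'primary'
}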
67) Schema Versioning and Compatibility
- Backwards‑compatible reads across versions; dual‑write new fields
- Use feature flags to gate fields; remove after adoption
- Version selection at router or app layer for long migrations
68) Online DDL and Zero‑Downtime Changes
- Postgres: pg_repack/CONCURRENT indexes; avoid table locks
- MySQL: gh‑ost/pt‑osc with throttling; monitor replication lag
- Coordinate across shards with orchestration window and rollback
69) Failure Modes and Chaos Testing
- Kill a shard primary: ensure replica promotion < SLO; router failover
- Directory corruption simulation: read‑only mode; reconcile job
- Split‑brain prevention: strict promotion rules and fencing tokens
70) Rate Limiting and Fairness
- Token buckets per tenant; priority tiers; burst allowances
- Router applies quotas before DB; 429 with Retry‑After
- Prevent tenant monopolization; fairness across shards
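A minimal in-memory token bucket keyed by tenant, matching the quota idea above; the rate and burst values are assumptions, and a production router would back the buckets with shared state:
interface Bucket { tokens: number; lastRefill: number }

const RATE_PER_SEC = 100   // assumed steady per-tenant rate
const BURST = 200          // assumed burst allowance
const buckets = new Map<string, Bucket>()

// Returns true if the request is admitted; false means respond 429 with Retry-After.
export function allow(tenantId: string): boolean {
  const now = Date.now()
  const b = buckets.get(tenantId) ?? { tokens: BURST, lastRefill: now }
  b.tokens = Math.min(BURST, b.tokens + ((now - b.lastRefill) / 1000) * RATE_PER_SEC)
  b.lastRefill = now
  buckets.set(tenantId, b)
  if (b.tokens < 1) return false
  b.tokens -= 1
  return true
}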
71) Security Deep Dive
- Per‑tenant encryption keys where mandated; rotate with minimal downtime
- Audit statement logging with shard_id and tenant hash
- Secrets management per shard node; no shared superusers
72) Residency Test Suite
- Validate dataset writes land in region‑appropriate shards
- GeoIP tests for user flows; fail builds on violations
- Periodic audits and evidence bundles for regulators
73) Migration Tooling Overview
- Snapshot + CDC: parallel apply; detect drift; reconcile
- Checksums and sample validation; parity queries per table
- Replay window; soft‑cut with fast rollback
74) Cost Engineering for Sharded Systems
- Instance right‑sizing per shard; reserved/committed discounts
- Storage tiering of cold partitions; compaction schedules
- Reduce scatter‑gather; leverage pre‑aggregates and caches
75) Platform Interfaces and Golden Paths
- APIs: shard routing SDK, directory client, outbox utilities
- Templates for schemas, indexes, and migration runbooks
- Dashboards and alert packs per shard/router/tenant
76) Example CI/CD Templates
name: db-migrate
on: [push]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: sqlfluff lint migrations/
  migrate:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: alembic upgrade head
77) Operational KPIs
- p95/99 per shard; error rate; queue depth; fan‑out rate
- Hot tenant QPS; top N keys; replica lag and promotion times
- Backup duration and restore success rate; move durations
78) Example Policies (OPA)
package shards.policies

hot_key_throttle[tenant] {
  input.metrics.tenants[tenant].qps > 1000
}
79) Reference Dashboards (JSON)
{
"title": "Shard Movers",
"panels": [
{"type":"timeseries","title":"Copy Throughput","targets":[{"expr":"rate(move_bytes_total[1m])"}]},
{"type":"stat","title":"Active Moves","targets":[{"expr":"sum(move_active)"}]}
]
}
80) Extended Runbooks
Replica Lag Spike
- Throttle writes; tune replication; re‑balance read traffic; investigate long transactions
Router CPU Saturation
- Scale out; cache directory aggressively; optimize fan‑out paths
Move Stuck in Catch‑up
- Inspect CDC lag; temporarily pause producers if necessary; chunk backlog
Mega FAQ (1201–1800)
- What if the shard key is missing in requests? Reject, or look it up via mapping; avoid implicit fan-out.
- Why is fan-out so costly? Parallel requests, merges, and timeouts; use pre-aggregates and caches.
- Should I encrypt per shard? Only when required; it increases key operations; plan rotations carefully.
- Can I colocate queue + DB per shard? Yes, for isolation; ensure backpressure and monitoring.
- How to test a reshard safely? Shadow tenants, synthetic load, and a rollback plan with timers.
- Lock contention on hot rows? Shard deeper; use optimistic concurrency; queue writes.
- GSI simulation? Maintain mapping tables with strongly consistent writes.
- Monitoring for runaway tenants? QPS per tenant, automatic throttling, and alerts with context.
- Do I need a mesh for the DB? mTLS is enough; a mesh is optional for policy/telemetry.
- Avoid accidental cross-shard joins? Lint ORM queries; enforce hints; block on the router.
- Should I run per-tenant read replicas? If BI is heavy; control costs and query plans.
- E2E tracing? Router span → shard query span with shard_id and key hash.
- Are global transactions ever okay? Rare admin operations; monitor and alert; keep volume low.
- Can I mix range + hash? Yes (composite): hash(tenant) → shard; partition by time inside the shard.
- Version drift across shards? Strict release trains, orchestrated upgrades, canaries.
- Data loss on moves? CDC parity checks, idempotent replays, audits.
- Alert fatigue? Deduplicate; severity levels; actionable thresholds.
- Onboarding new tenants? Placement rules; warm caches; baseline quotas.
- Shard retirement? Drain tenants; decommission; reclaim resources; archive.
Final: measure, iterate, and keep moves reversible.
81) Shard‑Aware Application Design
- API layer requires shard key in critical endpoints; reject ambiguous queries
- Background jobs accept shard/tenant scopes; parallelize safely
- Feature flags to gate cross‑shard features; staged rollouts
82) Client‑Side Routing vs Service Router
- Client‑Side: low latency, fewer hops; SDK complexity; upgrades per language
- Service Router: centralized policy, simpler clients; extra hop; horizontal scale router
- Hybrid: thin client + LB router; versioned routing rules
83) Transaction Outbox and Inbox Patterns
- Outbox table per shard; events written transactionally with state changes
- Inbox to dedupe external events; idempotency and exactly‑once semantics
- Reliable async workflows across shards
CREATE TABLE outbox (
  id           bigserial PRIMARY KEY,
  tenant_id    bigint,
  aggregate_id bigint,
  type         text,
  payload      jsonb,
  created_at   timestamptz DEFAULT now(),
  delivered    boolean DEFAULT false
);
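A polling dispatcher for the outbox table above can be sketched as follows; it assumes node-postgres, an illustrative DSN, and a publish callback for the broker, and relies on FOR UPDATE SKIP LOCKED so several workers can drain the table without double delivery:
import { Pool } from 'pg' // assumes node-postgres; the DSN is illustrative

const pool = new Pool({ connectionString: 'postgres://shard-s1/app' })

// Poll undelivered outbox rows, publish them, and mark delivery in one transaction.
// If the commit fails after publishing, rows are re-published later, so consumers dedupe by outbox id.
export async function drainOutbox(
  publish: (type: string, payload: unknown) => Promise<void>,
  batch = 100,
): Promise<void> {
  const client = await pool.connect()
  try {
    await client.query('BEGIN')
    const { rows } = await client.query(
      `SELECT id, type, payload FROM outbox
        WHERE delivered = false
        ORDER BY id
        LIMIT $1
        FOR UPDATE SKIP LOCKED`,
      [batch],
    )
    for (const row of rows) {
      await publish(row.type, row.payload)
      await client.query('UPDATE outbox SET delivered = true WHERE id = $1', [row.id])
    }
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}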
84) Shard‑Local Queues and Workers
- Queue per shard improves data locality; reduces cross‑shard chatter
- Worker pools sized per shard; fairness and isolation
- Router enqueues to correct shard based on key
85) Query Shaping and Hints
- Enforce WHERE shard_key = ? for OLTP paths
- Provide /* shard:tenant_id */ hints for fallbacks
- Deny scatter‑gather in hot endpoints; allow only in reporting paths
86) Time‑Series within Shards
- Partition by time within shard; drop/compact old partitions
- Hot/cold strategies; move cold partitions to cheaper storage
- Time‑bucketed indexes and rollups
87) Strong vs Eventual per Operation
- Create/Update: route to primary (strong)
- Read: primary for read‑your‑writes; replica for later reads
- Cross‑shard summaries: eventual with periodic recompute
88) Testing Strategies and Tooling
- Unit: routing correctness; hash ring math; key derivation
- Integration: router + shard stubs; shadow traffic replay
- Resilience: fault injection on shard nodes and network partitions
89) Partition Maintenance Automation
- Jobs to create next N partitions; drop/attach archives
- Alerts on missing/oversized partitions; vacuum schedules
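One way to implement the "create the next N partitions" job is to generate the DDL and run it on a schedule; the function below is a sketch matching the monthly range layout used earlier, and its name and signature are illustrative:
// Generate CREATE TABLE ... PARTITION OF statements for the next N months.
export function nextMonthlyPartitions(table: string, from: Date, months: number): string[] {
  const ddl: string[] = []
  for (let i = 0; i < months; i++) {
    const start = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth() + i, 1))
    const end = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth() + i + 1, 1))
    const suffix = `${start.getUTCFullYear()}_${String(start.getUTCMonth() + 1).padStart(2, '0')}`
    ddl.push(
      `CREATE TABLE IF NOT EXISTS ${table}_${suffix} PARTITION OF ${table}\n` +
      `  FOR VALUES FROM ('${start.toISOString().slice(0, 10)}') TO ('${end.toISOString().slice(0, 10)}');`,
    )
  }
  return ddl
}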
90) Global Search and Secondary Stores
- Use search indices for alternate filters; store doc with shard key
- Fetch canonical record by (tenant_id, id) after search hit
- Sync via outbox; monitor lag and backpressure
91) Multi‑Region Sharding Topologies
- Region‑local shards; directory encodes region
- Read locality first; write per region; reconcile globally offline
- Disaster routing switches region; RTO/RPO per tier
92) Observability: Exemplars and Traces
- Histograms with exemplars referencing trace_id; drill to slow query
- Span attributes: shard_id, tenant_hash, router_op, fanout_count
- Correlate router metrics with DB metrics by shard labels
93) Privacy and Access Controls
- Tokenize PII; avoid storing high‑risk fields in routing metadata
- Audit per‑tenant access; evidence for compliance
- Residency rules at directory; deny writes outside allowed regions
94) Change Management and Approvals
- CAB for resharding and schema changes on large tables
- Pre‑production validation with synthetic load; performance guardrails
- Post‑change review: SLO deltas and cost impacts
95) Platform Roadmap
- Automate tenant move flows; self‑service APIs with approvals
- Shard simulator: distribution checks and hot key forecasts
- Golden dashboards and alerts packaged per service
96) Reference Implementations
- Vitess deployment with vtgate autoscale and vtctld ops runbooks
- Citus coordinator/worker pools with colocated tables
- MongoDB sharded clusters with zone sharding for residency
97) Training Curriculum
- Keys and routing; directory and hash rings
- Migrations and CDC; sagas vs 2PC; idempotency
- Observability; hot key mitigation; DR drills
98) Long‑Running Analytics vs OLTP
- Keep OLTP shards for writes; export to analytics for heavy scans
- Or use HTAP systems with strict resource limits; protect OLTP
99) Blue/Green Router and Canary
- Deploy router v2 as green; mirror N% traffic; compare p95/error
- Switch weight gradually; rollback quickly on SLO breaches
100) Communication Templates
- Tenant move notices; window, risk, and rollback plan
- Incident updates for shard issues; blast radius and ETA
Mega FAQ (1801–2400)
- Can I avoid sharding by scaling vertically? Only to a point; plan shard keys before you hit vertical limits.
- Should I colocate caches with shards? Yes: it reduces network hops; shard caches the same way.
- Is a per-tenant DB simpler? Yes, but costly; a directory is still needed, and many small DBs add overhead.
- How to find a shard without the key? Use mapping lookups (email→tenant), then fetch by primary key.
- Strong consistency across shards? Rarely justified; use sagas; limit global operations.
- Slow shard detection? Compare p95 vs the fleet median; isolate; investigate storage/CPU.
- Why is my router CPU high? Fan-out, JSON marshalling, lack of caching; scale out.
- Avoid double counting on fan-in? Idempotent reducers, distinct keys, careful merges.
- Which shard key for a social feed? user_id for writes; store the timeline per user so reads stay local.
- What about graph data? Shard by vertex id; minimize edge cuts; cache cross-shard edges.
- Hybrid keys? hash(tenant_id) → shard; partition by time inside the shard.
- Read timeouts? Per-shard timeouts, partial returns, retries with backoff.
- Testing with production data? Redact; sample; use synthetic data where necessary; privacy first.
- Mixing SQL and NoSQL shards? Yes; route by capability; unify through a service layer.
- Why do moves take so long? Copy volume, CDC lag, and throttling; plan windows.
- Rename tenants? A directory update is enough as long as the shard key stays the same; avoid key changes.
- Scaling search with shards? Index per shard; federate search; link back to the canonical DB.
- Avoid shard drift? Automated audits, reconciliation, consistent hashing, and scripts.
- Is a global write leader possible? Only with high cost and latency; avoid it for OLTP.
Final: shard smart, operate relentlessly.
101) Read Scaling and Replica Topologies
- Per‑shard replicas for OLTP reads; regional replicas for locality
- Async replication with LSN/GTID tracking; client fencing for read‑your‑writes
- BI traffic routed to replicas; guardrails on query shapes and timeouts
102) Maintenance Windows and Change Freeze
- Define low‑traffic windows for resharding/schema changes
- Freeze policy before peak seasons; emergency change path with approvals
- Post‑change SLO checks and cost review; rollback triggers documented
103) Cross‑Shard Rate Aggregations
- Maintain per‑shard counters and snapshot to warehouse
- Route queries to warehouse for heavy analytics; avoid OLTP fan‑out
- Cache precomputed aggregates in KV with TTL; invalidate on writes
104) Shard Balancer Service
- Periodically evaluates shard load (QPS/storage/latency)
- Proposes moves for overloaded shards; simulates impact
- Executes with orchestration and safety rails; audit trail
105) Tenant Lifecycle
- Onboarding: choose shard by headroom and residency; warm caches
- Growth: detect hot tenants; split or isolate
- Off‑boarding: export data; archive; wipe secure; record evidence
106) Storage Engines and Indexing
- Postgres: btree for equality/range; BRIN for time‑series per partition
- MySQL: secondary indexes per partition; covering indexes for hot queries
- Cassandra: clustering order for time scans; SASI indexes sparingly
107) Connection Management
- Separate pools per shard; upper bounds to avoid stampedes
- Router reuses connections; circuit opens reduce pressure on failing shards
- Pool metrics per shard drive autoscaling and throttling
108) Query Governance
- Lint ORM queries; deny cross‑shard joins in hot endpoints
- Enforce WHERE shard_key = ? in API layer; hints validated
- Canary query changes; dashboards for latency deltas
109) Data Archival Strategy
- Cold partitions offloaded to object storage; pointer table keeps directory
- Archived data served via lakehouse/warehouse; OLTP hotset trimmed
- Retention policies per compliance; delete with evidence
110) Blue/Green Shard Nodes
- Replace primaries with blue/green node cutovers; replication catch‑up
- Validate perf; flip; drain old; repeat per shard
111) Backpressure and Load Shedding
- Token buckets per tenant; queue limits; 429 with Retry‑After
- Degrade non‑critical features first; partial responses for fan‑out
- Protect primaries during spikes; shift BI to replicas or cache
112) DR Drills and Evidence
- Quarterly shard‑level restore; record times and success criteria
- Region‑level failover rehearsal; DNS switch; RTO/RPO measured
- Evidence bundles with logs, times, approvers, and outcomes
113) Privacy and PII Handling
- Tokenize PII; separate key custodians; redact from logs/metrics
- Residency enforcement via directory; tests in CI
- Data subject requests: lineage to locate shards; secure delete with proof
114) Golden Dashboards per Service
{
"title": "Orders Service Shard Health",
"panels": [
{"type":"timeseries","title":"p95 by shard","targets":[{"expr":"histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket{service='orders'}[5m])) by (le, shard))"}]},
{"type":"table","title":"Hot Tenants","targets":[{"expr":"topk(20, rate(db_queries_total{service='orders',dimension='tenant'}[5m]))"}]}
]
}
115) Observability Tracing Conventions
- span.name: DB SELECT/INSERT/UPDATE/DELETE with semantic attrs
- attrs: shard_id, table, operation, rows, duration_ms, tenant_hash
- router spans: fanout_count, partial_return, failed_shards
116) SLOs and Error Budgets
- p95 latency per shard < 200ms @ 99% hours
- Error rate < 0.5% per shard; router 5xx < 0.1%
- Burn policies; freeze on 2× burn over 1h; rollback triggers
117) Documentation Standards
- Keys and routing rationale; shard count; ranges/hash; directory schema
- Runbooks for hot keys, resharding, router outage; DR plan
- Change logs with timings, SLO deltas, cost impacts
118) Platform Backlog
- Automate tenant move APIs; simulate moves; shard simulator UX
- Golden dashboards/alerts per service; router SDKs per language
- Cost panels by shard/tenant; throttling playbooks
119) Training and Tabletop Scenarios
- Hot tenant outbreak; reshard plan execution
- Router partial outage; failover to green; latency SLOs maintained
- Region failover; RTO/RPO met; evidence bundle created
120) Communication Templates
- Change notices for tenants affected by moves; ETA, risk, rollback
- Incident updates; blast radius, mitigation, next update time
Mega FAQ (2401–3000)
- How to prioritize tenants for moves? By impact (QPS/storage/latency) and residency; simulate the benefits.
- Router warm caches on deploy? Preload the directory and hot-tenant routes; bake in TTLs.
- Canary reshard? Move small tenants first; monitor SLO deltas; expand.
- Multi-tenant BI without overload? Dedicated replicas, query caps, pre-aggregates, and caches.
- Detect directory drift? Periodic audits and parity checks; alert on mismatches.
- Fan-out budgeting? A per-request cap; degrade gracefully; advise clients to refine filters.
- Long transactions hurting replicas? Shorter transactions, chunked writes, and replication-lag monitoring.
- Global ID generation? Per-shard Snowflake-style IDs; encode shard bits and time.
- Multi-region write strategy? Prefer a single writer per tenant; reconcile across regions offline.
- Secure router config? Signed, versioned configs; audit changes.
- Protect from ORM surprises? Disable implicit joins; use hints, lint in CI, and review code.
- Shard billing? Cost per tenant and shard; showback; throttle extremes.
- What if a shard corrupts? Promote a replica; restore; revalidate; run incident comms.
- Read starvation in scatter-gather? Fair scheduling and per-shard timeouts; partial returns.
- Top-K detection method? Heavy-hitters algorithms, sliding windows, z-score alerts.
- Why not a global cache only? You still need canonical writes; cache invalidation is complex.
- Avoid rehash storms? Consistent hashing with vnodes; partition-aware clients.
- Expand shard count live? Directory/consistent hashing and staged tenant moves.
- Auditing per tenant? Store shard and tenant in logs; WORM storage; access dashboards.
Final: sharding is an engineering product: own reliability, cost, and clarity.
121) Operational Maturity Model
Level 1: Ad‑hoc
- Manual routing, no directory, limited monitoring
Level 2: Basic Sharding
- Hash/directory, shard dashboards, simple moves
Level 3: Managed Sharding
- Tenant move APIs, simulators, golden dashboards/alerts, DR drills
Level 4: Autonomous Ops
- Auto‑balancer, hotkey mitigation, policy guardrails, cost panels
122) Error Budgets and SLO Packs
- p95 latency per shard (99.9% hours)
- Router 5xx < 0.1%
- Move RTO/RPO met within targets
- Burn‑rate alerts (fast/slow) and change freezes
123) Edge Cases and Special Topics
- GDPR Delete: lineage locate; shard‑scoped delete; evidence retained
- Tenant Merge/Split: directory updates; data copy; idempotent rewrites
- Rolling Region Migration: residency change; staged moves; dual‑write; audit
124) Example SQL and Pseudocode Library
-- Find top hot tenants on a shard
SELECT tenant_id, count(*) AS qps
FROM query_log
WHERE ts >= now() - interval '1 minute'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
// Router merge K sorted streams (keyset paging)
function mergeK(lists){ /* min‑heap by (ts,id); pop/push per list */ }
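A concrete version of the mergeK stub above: a k-way merge over per-shard keyset pages already sorted by (ts, id); a real router would swap the linear scan for a min-heap once K grows large:
interface Row { ts: number; id: number }

// K-way merge of per-shard pages, each sorted ascending by (ts, id).
export function mergeK<T extends Row>(lists: T[][], limit: number): T[] {
  const cursors = lists.map(() => 0)
  const out: T[] = []
  while (out.length < limit) {
    let best = -1
    for (let i = 0; i < lists.length; i++) {
      if (cursors[i] >= lists[i].length) continue
      const cand = lists[i][cursors[i]]
      if (best === -1) { best = i; continue }
      const cur = lists[best][cursors[best]]
      if (cand.ts < cur.ts || (cand.ts === cur.ts && cand.id < cur.id)) best = i
    }
    if (best === -1) break // all lists exhausted
    out.push(lists[best][cursors[best]++])
  }
  return out
}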
125) Team RACI and On‑Call
- Platform: routers, directory, shard tooling, observability
- Service Team: schema, queries, SLOs, on‑call, cost
- Security: access, residency, audits, incident response
126) Change Windows and Comms
- Planned reshard: notify impacted tenants; provide window and rollback
- Post‑change report: timings, SLO deltas, cost, incidents
127) Cost Dashboards (Expanded)
{
"title": "Shard Cost",
"panels": [
{"type":"timeseries","title":"$/Shard","targets":[{"expr":"sum by (shard) (db_cost_usd)"}]},
{"type":"table","title":"Top Tenants by $","targets":[{"expr":"topk(20, tenant_cost_usd)"}]}
]
}
128) Compliance Evidence Examples
{
"residency_tests": ["tests/residency_aws_us_east.json"],
"dr_drills": ["drills/region_failover_2025_10.json"],
"tenant_moves": ["moves/tenant_1234_2025_10_26.json"],
"backups": ["backups/shard_s7_2025_10_20.json"]
}
129) Glossary
- Shard: independent DB partition with own primaries/replicas
- Directory: mapping from entity/tenant to shard
- Fan‑out: scatter query to many shards; fan‑in merge results
- Saga: sequence of steps with compensations replacing global txns
130) References
- Vitess docs on resharding and VSchema
- Citus sharding and colocated tables
- Designing Data‑Intensive Applications (Kleppmann)
Mega FAQ (3001–3400)
- How do I prove residency? Automated tests per write path; the directory asserts the region; audit logs.
- Can I throttle per shard automatically? Yes: rate caps on the router based on shard p95 and queue depth.
- How to prevent hotspots during backfills? Throttle, schedule off-peak, and temporarily increase partitions.
- Should I pin tenants to hardware? If noisy or regulated; the directory stores placement; monitor fairness.
- Detect slow schema migrations? Track migration durations; alert on long locks or replication lag.
- Which joins are safe? Local shard joins by shard key; prohibit cross-shard joins in hot paths.
- How to audit tenant access? Log shard and tenant on every access; WORM storage; dashboards.
- Rehash vs directory expansion? A directory is surgical; rehashing moves many keys; prefer the directory for tenants.
- Avoid partial directory updates? Transactions and versioning; the router reads consistent snapshots.
- Why not a giant distributed SQL? Operational tradeoffs; OLTP sharding keeps the hot path simple.
- Can BI use OLTP directly? Route to replicas with caps; prefer ETL into analytics systems.
- Retry storms? Backoff with jitter, circuit breaks, fast errors, load shedding.
- Global throttling? Yes: a router aware of fleet state can degrade non-critical features.
- Do hash collisions matter? Negligible with good hashes and a ring; monitor the distribution.
- Schema per tenant? Costly; prefer a shared schema with a tenant key; per-tenant DBs only for strict isolation.
- Protect from accidental fan-out? Require shard hints; lint in CI; production guardrails on the router.
- Cold storage rehydrate? On demand via ETL; mark cold partitions; serve via the lakehouse.
- Token buckets per endpoint or tenant? Per tenant, plus per endpoint where hot; configurable.
- Mixed workloads (OLTP + OLAP)? Strict caps, resource isolation, and export to analytics ASAP.
Final: design with the shard key first, and build operational muscle.
Micro FAQ (3401–3420)
- Observability budget? Keep it under 5% of infra spend; prune labels and fan-out metrics.
- Post-move validation? Compare KPIs (p95, errors, cost) pre/post; run parity queries.
- Who approves a reshard? Platform plus the service owner, plus security for residency.
Final: shard decisions documented; runbooks rehearsed.