GraphQL Federation for Microservices and API Gateways (2025)
Federation enables teams to own subgraphs while exposing a unified API. This guide focuses on practical schema design and operations.
Executive summary
- Define clear ownership and boundaries; avoid cross-subgraph coupling
- Persisted queries and operation registry for safety/perf; cache where stable
- Monitor resolver costs; implement DDoS and complexity limits
Subgraphs and composition
- Entity references; keys; value types; composed schema pipelines
Gateway concerns
- AuthN/Z; caching; persisted queries; complexity analysis; timeouts/retries
Deployment
- Versioned subgraphs; canary; automated composition checks; contracts
Observability
- Trace resolvers; cost budgets; rate limiting per client/operation
FAQ
Q: When to choose federation vs monolithic GraphQL?
A: Federation for large orgs with clear domain ownership; monolith for small teams to avoid overhead.
1) What Is Federation?
- Split a single graph across subgraphs owned by domain teams
- Compose at a router/gateway into a single supergraph
2) Core Components
- Router/Gateway (Apollo Router/Federation, GraphQL Mesh, Helix, Mercurius)
- Subgraphs (domain services) exposing GraphQL schemas with federation directives
- Schema registry and composition pipeline
3) Entities and Keys
# Example subgraph entity
type User @key(fields: "id") {
  id: ID!
  name: String
}
4) Reference Resolution
// __resolveReference in subgraph (Ctx is the subgraph's request context type, defined elsewhere)
export const User = {
  __resolveReference(ref: { id: string }, ctx: Ctx) {
    return ctx.users.byId(ref.id);
  }
};
5) Composition and Contract
- Use schema registry (Apollo/GraphOS or open-source) to validate
- Contracts hide fields/types for specific clients
6) Query Planning
- Router splits query across subgraphs; stitches results
- Optimize with @requires, @provides, and proper entity boundaries
7) Preventing N+1
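// DataLoader pattern: create loaders per request so cached entries never leak across viewers
import DataLoader from "dataloader";
// batchGetUsers(ids) is assumed to return one result per id, in the same order as ids
const userLoader = new DataLoader((ids: readonly string[]) => batchGetUsers(ids));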
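A batch function is only safe if it returns exactly one entry per requested key, in key order. The sketch below shows one way to align unordered bulk-lookup rows to the keys; the `User` shape and the `alignToKeys` helper are hypothetical names used for illustration, not part of any library.

```typescript
// Hypothetical row shape returned by a bulk lookup (e.g. WHERE id IN (...)).
interface User {
  id: string;
  name: string;
}

// Align unordered rows to the requested keys.
// Missing rows become Error values instead of being dropped, so the
// result always has the same length and order as `keys`.
function alignToKeys(keys: readonly string[], rows: readonly User[]): Array<User | Error> {
  const byId = new Map<string, User>();
  for (const row of rows) byId.set(row.id, row);
  return keys.map((key) => byId.get(key) ?? new Error(`User not found: ${key}`));
}

// Usage: rows come back in arbitrary order and one id is missing.
const aligned = alignToKeys(["u2", "u9", "u1"], [
  { id: "u1", name: "Ada" },
  { id: "u2", name: "Lin" },
]);
```

Returning an `Error` per missing key keeps one bad id from failing the whole batch.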
8) Caching
- CDN for GET persisted queries
- Router cache for query plans and results (TTL + cache hints)
- Edge caches per viewer for personalization
9) APQ and Persisted Queries
- APQ reduces payload; persisted queries lock down operations
- Operation registry with safelist; block arbitrary queries in prod
10) AuthN/Z
- JWT/OAuth at edge; propagate identity to subgraphs via headers or context
- Field-level auth: directive-based checks; schema-policy integration
- Multi-tenant scoping: orgId in context; enforce in resolvers
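Identity propagation and tenant scoping can be sketched as two small helpers. This is a minimal illustration under assumptions: the header names, `forwardHeaders`, and `assertTenant` are hypothetical, and a real router would configure forwarding rather than hand-roll it.

```typescript
// Allowlist of headers the router forwards to subgraphs (names are illustrative).
const FORWARDED_HEADERS = new Set(["authorization", "x-tenant-id", "x-request-id"]);

// Copy only allowlisted headers, normalizing names to lower case.
function forwardHeaders(incoming: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(incoming)) {
    const lower = name.toLowerCase();
    if (FORWARDED_HEADERS.has(lower)) out[lower] = incoming[name];
  }
  return out;
}

// Subgraph-side check: reject rows that belong to another tenant,
// even if an upstream layer already validated the request.
function assertTenant<T extends { orgId: string }>(row: T, ctxOrgId: string): T {
  if (row.orgId !== ctxOrgId) throw new Error("PERMISSION: cross-tenant access denied");
  return row;
}

// Usage: the cookie is dropped, the identity headers are kept.
const forwarded = forwardHeaders({
  Authorization: "Bearer token",
  "X-Tenant-Id": "t1",
  Cookie: "session=abc",
});
```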
11) Rate Limiting and Abuse
- Edge/L7 limits per token/IP; router-level query cost budgets
- Subgraph local limits for hotspots
12) Complexity and Depth Limits
// cost map per type/field; block expensive operations; per-tenant budgets
13) Errors and Retries
- Distinguish user vs system errors; partial data with errors array
- Retry idempotent subgraph calls with backoff; circuit breakers on failures
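The retry rule above can be sketched as a retry predicate plus a jittered backoff. The defaults (100 ms base, 2 s cap) and the function names are assumptions for illustration; treating only queries as retryable is a simplification that ignores idempotency keys.

```typescript
type OperationType = "query" | "mutation" | "subscription";

// Only queries are retried here; mutations are assumed non-idempotent
// unless the caller handles idempotency keys separately.
function isRetryable(op: OperationType, attempt: number, maxAttempts: number): boolean {
  return op === "query" && attempt < maxAttempts;
}

// Exponential backoff with full jitter: a random delay in
// [0, min(capMs, baseMs * 2^attempt)). `random` is injectable for tests.
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 2000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Usage: with random() pinned at 0.5, each delay is half of its ceiling.
const delays = [0, 1, 2, 3, 4, 5].map((n) => backoffDelayMs(n, 100, 2000, () => 0.5));
```

Jitter spreads retries out so that many clients failing together do not retry in lockstep.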
14) Observability
- OpenTelemetry traces: edge → router → subgraphs; include operation names
- Metrics: p95 latency, error rate, cache hit, planner misses, subgraph fanout
- Logs: redacted variables, request IDs, client name/version
15) Schema Governance
- PR checks: composition, breaking changes, deprecations
- Contract tests: consumer-driven; example queries validated
16) CI/CD Flow
- Subgraph schema lint → publish to registry → compose → router rollout
- Canary new composition; rollback on composition or runtime regressions
17) Router Config (Sketch)
[supergraph]
subgraphs = ["users", "orders", "catalog"]
[cors]
origins = ["https://app.example.com"]
[telemetry]
otlp = { endpoint = "otel:4317" }
[caching]
plan_ttl = "5m"
18) Subgraph Examples
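# Orders subgraph: contributes `orders` to the User entity defined in the users subgraph.
# Key fields are always available to the resolver, so @requires is not needed for "id".
extend type User @key(fields: "id") {
  id: ID! @external
  orders: [Order]
}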
19) Dataloaders at Subgraphs
export const resolvers = {
  Query: { user: (_, { id }, ctx) => ctx.users.byId(id) },
  User: { orders: (u, _, ctx) => ctx.orders.byUserIds.load(u.id) }
};
20) Edge Caching with CDNs
- GET persisted queries only
- Vary on auth/tenant headers if necessary; use cache hints
21) Security
- Disallow introspection in prod for public clients; allow for trusted admin
- Input validation; query cost ceilings; depth limits; timeouts
- PII minimization; field-level encryption where necessary
22) Versioning and Deprecations
- Avoid versioning the graph; use deprecations and contracts
- Remove after deprecation window with usage telemetry
23) Migration Playbooks
- REST → GraphQL: facade router; strangler pattern; move domains incrementally
- Monolith → subgraphs: extract entities one domain at a time; measure latency
24) Testing Strategy
- Unit: resolvers and policies
- Integration: router + subset of subgraphs with mock HTTP
- Contract: example operations validated against composed schema
- E2E: critical app flows with persisted queries
25) Failure Modes
- Subgraph outage: partial data; error isolation; fallback content
- Composition failure: block rollout; alert; fix schema conflicts
- Hot path blowups: query cost guard; cache; paginate
26) Templates and Repos
subgraphs/users
  schema.graphql
  src/
subgraphs/orders
router/
  supergraph.graphql (generated)
  router.toml
27) Mega FAQ (1–400)
Q: Should I start with federation?
A: No. Start with a single graph; federate when team and domain boundaries are clear.
Q: How many subgraphs?
A: As many as team ownership and latency allow; avoid excessive fanout.
Q: N+1 in router or subgraph?
A: Fix in subgraph with dataloaders; router should not hide subgraph inefficiency.
Q: Do I need a registry?
A: Yes, for safe composition, contracts, and usage insights.
...
JSON-LD
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "GraphQL Federation and API Gateway (2025)",
"description": "Production-grade guide to GraphQL federation: routers, subgraphs, composition, caching, security, observability, and operations.",
"datePublished": "2025-10-28",
"dateModified": "2025-10-28",
"author": {"@type":"Person","name":"Elysiate"}
}
</script>
Appendix A — Supergraph Architecture
- Router as a stateless layer; horizontal scale; fast startup; hot reload of supergraph
- Subgraphs own domain models; strict boundaries; clear entity ownership
- Registry for composition and contracts; CI gates for safe rollout
Appendix B — Router Configuration Patterns
# Router config (illustrative sketch: key names are placeholders, and Apollo Router itself is configured with YAML)
[supergraph]
listen = "0.0.0.0:4000"
[cors]
origins = ["https://app.example.com"]
allow_credentials = true
[headers]
forward = ["authorization", "x-tenant-id", "x-request-id"]
[telemetry]
exporter = "otlp"
endpoint = "otel:4317"
[caching]
plan_cache_ttl = "10m"
result_cache_ttl = "30s"
[timeouts]
overall = "3s"
subgraph = "2.5s"
Appendix C — Composition Workflow and Registry
- Subgraph PR: schema lint → publish to registry as draft → composition check
- Contract variants per client → hide internal fields and federated joins
- Composition promotion: canary router loads new supergraph; observe KPIs
Appendix D — Caching and Hints
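# Cache control hints (Apollo Server-style @cacheControl)
# A default max age (e.g. 60s) is set in the server's cache-control configuration, not with a schema directive
type Product @key(fields: "id") {
  id: ID!
  name: String @cacheControl(maxAge: 300)
  price: Money @cacheControl(scope: PRIVATE)
}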
- Edge cache GET persisted queries; vary on viewer keys if needed
- Router result cache + per-field cache hints; subgraph-level HTTP caching
Appendix E — Operation Registry and Persisted Queries
- Safelist persisted operations; block ad-hoc queries in prod
- Client name/version required; roll out new ops via registry publish
- CDN stores GET /?hash=... with long TTL + revalidation
Appendix F — Query Cost and Depth Guards
type CostMap = Record<string, number>; // type.field → cost
function estimateCost(selectionSet: any, costMap: CostMap): number {
  // Walk AST; sum costs; account for list multipliers
  return 0; // impl omitted
}
// Reject if cost > tenantBudget or depth > maxDepth
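One possible shape for the omitted walk is sketched below. It uses a simplified `Selection` tree instead of the real GraphQL AST, and `listSize`, `depthOf`, and `isAllowed` are hypothetical names; a production guard would read page-size arguments and fragments from the parsed document.

```typescript
// Simplified selection tree; a real implementation would walk the GraphQL AST.
interface Selection {
  field: string;       // "Type.field" key into the cost map
  listSize?: number;   // assumed page size when the field returns a list
  children?: Selection[];
}

type CostMap = Record<string, number>;

// Cost = own cost + children cost, with children multiplied by the list size.
function estimateCost(sel: Selection, costMap: CostMap, defaultCost = 1): number {
  const own = costMap[sel.field] ?? defaultCost;
  const childCost = (sel.children ?? []).reduce(
    (sum, child) => sum + estimateCost(child, costMap, defaultCost),
    0,
  );
  return own + (sel.listSize ?? 1) * childCost;
}

// Depth of the deepest selection path (a leaf has depth 1).
function depthOf(sel: Selection): number {
  const childDepths = (sel.children ?? []).map((child) => depthOf(child));
  return 1 + (childDepths.length > 0 ? Math.max(...childDepths) : 0);
}

function isAllowed(sel: Selection, costMap: CostMap, budget: number, maxDepth: number): boolean {
  return estimateCost(sel, costMap) <= budget && depthOf(sel) <= maxDepth;
}

// Usage: one user with a page of 20 orders, two fields per order.
const sampleOp: Selection = {
  field: "Query.user",
  children: [
    { field: "User.name" },
    {
      field: "User.orders",
      listSize: 20,
      children: [{ field: "Order.id" }, { field: "Order.total" }],
    },
  ],
};
const sampleCosts: CostMap = {
  "Query.user": 1, "User.name": 0, "User.orders": 5, "Order.id": 0, "Order.total": 2,
};
const sampleCost = estimateCost(sampleOp, sampleCosts); // 1 + (0 + (5 + 20 * 2)) = 46
```

The list multiplier is what makes a shallow query expensive, which is why depth alone is not a sufficient guard.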
Appendix G — Auth Patterns
- Edge: verify JWT; attach claims to context; reject expired/invalid
- Router: propagate headers; optional field-level policies via directives
- Subgraph: enforce domain authorization; never rely solely on router checks
Appendix H — Multi-Tenancy
- tenantId in context; row-level filters in subgraphs
- Cache segmentation by tenant; rate limits per tenant and client
Appendix I — Schema Contracts and Variants
- Base supergraph for internal; contract variants for mobile/web/partners
- Hide unstable fields; deprecation windows tracked via usage metrics
Appendix J — Consistency and Latency
- Eventual consistency across subgraphs; avoid cross-service transactions
- Use entity views; subscribe or poll for refresh; expose updatedAt
Appendix K — Subscriptions and Real-Time
# Router passes through websocket; subgraphs stream events
type Subscription {
orderUpdated(id: ID!): OrderEvent!
}
- Prefer server push for high-value events; throttle rate; backpressure
Appendix L — Defer/Stream and Batching
query ProductPage {
  product(id: "p1") {
    id
    name
    # @defer applies to fragments (inline or spread), not to operations or fields
    ... @defer {
      reviews { body rating }
    }
  }
}
- Stream lists; defer expensive fields; reduce TTFB
- Batch subgraph calls with Dataloaders; coalesce per-request
Appendix M — Retries, Timeouts, and Circuit Breaking
- Subgraph HTTP timeouts < router overall timeout; retry idempotent GETs
- Circuit break noisy subgraphs; partial responses with error annotations
- SLOs: p95 < 200ms per subgraph; error rate < 1%
Templates
# Kubernetes deployment for router (sketch; the name and labels are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
spec:
  replicas: 4
  selector:
    matchLabels: { app: router }
  template:
    metadata:
      labels: { app: router }
    spec:
      containers:
        - name: router
          image: ghcr.io/org/router:1.0.0
          ports: [{ containerPort: 4000 }]
          env:
            - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: "http://otel:4317" }
            - { name: SUPERGRAPH_FILE, value: /etc/supergraph.graphql }
Appendix N — Federation v2 Directives and Patterns
# Common directives: @key, @requires, @provides, @shareable, @inaccessible
# Example (key fields such as "id" never need @requires)
type Product @key(fields: "id") {
  id: ID!
  name: String @shareable
  seller: Seller
}
- Prefer @key on natural identifiers when stable; otherwise synthetic IDs
- Use @requires to declare external (non-key) fields that must be fetched from another subgraph before a local field can be resolved
- Limit @provides to clear, stable contracts; avoid tight coupling
Appendix O — Entity Design
- One owning subgraph per entity; others extend only
- Keep entities small; expose views for heavy aggregates
- Avoid cyclic ownership; model via references and views
Appendix P — Router Resilience
- Timeouts per subgraph; overall request time budget
- Circuit breakers on error spikes; partial responses with error nodes
- Backoff retries only for idempotent fetches; never for mutations
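A per-subgraph circuit breaker can be sketched as a small state machine. The threshold, cooldown, and class name below are illustrative assumptions, and the clock is injected so the behavior is deterministic; real routers usually ship their own traffic-shaping features.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private state: BreakerState = "closed";
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true when a request to the subgraph may be attempted.
  canRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow probe requests after the cooldown
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  // A failed probe, or too many consecutive failures, (re)opens the breaker.
  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}

// Usage with a fake clock: three failures open the breaker, the cooldown allows a probe.
let fakeNow = 0;
const ordersBreaker = new CircuitBreaker(3, 1000, () => fakeNow);
ordersBreaker.recordFailure();
ordersBreaker.recordFailure();
ordersBreaker.recordFailure();
const blockedWhileOpen = !ordersBreaker.canRequest();
fakeNow = 1000;
const probeAllowed = ordersBreaker.canRequest();
```

While the breaker is open, the router can return partial data with an error node instead of waiting on the failing subgraph.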
Appendix Q — Caching and CDN Strategy
- Persisted GET only at edge; Vary by auth/tenant
- Router result cache keyed by operation + variables + viewer scope
- Use cache hints; private scope for user data; short TTLs for hot paths
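A result cache key built from operation, variables, and viewer scope might look like the sketch below. The scope strings ("tenant:t1") and the helper names are assumptions; the point is that variables are serialized canonically so key order does not split the cache, and that the viewer scope is always part of the key.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize JSON with object keys sorted, so {a:1,b:2} and {b:2,a:1} match.
function canonicalJson(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  const obj = value;
  const entries = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k] ?? null)}`);
  return `{${entries.join(",")}}`;
}

interface CacheKeyInput {
  operationHash: string;                 // persisted-query hash or operation id
  variables: { [key: string]: Json };
  viewerScope: string;                   // e.g. "public", "tenant:t1" (assumed scheme)
}

function resultCacheKey(input: CacheKeyInput): string {
  return [input.viewerScope, input.operationHash, canonicalJson(input.variables)].join("|");
}

// Usage: same variables in a different order share a key; another tenant does not.
const keyA = resultCacheKey({ operationHash: "abc123", variables: { id: "p1", first: 10 }, viewerScope: "tenant:t1" });
const keyB = resultCacheKey({ operationHash: "abc123", variables: { first: 10, id: "p1" }, viewerScope: "tenant:t1" });
const keyC = resultCacheKey({ operationHash: "abc123", variables: { id: "p1", first: 10 }, viewerScope: "tenant:t2" });
```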
Appendix R — Persisted Queries Enforcement
- Operation registry with safelist; block unknown hashes in prod
- Deployment gates: router loads only approved op set per client version
Appendix S — Auth and Tenancy Patterns
- Edge verifies JWT; attach tenantId, roles
- Subgraphs enforce domain auth; row filters by tenantId
- Field-level directives for sensitive data; audit access
Appendix T — Subscriptions, Defer, and Stream
- Subscriptions for high-value events; throttle and backpressure
- Defer/Stream: reduce TTFB by streaming non-critical fields/lists
- Clients must handle incremental payloads robustly
Appendix U — Security
- Disable introspection for public clients; enforce query cost/depth
- Sanitize errors; redact variables in logs; input validation
- SSRF prevention in subgraphs; outbound egress allowlists
Appendix V — Observability Deep Dive
- Trace IDs from edge; router span names = operation:client
- Attributes: subgraph, planStep, costEstimate, cacheHit
- Metrics: planner_miss, fanout, subgraph_p95, error_rate, cache_hit_rate
Appendix W — CI/CD Workflows
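name: federation
on:
  pull_request:
  push:
    branches: [main]
jobs:
  check-subgraph:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint:schema && npm run test
  publish-schema:
    # Publish only from main; pull requests stop after the checks above
    if: github.ref == 'refs/heads/main'
    needs: check-subgraph
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes APOLLO_KEY is provided to the job as a secret
      - run: npx -p @apollo/rover rover subgraph publish org@current --name users --schema ./schema.graphql
  compose-and-canary:
    needs: publish-schema
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # supergraph.yaml (the composition config) is an assumed path
      - run: npx -p @apollo/rover rover supergraph compose --config ./supergraph.yaml > supergraph.graphql
      - run: ./deploy_router_canary.sh supergraph.graphql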
Appendix X — Testing Strategy Details
- Router integration: mock subgraphs via HTTP fixtures
- Subgraph integration: real DB containers; dataloaders; auth hooks
- Contract: example ops validated against composed graph in CI
- Load tests: persisted ops with realistic variable distributions
Appendix Y — Migration Playbooks
- Monolith → subgraphs: extract entity at a time; proxy unknown fields to monolith
- REST → graph: facade subgraph translating to REST; deprecate endpoints gradually
Appendix Z — Examples
# `total` is computed locally but needs `currency`, which another subgraph resolves
extend type Order @key(fields: "id") {
  id: ID! @external
  currency: String @external
  total: Money @requires(fields: "currency")
}
Appendix AA — Gateway Resiliency Patterns
- Timeouts per subgraph; hedged requests sparingly; circuit breakers
- Partial responses for non-critical fields; user-facing fallbacks
- Bulkhead isolation: limit concurrency per subgraph
- Brownout mode: omit expensive fields under load via @defer/@stream
Appendix AB — Request Shaping and Query Hints
- Encourage clients to request minimal sets; provide profiles (lite/full)
- Use named fragments to standardize shapes; registry validates usage
Appendix AC — CDN Integration
- Only persisted GET at edge; POST → router; vary headers: auth, tenant, locale
- Stale-while-revalidate for semi-static fields; purge on mutations
Appendix AD — Planner Optimization
- Coalesce entity fetches; prefer fewer hops; align @requires with data model
- Avoid cross-subgraph fanout on hot paths; denormalize via @provides where safe
Appendix AE — Error Taxonomy
- USER_INPUT, AUTH, PERMISSION, NOT_FOUND, RATE_LIMIT
- SYSTEM, TIMEOUT, UPSTREAM, PARTIAL_DATA, UNKNOWN
- Map consistently at router; reserve extensions for machine handling
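A router-side mapping onto this taxonomy could be sketched as follows. The `UpstreamError` shape and the HTTP status mapping are assumptions chosen for illustration; PARTIAL_DATA is left out because it describes a whole response rather than a single upstream failure.

```typescript
type ErrorCode =
  | "USER_INPUT" | "AUTH" | "PERMISSION" | "NOT_FOUND" | "RATE_LIMIT"
  | "SYSTEM" | "TIMEOUT" | "UPSTREAM" | "PARTIAL_DATA" | "UNKNOWN";

// Hypothetical shape of an error surfaced by a subgraph HTTP call.
interface UpstreamError {
  httpStatus?: number;
  code?: string; // e.g. socket-level codes such as "ECONNABORTED"
}

function classify(err: UpstreamError): ErrorCode {
  if (err.code === "ECONNABORTED" || err.code === "ETIMEDOUT" || err.httpStatus === 504) {
    return "TIMEOUT";
  }
  switch (err.httpStatus) {
    case 400: return "USER_INPUT";
    case 401: return "AUTH";
    case 403: return "PERMISSION";
    case 404: return "NOT_FOUND";
    case 429: return "RATE_LIMIT";
    case 502:
    case 503: return "UPSTREAM";
  }
  if (err.httpStatus !== undefined && err.httpStatus >= 500) return "SYSTEM";
  return "UNKNOWN";
}

// User-class errors can be shown to clients; system-class errors get a generic message.
const USER_CODES = new Set<ErrorCode>(["USER_INPUT", "AUTH", "PERMISSION", "NOT_FOUND", "RATE_LIMIT"]);
function isUserError(code: ErrorCode): boolean {
  return USER_CODES.has(code);
}
```

The resulting code would go into `extensions.code` so clients can branch on it without parsing messages.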
Appendix AF — Security and Privacy
- PII minimization; access logs redacted; privacy review for new fields
- Threat model: query abuse, caching leaks, introspection misuse
Appendix AG — Subgraph Boundaries
- Owners maintain single source of truth; avoid duplicate business logic
- Shared libraries for auth/context only; no cross-domain data coupling
Appendix AH — Contracts and Variants
- Variant per client/app version; hide unstable fields
- Deprecate with usage telemetry; remove after SLO window
Appendix AI — Operational Budgets
- Router p95 < 120ms; planner miss rate < 1%; cache hit > 60%
- Subgraph p95 < 200ms; error < 1%; fanout < 5 per op median
Appendix AJ — Multi-Region and Failover
- Global router with geo routing; local subgraphs per region
- Cross-region failover for reads; write affinity home region
Appendix AK — DataLoader Playbook
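// Keyed by viewer scope when necessary to avoid cache bleed.
// toViewerScopedKey is a hypothetical helper, e.g. (id) => `${viewerId}:${id}`, closing over the current viewer.
const byId = new DataLoader(ids => batch(ids), { cacheKeyFn: toViewerScopedKey });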
Appendix AL — Persisted Operations Rollout
- Phase 1: log-only unknown ops; Phase 2: warn; Phase 3: block
- Emergency allowlist for break-glass ops with expiry
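The phased rollout can be expressed as a single decision function. This is a sketch under assumptions: the `Decision` values, the break-glass entry shape, and treating emergency entries as "allow and log" are illustrative choices, not a registry API.

```typescript
type Phase = "log-only" | "warn" | "block";
type Decision = "allow" | "allow-and-log" | "allow-and-warn" | "reject";

interface BreakGlassEntry {
  hash: string;
  expiresAtMs: number; // emergency entries must carry an expiry
}

function decide(
  hash: string,
  safelist: ReadonlySet<string>,
  phase: Phase,
  breakGlass: readonly BreakGlassEntry[],
  nowMs: number,
): Decision {
  if (safelist.has(hash)) return "allow";
  // Unexpired emergency entries pass, but are always logged for follow-up.
  const emergency = breakGlass.some((e) => e.hash === hash && e.expiresAtMs > nowMs);
  if (emergency) return "allow-and-log";
  if (phase === "log-only") return "allow-and-log";
  if (phase === "warn") return "allow-and-warn";
  return "reject";
}

// Usage: "h9" is an emergency operation that expires at t = 1000 ms.
const registeredOps = new Set(["h1", "h2"]);
const emergencyOps: BreakGlassEntry[] = [{ hash: "h9", expiresAtMs: 1000 }];
const duringIncident = decide("h9", registeredOps, "block", emergencyOps, 500);
const afterExpiry = decide("h9", registeredOps, "block", emergencyOps, 1000);
```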
Appendix AM — Complexity Budgeting
- Per-tenant budgets; VIP tiers; adjust by concurrency and historical usage
- Dynamic budgets during incidents to shed load safely
Appendix AN — Subscriptions Topology
- Shared websocket layer; subgraphs via event bus; backpressure and limits
Appendix AO — Defer/Stream UX
- Show skeleton for deferred parts; stabilize layout to avoid CLS
Appendix AP — Observability: PromQL
# Router error rate
sum(rate(router_requests_total{status=~"5.."}[5m])) / sum(rate(router_requests_total[5m]))
# Subgraph p95
histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph))
# Planner miss
sum(rate(router_planner_miss_total[5m]))
Appendix AQ — Grafana Dashboard (Sketch JSON)
{
"title": "Supergraph Overview",
"panels": [
{"type":"stat","title":"Router p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
{"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
]
}
Appendix AR — Alert Library (YAML)
- alert: RouterHighErrorRate
  expr: (sum(rate(router_requests_total{status=~"5.."}[5m])))/(sum(rate(router_requests_total[5m]))+1e-9) > 0.02
  for: 10m
- alert: SubgraphLatencyP95High
  expr: histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph)) > 0.300
  for: 10m
Appendix AS — CI Gates
- composition: fail on breaking changes and invalid references
- contracts: block removal of contracted fields used by clients
- persisted ops: require registry publish before router deploy
Appendix AT — Example Repos
repos/
  users-subgraph/
  orders-subgraph/
  catalog-subgraph/
  supergraph-router/
  contracts/
Appendix AU — Case Study: Checkout
- Query fetches cart (catalog), user (users), prices/tax (pricing), inventory
- Defer reviews and recommendations
- Brownout: drop recommendations under load; preserve the core checkout path
Appendix AV — Error Handling Examples
// Router error mapping
if (err.code === 'ECONNABORTED') classify('TIMEOUT');
Appendix AW — Pagination and Connections
type Connection { edges: [Edge!]!, pageInfo: PageInfo! }
Appendix AX — Federation with Mesh/Mercurius
- Mesh to unify REST/GraphQL/gRPC into graph; use as subgraph source
- Mercurius Federation v2 for Node-based subgraphs
Appendix AY — Security: SSRF, RCE, Injection
- Strict outbound allowlists; sanitize URLs; avoid dynamic eval; validate inputs
Appendix AZ — Runbooks
- Spike in 5xx: identify subgraph; circuit break; reduce query budgets; rollback
- Planner miss surge: warm caches; review op churn; deploy plan cache increase
- Cache stampede: introduce jitter TTL; precompute hot paths
Mega FAQ (401–1000)
Q: Do we encrypt at field level?
A: For highly sensitive fields; store ciphertext; decrypt at viewer edge.
Q: Should we block introspection?
A: For public clients, yes; allow for trusted admin behind auth.
Q: How to keep router hot?
A: Warm plan and result caches; pin CPU; avoid GC pauses; keep binaries slim.
Q: Why do we see N+1 at router?
A: It's subgraph N+1 surfacing; fix with dataloaders and batch APIs.
Q: Pagination best practice?
A: Connections with opaque cursors; avoid offset for large datasets.
Q: Can we stream errors?
A: Use incremental delivery; attach errors to path; keep UX stable.
Q: How to phase out a subgraph?
A: Stop extensions; move ownership; mark deprecated; remove after usage=0.
Q: Query cost vs depth?
A: Use cost map; depth alone is insufficient.
Q: Multi-tenant guardrails?
A: Per-tenant budgets, rate limits, caches; audit access.
Final: prefer clarity, constrain costs, measure constantly.
Appendix BA — Client Guidance and Operation Hygiene
- Use named operations and fragments; include client name/version headers
- Prefer persisted queries; avoid ad-hoc POST in production
- Request minimal fields; defer non-critical sections; paginate large lists
- Consider offline caching and stale-while-revalidate patterns
Appendix BB — Router Horizontal Scaling
- Stateless router replicas behind L4; sticky only for websockets if needed
- Preload supergraph; watch registry for changes; SIGHUP or hot-reload
- Limit max concurrent per replica; tune threadpool/event-loop
Appendix BC — Supergraph Lifecycle
- Draft composition → canary → 25% → 100%; abort on error budget burn
- Rollback by pinning previous supergraph; invalidate router caches
- Track composition ID, commit SHAs, owner, change ticket
Appendix BD — Mutations and Side-Effects
- Keep mutations in owning subgraph; avoid multi-subgraph transactional semantics
- Emit domain events; eventual consistency for projections
- Idempotency keys for retries; return stable mutation payload shapes
Appendix BE — File Uploads and Binary Data
- Prefer signed URLs via REST; graph returns metadata and URLs
- Limit upload size; virus/malware scanning; audit trails
Appendix BF — Partial Data UX
- Render available fields with skeletons; show inline warnings for gaps
- Avoid blocking primary flows on optional subgraph outages
Appendix BG — Schema Style Guide
- Nouns for types, verbs for mutations; consistent pagination
- Use ISO-8601 timestamps; Money type with currency; IDs opaque
- Avoid leaking storage/DB fields; present domain-centric names
Appendix BH — Federation with Legacy Backends
- Subgraph as facade: translate to REST/gRPC/DB; cache hot lookups
- Stabilize latency with bulk endpoints; prefer batched loaders
Appendix BI — Error Extensions Contract
{
"errors": [
{
"message": "Not authorized",
"extensions": {
"code": "PERMISSION",
"subgraph": "orders",
"requestId": "..."
},
"path": ["order", "total"]
}
]
}
Appendix BJ — CORS and Edge Security
- Strict origins; credentials only when needed; same-site cookies preferred
- HSTS, CSP; block mixed content; validate referer/origin on sensitive ops
Appendix BK — Subgraph Read Models
- Pre-compute aggregates; materialized views; denormalize sparingly
- Keep entity ownership; expose read-friendly fields
Appendix BL — Safe Defaults
- Limit depth to 10; cost budget per op; request timeout 3s
- Disable schema introspection for public clients; enable for admin
Appendix BM — Canary Recipes
- Shadow traffic to new subgraph version; compare error/latency
- Gradual router rollout of new supergraph; per-client cohort testing
Appendix BN — Edge Authorization Tokens
- Short-lived JWTs; rotate keys; anti-replay; audience and scope validated
Appendix BO — Latency Budgets
- Router < 120ms p95; subgraph < 200ms p95 each; cumulative budget by path
- Automatic brownouts when over budget: defer/omit optional fields
Appendix BP — SDL Lint Rules
- Disallow null list elements; consistent naming; no dangling types
- Require descriptions on public fields; deprecations include rationale
Appendix BQ — Plan Cache Warmers
- Periodically execute top persisted queries to warm plan/result caches
- Bust on composition changes; record hit/miss metrics
Appendix BR — Multi-Tenant Abstractions
- Tenant-aware dataloaders; tenant-scoped caches; tenant quotas and limits
Appendix BS — Backpressure and Load Shedding
- Queue caps; 429 with Retry-After; degrade non-critical fields
Appendix BT — Secrets and Config
- Router and subgraphs via secure env vars or secrets; no secrets in SDL
- Rotate credentials regularly; audit access
Appendix BU — Infra Templates (Kubernetes)
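apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: router
spec:
  # Target name is a placeholder; point it at the router Deployment
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: router }
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }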
Appendix BV — Gateway Feature Flags
- Enable/disable features by client or cohort; test in prod safely
Appendix BW — Data Privacy Regions
- Route EU users to EU subgraphs; ensure data residency with policy
Appendix BX — Operator Runbooks
- Router CPU spike → check plan miss, request surge, cache stampede
- Subgraph 5xx spike → circuit break, page owner, roll back latest change
Appendix BY — Docs and Developer Experience
- Self-serve SDL portal; examples; persisted op explorer; performance tips
Appendix BZ — Cost Controls
- Cache hot reads; batch; avoid over-fetch; suppress heavy optional fields
Observability: Panel JSON (Sketch)
{
"title": "Federation Health",
"panels": [
{"type":"stat","title":"Router p95"},
{"type":"table","title":"Subgraph 5xx by svc"},
{"type":"graph","title":"Plan cache hit"}
]
}
Policies (Pseudo-OPA)
package graph.policy
violation["no_latest_images"] {
  input.kind == "Deployment"
  # := already declares c, so a separate `some c` declaration is not needed
  c := input.spec.template.spec.containers[_]
  endswith(c.image, ":latest")
}
Templates: Persisted Ops Enforcement (Edge)
location /graphql {
  # The edge only restricts the method; the router must still reject unknown hashes
  if ($request_method = GET) { # persisted-query GETs only
    proxy_pass http://router;
  }
  if ($request_method = POST) { return 405; }
}
Case Study — Profile Page
- User (users), purchases (orders), recommendations (recs)
- Defer recs; cache user; stream purchases
Mega FAQ (1001–1200)
Q: Do we allow arbitrary queries in prod?
A: No. Persisted queries only for public clients; log/deny unknown.
Q: How big should a subgraph be?
A: Team-sized, domain-aligned; avoid too fine-grained fragmentation.
Q: Who owns entity keys?
A: The entity's owning subgraph; others must not redefine keys.
Q: Can we have cross-DB transactions?
A: Avoid; use events and compensation.
Q: Should we expose IDs or slugs?
A: Opaque IDs for references; slugs as fields where useful.
Mega FAQ (1201–1400)
Q: Why are depth limits insufficient?
A: Low-depth fields can be expensive; use cost maps.
Q: Can router rewrite queries?
A: Prefer not; only for technical normalization; never change semantics.
Q: Where to authorize?
A: At edge for identity, in subgraphs for domain constraints.
Q: Can we cache POST?
A: Not at CDN; router internal caches okay with safe keys.
Q: Should we enable introspection?
A: For trusted tools; disable for public paths.
Mega FAQ (1401–1600)
Q: How to sunset a field?
A: Deprecate, track usage, communicate, remove after window.
Q: When to use @provides?
A: When one subgraph can reliably supply a field owned by another with clear contract.
Q: Handle heavy lists?
A: Paginate, limit, and stream; precompute where possible.
Q: Why plan cache misses?
A: New ops or variable shapes; warm caches; standardize client fragments.
Mega FAQ (1601–1800)
Q: Should we allow mutations across subgraphs?
A: Keep in single owner; orchestrate via events.
Q: Rate limit at edge or router?
A: Edge for coarse limits; router for per-op budgets.
Q: Can we run router at the edge?
A: Yes with wasm/native binaries; test cold starts and limits.
Final: keep graphs lean, plans hot, and costs bounded.
Appendix CA — Gateway Deployment Patterns
- Sidecar telemetry; init to fetch supergraph; read-only FS; non-root user
- Blue/green router rollout; health gates on p95 and error rate
- Multi-tenant router fleets for isolation and quota enforcement
Appendix CB — Supergraph Delivery
- Signed supergraph bundles; checksum verification; fallback to last-good
- Progressive rollout by region and client cohort
Appendix CC — Schema Lint Pack
- Require descriptions; ban ID-as-Int; enforce Money scalar usage
- Disallow nullable lists of non-nullable elements unless justified
Appendix CD — Router Warmup and Preload
- Preload top persisted ops; synthetic traffic warmers; plan cache hydrate
Appendix CE — Hot Paths Catalog
- Identify top 20 operations by volume; publish budgets and owners
- Quarterly review of hot path latency and cost
Appendix CF — Error Budgets and Freeze
- Burn >2x for 60m: freeze router and subgraph deploys; focus on recovery
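One reading of this rule is "burn rate above 2 for a sustained 60 minutes". The sketch below assumes that interpretation, one burn-rate sample per minute, and hypothetical function names; the SLO target is passed in rather than taken from this guide.

```typescript
// Burn rate = observed error ratio / allowed error ratio (1 - SLO target).
// A value of 1 means the budget is being consumed exactly on schedule.
function burnRate(errors: number, total: number, sloTarget: number): number {
  if (total === 0) return 0;
  return errors / total / (1 - sloTarget);
}

// Assumes one sample per minute; freeze when every sample in the
// most recent window exceeds the threshold.
function shouldFreeze(samples: readonly number[], threshold = 2, windowMinutes = 60): boolean {
  if (samples.length < windowMinutes) return false;
  return samples.slice(-windowMinutes).every((rate) => rate > threshold);
}

// Usage: 30 errors in 10,000 requests against a 99.9% SLO burns roughly 3x.
const currentBurn = burnRate(30, 10000, 0.999);
const sustained = new Array<number>(60).fill(currentBurn);
const freezeDeploys = shouldFreeze(sustained);
```

Requiring the whole window to exceed the threshold avoids freezing deploys on a short spike.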
Appendix CG — Redaction and Privacy Tests
- Unit tests: variables redacted; logs scrubbed; traces tagged without PII
Appendix CH — Client Contract Testing
- Validate example queries per client against composed graph; diff on changes
Appendix CI — Schema Change Classes
- Safe: new types/fields with defaults; new queries
- Risky: required input fields; enum removals; field removals
- Block: breaking changes without deprecation window and 0 usage
Appendix CJ — Router Rate Limiting
- Token bucket per client and tenant; budgets per operation class
- 429 with Retry-After; degrade optional fields under pressure
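A token bucket per client and tenant can be sketched as below. Capacity, refill rate, and the `tenantId:clientName` key scheme are assumptions; time is passed in explicitly so the example stays deterministic, and `cost` lets expensive operation classes spend more tokens.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Try to spend `cost` tokens. On refusal, report seconds to wait (for Retry-After).
  take(cost: number, nowMs: number): { allowed: boolean; retryAfterSeconds: number } {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return { allowed: true, retryAfterSeconds: 0 };
    }
    const deficit = cost - this.tokens;
    return { allowed: false, retryAfterSeconds: Math.ceil(deficit / this.refillPerSecond) };
  }
}

// One bucket per tenant + client pair (key scheme is an assumption).
const buckets = new Map<string, TokenBucket>();
function bucketFor(clientName: string, tenantId: string, nowMs: number): TokenBucket {
  const key = `${tenantId}:${clientName}`;
  let bucket = buckets.get(key);
  if (!bucket) {
    bucket = new TokenBucket(10, 5, nowMs); // 10-token burst, 5 tokens per second
    buckets.set(key, bucket);
  }
  return bucket;
}

// Usage: a burst of 10 drains the bucket; the next request is told to wait.
const webBucket = bucketFor("web", "t1", 0);
const burst = webBucket.take(10, 0);
const refused = webBucket.take(1, 0);
```

A refused request would map to HTTP 429 with `Retry-After: refused.retryAfterSeconds`.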
Appendix CK — Subgraph SLOs and Ownership
- Each subgraph: latency/error SLOs; on-call; dashboards; runbooks
Appendix CL — Observability Fields
- route=operationName, clientName, variant, subgraph, planSteps, fanout
Appendix CM — Replay and Backfills
- Use idempotent loaders; replay cached persisted ops to warm downstream caches
Appendix CN — Edge Rules
- Block GraphQL playground in prod; CSP strict; CORS allowlist
Appendix CO — Capturing Usage
- Per-field usage; deprecation candidates; client impact analysis
Appendix CP — Threat Model Highlights
- Abuse via expensive queries; cache key leaks; SSRF through resolvers
- Mitigate with cost ceilings, redaction, outbound allowlists
Appendix CQ — Synthetic Monitors
- Canary persisted ops with thresholds; alert on regressions
Appendix CR — Chaos Scenarios
- Kill one subgraph; check brownout UX and partial data resilience
- Router rollout with bad bundle; ensure last-good fallback
Appendix CS — DR and Failover
- Router multi-region anycast; subgraphs active/active or active/passive
- Test failover quarterly with evidence
Appendix CT — Access Patterns
- Public vs private graphs; partner variants; admin ops separated
Appendix CU — Governance Dashboards
- Composition frequency; breaking change attempts; usage of deprecated fields
Appendix CV — Cost Dashboards
- Router CPU/mem per op; subgraph cost per k requests; cache ROI
Appendix CW — Developer Tooling
- CLI to generate persisted ops; schema diffs; usage reports per team
Appendix CX — Education
- Playbooks for defer/stream, pagination, caching, security
Appendix CY — Backward Compatibility Windows
- 60–90 days typical; faster with client feature flags and phased rollout
Appendix CZ — Final Principles
- Stable contracts, bounded costs, observable systems, and fast, safe delivery
Operations Runbooks (Extended)
Incident: Router 5xx spike
- Compare to client/version; check subgraph breakdown; circuit break worst
- Reduce budgets; enable brownout; roll back supergraph if composition changed
Incident: Cache stampede
- Increase TTL with jitter; warmers; protect backend with concurrency caps
Incident: Planner miss surge
- Identify new ops; enforce persisted; pre-warm; client guidance
Dashboards (Sketch JSON)
{
"title": "GraphQL Router",
"panels": [
{"type":"stat","title":"p95","targets":[{"expr":"histogram_quantile(0.95,sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
{"type":"graph","title":"Cache Hit","targets":[{"expr":"sum(rate(router_cache_hit_total[5m]))/sum(rate(router_cache_total[5m]))"}]},
{"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
]
}
Mega FAQ (1801–2000)
Q: Can we auto-defer under load?
A: Yes. Brownout mode defers/omits optional fields based on budgets.
Q: Should we log variables?
A: Redacted only; never PII; sample with care.
Q: Why do partial-data errors upset clients?
A: Educate and standardize UI patterns; keep core flows intact.
Q: How to bound cost for partners?
A: Contracts + budgets + rate limits + persisted-only.
Mega FAQ (2001–2200)
Q: Do we pin digests for router image?
A: Yes for prod; signed; provenance verified.
Q: Multi-cloud supergraph?
A: Possible with registry sync and region-local subgraphs.
Q: How to find deprecation targets?
A: Field usage telemetry sorted by zero-usage period.
Q: Are mutations cacheable?
A: No; but mutation results can prime read caches.
Mega FAQ (2201–2400)
Q: What breaks composition most?
A: Key mismatches, conflicting type ownership, incompatible field nullability.
Q: How to test subscriptions at scale?
A: Synthetic publishers; backpressure; soak tests; fanout metrics.
Q: Should we expose enums?
A: Yes with caution; include UNKNOWN; version with care.
Final: keep graphs clean, fast, and secure—optimize for maintainability.
Appendix DA — Supergraph Change Windows
- Define weekly windows for high-risk composition changes
- Auto-pause canary during incidents; require explicit resume
Appendix DB — Client Upgrade Strategy
- Contract variants per app version; deprecate with telemetry
- Feature flags map to fields; phased rollout by cohort
Appendix DC — Router Sandbox Mode
- Dry-run new supergraph in parallel; compare plans and latencies
- Toggle per-client cohort; no user impact during validation
Appendix DD — Subgraph Health Contracts
- Liveness: DB connectivity, cache reachability
- Readiness: migrations applied, warm caches, dependencies healthy
Appendix DE — Gateway Canary SLOs
- p95 latency within 10% of baseline; error rate delta < 0.5%
- Plan cache hit within 5% of baseline after 10 minutes
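These canary SLOs can be turned into an automated gate. The sketch below assumes rates are expressed as fractions, reads "within 5%" as five percentage points, and uses hypothetical type and function names.

```typescript
interface RouterStats {
  p95Ms: number;
  errorRate: number;        // fraction, e.g. 0.01 = 1%
  planCacheHitRate: number; // fraction, e.g. 0.9 = 90%
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

// Compare canary stats to baseline; collect every violated threshold.
function evaluateCanary(baseline: RouterStats, canary: RouterStats): GateResult {
  const reasons: string[] = [];
  if (canary.p95Ms > baseline.p95Ms * 1.1) {
    reasons.push("p95 latency more than 10% above baseline");
  }
  if (canary.errorRate - baseline.errorRate >= 0.005) {
    reasons.push("error rate delta at or above 0.5 percentage points");
  }
  if (canary.planCacheHitRate < baseline.planCacheHitRate - 0.05) {
    reasons.push("plan cache hit rate more than 5 points below baseline");
  }
  return { pass: reasons.length === 0, reasons };
}

// Usage: a healthy canary and a regressed one against the same baseline.
const baselineStats: RouterStats = { p95Ms: 100, errorRate: 0.01, planCacheHitRate: 0.9 };
const healthy = evaluateCanary(baselineStats, { p95Ms: 105, errorRate: 0.012, planCacheHitRate: 0.88 });
const regressed = evaluateCanary(baselineStats, { p95Ms: 120, errorRate: 0.02, planCacheHitRate: 0.8 });
```

Returning the reasons alongside the verdict gives the rollout tool something concrete to attach to an abort.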
Appendix DF — Traffic Shaping
- Shift by client, region, or op-class; protect hot paths first
- Brownout overrides for optional fields; defer under load
Appendix DG — Subgraph Outage Playbook
- Circuit break; mark fields as deferred/omitted; communicate status
- Warm fallback caches where possible; page on-call and track MTTR
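The "circuit break" step can be sketched as a small state machine placed in front of each subgraph call. This is an illustrative sketch under assumed parameters (`failureThreshold`, `cooldownMs`), not the behavior of any particular router; the injectable clock exists only to make the example testable.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    cooldownMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if a call to the subgraph may proceed.
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: allow probe traffic
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, (re)opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

While `allow()` returns false, the router would mark the affected fields as deferred or omitted rather than waiting on the failing subgraph.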
Appendix DH — Writer Isolation
- Route writes to home region; read replicas elsewhere; expose version stamps
- Avoid multi-subgraph transactions; orchestrate via events
Appendix DI — Cost Attribution
- Attribute router CPU/mem per operation and client
- Attribute subgraph cost by op fanout and latency; report to owners
Appendix DJ — SDL Conventions
- Use @deprecated with reason; include removal date
- Avoid leaking storage keys; opaque IDs with node lookup when needed
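The `@deprecated` convention above is easy to enforce in CI. The sketch below is a deliberately simplified, line-based check written for illustration; the function name `lintDeprecations` and the `YYYY-MM-DD` date format are assumptions, and a production lint would more likely walk the parsed SDL.

```typescript
interface DeprecationIssue {
  line: number;
  message: string;
}

// Simplified line-based check: flags @deprecated usages that lack a reason
// or whose reason does not contain a removal date in YYYY-MM-DD form.
function lintDeprecations(sdl: string): DeprecationIssue[] {
  const issues: DeprecationIssue[] = [];
  const isoDate = /\b\d{4}-\d{2}-\d{2}\b/;
  sdl.split("\n").forEach((text, index) => {
    if (text.indexOf("@deprecated") === -1) return;
    const match = /@deprecated\(\s*reason:\s*"([^"]*)"\s*\)/.exec(text);
    if (!match) {
      issues.push({ line: index + 1, message: "@deprecated is missing a reason" });
    } else if (!isoDate.test(match[1] ?? "")) {
      issues.push({ line: index + 1, message: "reason lacks a removal date (YYYY-MM-DD)" });
    }
  });
  return issues;
}
```

A field such as `nick: String @deprecated(reason: "Use name. Removal: 2025-12-01")` passes, while a bare `@deprecated` or a reason without a date is reported with its line number.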
Appendix DK — Query Shape Catalog
- Catalog top shapes; provide named fragments; enforce via lints
Appendix DL — Subgraph Release Trains
- Batch subgraph releases weekly; align with supergraph compositions
- Reduce composition churn and planner cache misses
Appendix DM — Router Resource Profiles
- CPU-bound vs IO-bound; adjust threads; pre-alloc arenas; GC tuning
Appendix DN — Persistence and Caches
- Redis/memcached for router result cache; LRU; per-tenant partitions
Appendix DO — Schema Diff Bots
- PR bot annotates diffs, breaking risks, usage impact, owners to review
Appendix DP — SLA and Budgets Dashboard
{
  "title": "Supergraph SLOs",
  "panels": [
    {"type": "stat", "title": "Router SLO"},
    {"type": "stat", "title": "Subgraph Error Budget Burn"}
  ]
}
Appendix DQ — Access Reviews
- Quarterly review of client access, variants, and scopes
- Revoke stale keys; rotate credentials
Appendix DR — Secrets and Key Management
- JWKS rotation; HSM/KMS for signing; key expiry policies
Appendix DS — Regionalization and Residency
- Split subgraphs per region; forbid cross-region PII via policy
Appendix DT — Playground and Tooling
- Internal dev-only; pre-auth; rate limits; operation registry integration
Appendix DU — Observability Annotations
- Tag spans with op class (hotpath|bulk|admin), client tier, cost estimate
Appendix DV — Rollback Evidence
- Store comparison charts for before/after; attach to incident and PR
Appendix DW — Partner Integrations
- Contract variants; SLA per partner; sandbox environments; sample data
Appendix DX — Education Path
- 101 GraphQL; 201 Federation; 301 Observability and Cost; 401 Security
Appendix DY — Data Classifications
- PUBLIC, INTERNAL, CONFIDENTIAL, SECRET; field annotations; policy gates
Appendix DZ — Final Operating Principles
- Contracts first; costs bounded; caches warm; rollouts safe; evidence always
Extended Templates
# Money scalar
scalar Money
# Node interface
interface Node { id: ID! }
# Router Deployment hardened
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
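// Router middleware (pseudocode): record the start time and reject over-budget operations.
// estimateCost and the ctx shape stand in for your router's own plugin hooks.
function beforeResolve(ctx) {
  ctx.vars.requestStart = Date.now();
  ctx.vars.cost = estimateCost(ctx.operation);
  if (ctx.vars.cost > ctx.budget) {
    throw new Error('COST_EXCEEDED');
  }
}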
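The middleware above relies on an `estimateCost` helper that the guide does not define. One simple way to fill it in, shown purely as an assumption, is static analysis over a simplified selection tree where each field costs 1 and a list field multiplies the cost of everything beneath it. The `Selection` type and `listSize` hint are hypothetical.

```typescript
// Hypothetical, simplified selection tree; a real router would derive this
// from the parsed operation and schema.
interface Selection {
  name: string;
  listSize?: number; // expected item count when the field returns a list
  children?: Selection[];
}

// Cost = 1 per field; a list field multiplies the cost of its children.
function estimateCost(selections: Selection[]): number {
  let total = 0;
  for (const sel of selections) {
    const childCost = estimateCost(sel.children ?? []);
    total += 1 + (sel.listSize ?? 1) * childCost;
  }
  return total;
}

// Example: { users(first: 10) { id posts(first: 5) { title } } }
const exampleOperation: Selection[] = [
  {
    name: "users",
    listSize: 10,
    children: [
      { name: "id" },
      { name: "posts", listSize: 5, children: [{ name: "title" }] },
    ],
  },
];
```

For the example operation, `posts` costs 1 + 5 × 1 = 6, the children of `users` cost 7 in total, and the whole operation costs 1 + 10 × 7 = 71, which can then be compared against the caller's budget.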
Mega FAQ (2401–2600)
- Why do plan cache misses spike?
New ops or contracts are the usual cause. Run cache warmers, reduce op churn, and guide clients toward shared fragments.
- Can we proxy REST as a subgraph?
Yes, via Mesh or a facade; ensure batching and caching to avoid N+1.
- Is GraphQL suitable for all endpoints?
Not for binary uploads; keep those in REST and have the graph return metadata.
- How to detach a subgraph?
Move ownership; deprecate fields; drain traffic; remove from supergraph.
Mega FAQ (2601–2800)
- Should public clients have a separate router?
Prefer a separate fleet with stricter policies and caching.
- How to test composition failures?
Inject schema conflicts in staging; verify gates block deploys.
- Are enums safe for partners?
Include UNKNOWN; evolve carefully; provide contracts per partner.
- When to paginate vs. window?
Paginate user lists; window analytics; stream where appropriate.
Mega FAQ (2801–3000)
- Defer/stream impact on SEO?
Server-render critical content; hydrate incremental payloads; use placeholders.
- Should we cap list sizes?
Yes: hard caps, client-friendly errors, and links to next pages.
- Is GraphQL cacheable at the CDN?
Persisted GET only; vary keys strictly; purge on mutations.
Final: simplicity wins. Clean schemas, stable operations, and tight operational practice.
Mega FAQ (3001–3200)
- Cost budgets per tier?
Yes: free, pro, and enterprise, with increasing caps and support.
- Run router at edge?
Possible; ensure plan cache warm, memory caps, and cold-start tests.
- Multi-tenant fairness?
Weighted queues, per-tenant budgets, strict isolation for heavy tenants.
Last word: observable, predictable, and efficient graphs at scale.
Appendix EA — Partner Sandbox and Throttles
- Separate variants and routers; synthetic data; strict budgets and persisted-only
Appendix EB — Budgeted Operations Catalog
- Define allowed ops per client tier; publish limits; auto-annotate violations
Appendix EC — Evidence Automation
- Attach dashboards, traces, composition diffs, and PRs to every rollout
Appendix ED — Education and Checklists
- Pre-merge: schema lint, composition ok, deprecations reviewed, cost within budget
- Pre-release: caches warm, canary plan, rollback steps, owners on-call
Appendix EE — Closing Notes
- Federate when teams are ready; keep contracts lean; measure and iterate
Final FAQ (3201–3400)
- Should we allow inline fragments everywhere?
Yes, but encourage named fragments for reusability and cacheability.
- Is stitching REST APIs into the schema OK?
Yes, via Mesh or facades with batching; watch latency and error surfaces.
- Can we expose GraphiQL in prod?
Not publicly; internal only, behind auth and rate limits.
- When to split a subgraph?
When ownership or scaling diverges; avoid ping-ponging entities.
Final: stable supergraph, fast router, efficient subgraphs. Own your slice well.
Quick Reference
- Persisted ops only for public
- Bound costs and depths
- Warm plan/result caches
- Observe p95 latency, error rate, and fanout
- Defer/stream non-critical fields
Troubleshooting Index
- High p95: check subgraph latency, cache hit rate, plan cache misses, and hot paths
- Many 5xx: classify errors; circuit break; roll back recent changes
- Composition failure: key conflicts, nullability mismatches, directive mismatches
Additional FAQ (3401–3600)
- How to handle client cache invalidation?
Return entity version fields; use cache policies; purge on mutation.
- Fallback when a subgraph is slow?
Defer optional fields; set timeouts; render partial data with notices.
- Should we expose a Node interface?
When helpful for universal fetches; ensure opaque IDs.
Final: contracts, costs, caches, and care.
Closing
Federation succeeds when domain ownership is clear, contracts are stable, and router/subgraphs are observable and resilient. Keep the supergraph lean, plan caches warm, and client operations disciplined.
Appendix EF — Gateway Cold Start Strategy
- Pre-bake router image with dependencies; lazy-load optional modules
- Warm plan/result caches after deploy using top persisted ops
- Keep supergraph bundle local with checksum verification
Appendix EG — Tenant Fairness and Quotas
- Weighted fair queue per tenant; enforce per-op and per-minute budgets
- Distinct queues for hot paths vs bulk analytics to avoid starvation
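A weighted fair queue can be approximated with virtual finish times: each operation is tagged with the time it would finish if its tenant received a share proportional to its weight, and the scheduler always serves the smallest tag. The sketch below is a simplified illustration; the class and field names are hypothetical, and per-minute budgets would be layered on separately.

```typescript
interface QueuedOp {
  tenant: string;
  opId: string;
  cost: number;
}

interface Tagged {
  op: QueuedOp;
  finish: number; // virtual finish time
  seq: number;    // arrival order, used as a tie-breaker
}

class WeightedFairQueue {
  private readonly pending: Tagged[] = [];
  private readonly lastFinish = new Map<string, number>();
  private readonly weights: Map<string, number>;
  private virtualTime = 0;
  private seq = 0;

  constructor(weights: Map<string, number>) {
    this.weights = weights;
  }

  enqueue(op: QueuedOp): void {
    const weight = this.weights.get(op.tenant) ?? 1; // unknown tenants get weight 1
    const start = Math.max(this.virtualTime, this.lastFinish.get(op.tenant) ?? 0);
    const finish = start + op.cost / weight;
    this.lastFinish.set(op.tenant, finish);
    this.pending.push({ op, finish, seq: this.seq++ });
  }

  // Serves the operation with the smallest virtual finish time.
  dequeue(): QueuedOp | undefined {
    this.pending.sort((a, b) => a.finish - b.finish || a.seq - b.seq);
    const next = this.pending.shift();
    if (next === undefined) return undefined;
    this.virtualTime = next.finish;
    return next.op;
  }
}
```

With equal weights, a tenant that enqueues a burst of three operations does not starve a tenant that arrives afterwards with one: the late arrival is served second rather than fourth. Sorting on every dequeue keeps the sketch short; a heap would be the usual choice at volume.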
Appendix EH — Error Budgets per Path
- Track error budgets per operation class; freeze risky schema changes on burn
- Route suspicious traffic to lower-cost variants during incidents
Appendix EI — Subgraph Cache Contracts
- Define cache invariants for read models; TTLs; invalidation hooks on mutations
- Expose version fields and lastUpdated timestamps for client-side reconciliation
Appendix EJ — Router Memory Hygiene
- Cap result size; stream large lists; enforce max response bytes
- Tune alloc arenas; monitor GC pauses; pre-size buffers for hot paths
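Capping result size usually starts with capping page sizes at the resolver boundary. A minimal sketch, assuming a Relay-style `first` argument and hypothetical default and maximum sizes supplied by the caller; it returns a result object instead of throwing so the resolver can surface a client-friendly message.

```typescript
type PageSizeResult =
  | { ok: true; size: number }
  | { ok: false; message: string };

// Validates a requested page size against a hard cap.
function resolvePageSize(
  requested: number | undefined,
  defaultSize: number,
  maxSize: number,
): PageSizeResult {
  if (requested === undefined) {
    return { ok: true, size: defaultSize };
  }
  if (!Number.isInteger(requested) || requested <= 0) {
    return { ok: false, message: "`first` must be a positive integer." };
  }
  if (requested > maxSize) {
    return {
      ok: false,
      message:
        `Requested ${requested} items, but the maximum page size is ${maxSize}. ` +
        "Request a smaller page and follow the cursor for more.",
    };
  }
  return { ok: true, size: requested };
}
```

Rejecting oversized requests, rather than silently clamping them, makes the cap visible to client developers; clamping is a reasonable alternative when backward compatibility matters more.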
Appendix EK — Partner Governance
- Separate partner variants with strict schemas and cost ceilings
- Contract SLAs, deprecation calendars, and emergency kill switches
Appendix EL — Security Headers and TLS
- HSTS, CSP (report-only then enforce), COOP/COEP for isolation
- mTLS to subgraphs; pin CA; rotate certs automatically
Appendix EM — Router Feature Flags
- Gradually enable planner optimizations, caching strategies, and brownout rules
- Flags are persisted and audited; roll back instantly on regression
Appendix EN — Cost Visibility for Teams
- Dashboard cost per field and operation; show top offenders per team
- Quarterly reviews to reduce over-fetch and normalize shapes
Appendix EO — Data Residency Enforcement
- Policy denies field access outside region; annotate SDL with residency class
- Router selects region-local subgraphs; fallback only for public data
Appendix EP — SDL Documentation Automation
- Generate docs per variant; include usage charts and deprecation timelines
- Link to persisted operations catalog and client examples
Appendix EQ — Blue/Green Subgraphs
- Run v1 and v2 side-by-side; router targets cohort; compare KPIs
- Drain old after success; archive metrics and composition snapshot
Appendix ER — Incident Drill Catalog
- Subgraph timeout storm; router cache miss spike; partner abuse burst
- Each drill has playbook, metrics targets, and evidence bundle template
Appendix ES — Access Tokens and JWKS
- Rotate signing keys; cache JWKS with short TTL; fail closed on mismatch
- Support key rollover with overlapping validity windows
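Overlapping validity windows can be modeled explicitly so that rollover gaps are caught before they reach production. The sketch below is illustrative: JWKS entries carry a `kid`, but the `notBeforeMs` and `notAfterMs` fields are assumed to come from your own key-management metadata, and the function names are hypothetical.

```typescript
// Hypothetical key record; the validity window is assumed metadata,
// not a standard JWKS field.
interface SigningKey {
  kid: string;
  notBeforeMs: number; // inclusive, epoch milliseconds
  notAfterMs: number;  // exclusive, epoch milliseconds
}

// Returns the key to verify with, or undefined to signal "reject the token" (fail closed).
function selectVerificationKey(
  keys: SigningKey[],
  kid: string,
  nowMs: number,
): SigningKey | undefined {
  return keys.find(
    (k) => k.kid === kid && k.notBeforeMs <= nowMs && nowMs < k.notAfterMs,
  );
}

// True when every instant in [fromMs, toMs) is covered by at least one key,
// i.e. the rollover windows overlap with no gap.
function hasContinuousCoverage(
  keys: SigningKey[],
  fromMs: number,
  toMs: number,
): boolean {
  const sorted = [...keys].sort((a, b) => a.notBeforeMs - b.notBeforeMs);
  let coveredUntil = fromMs;
  for (const k of sorted) {
    if (k.notBeforeMs > coveredUntil) break; // gap before this key becomes valid
    coveredUntil = Math.max(coveredUntil, k.notAfterMs);
    if (coveredUntil >= toMs) return true;
  }
  return coveredUntil >= toMs;
}
```

Running `hasContinuousCoverage` over the planned rotation schedule in CI is one way to verify that a new key becomes valid before the old one expires.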
Appendix ET — Plan Cache Engineering
- Keyed by signature and stable variable shapes; bucketized for memory caps
- Evict LRU; protect hot entries; export hit/miss and evictions
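The LRU eviction and hit/miss/eviction export described above can be sketched with a `Map`, which preserves insertion order. This is an illustrative sketch only: the class name is hypothetical, the key is assumed to be the operation signature, and it does not implement hot-entry protection or memory bucketing.

```typescript
interface CacheStats {
  hits: number;
  misses: number;
  evictions: number;
}

// Minimal LRU keyed by operation signature. Values must not be undefined,
// because undefined is used to signal a miss.
class PlanCache<V> {
  readonly stats: CacheStats = { hits: 0, misses: 0, evictions: 0 };
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) {
      this.stats.misses += 1;
      return undefined;
    }
    // Re-insert to mark the entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    this.stats.hits += 1;
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
        this.stats.evictions += 1;
      }
    }
    this.entries.set(key, value);
  }
}
```

The `stats` object is what a metrics exporter would scrape; a sustained rise in `evictions` relative to `hits` suggests the capacity is too small for the current operation mix.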
Appendix EU — Result Cache Engineering
- Respect cache-control hints; per-tenant segments; compression on
- Invalidate on mutation topics; jitter TTLs to avoid herds
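Two of the points above, TTL jitter and per-tenant segments, are small enough to show directly. The helpers below are a sketch with hypothetical names; the `ratio` parameter and the key layout are assumptions rather than any cache's required format.

```typescript
// Spreads expirations across [base * (1 - ratio), base * (1 + ratio)] so that
// entries written together do not all expire together.
function jitteredTtlMs(
  baseMs: number,
  ratio: number,
  rand: () => number = Math.random,
): number {
  const clamped = Math.min(Math.max(ratio, 0), 1);
  const offset = (rand() * 2 - 1) * clamped * baseMs;
  return Math.max(0, Math.round(baseMs + offset));
}

// Puts the tenant first in the key so one tenant's entries are never served
// to another; variables are sorted so key order does not fragment the cache.
function resultCacheKey(
  tenantId: string,
  operationId: string,
  variables: Record<string, unknown>,
): string {
  const parts = Object.keys(variables)
    .sort()
    .map((name) => `${name}=${JSON.stringify(variables[name])}`);
  return [tenantId, operationId, parts.join("&")].join("|");
}
```

With a 60-second base TTL and a ratio of 0.1, entries expire somewhere between 54 and 66 seconds after being written, which flattens the refill spike after a mass invalidation.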
Appendix EV — Subgraph Sandboxes
- Ephemeral envs for schema experiments; compose against staging supergraph
- Synthetic data sets; abuse testing; performance benchmarks
Appendix EW — Ops CLI
- Commands: diff-supergraph, warm-caches, dump-hotpaths, block-op, lift-freeze
Appendix EX — Education Tracks
- Clients: operation hygiene and caching; Subgraphs: data loaders and auth
- Operators: observability deep dive; Security: privacy and attack surfaces
Appendix EY — Privacy Reviews
- Checklist per new field: classification, retention, residency, purpose
- Automated scanning to flag risky names and patterns (e.g., ssn, email)
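The automated scan can start as a simple name matcher run against every new field in a schema diff. In the sketch below, only `ssn` and `email` come from the text above; the remaining patterns and the function name are assumptions added for illustration.

```typescript
// Patterns beyond `ssn` and `email` are illustrative assumptions.
const RISKY_NAME_PATTERNS: RegExp[] = [
  /ssn/i,
  /email/i,
  /passw(or)?d/i,
  /birth/i,
  /phone/i,
];

// Returns the field names that match at least one risky pattern.
function flagRiskyFields(fieldNames: string[]): string[] {
  return fieldNames.filter((name) =>
    RISKY_NAME_PATTERNS.some((pattern) => pattern.test(name)),
  );
}
```

Name matching produces false positives and misses renamed fields, so a flag should trigger the manual checklist above rather than block the change outright.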
Appendix EZ — Final Practices
- Own domains; keep contracts minimal; budget costs; measure relentlessly
Mega FAQ (3601–4000)
- Why do we still see timeouts with cached ops?
Subgraph latency dominates; caching helps the router but not slow sources.
- Should we allow multi-hop entity chains?
Limit them; deep chains explode latency. Flatten with views or denormalization.
- How to guard against schema sprawl?
Lint packs, ownership, review gates, and usage-based pruning.
- Can clients pick variants dynamically?
Yes, via headers; enforce allowlists; migrate gradually.
- How to handle rogue clients?
Block at edge; revoke keys; quarantine ops; notify owners.
- Router CPU spikes after deploy?
The plan cache is likely cold. Run warmers, reduce composition churn, and profile GC.
- Are batched resolvers always better?
Usually, but watch peak memory; cap batch sizes; stream results.
- Should we allow file uploads through the graph?
Prefer signed URLs via REST; keep the graph metadata-only.
- Federation vs BFFs?
Federation centralizes contracts; BFFs can coexist as clients of the graph.
Final: clarity, constraints, and continuous care keep graphs healthy.