GraphQL Federation for Microservices and API Gateways (2025)

Oct 26, 2025

Tags: graphql, federation, api-gateway, microservices

Federation enables teams to own subgraphs while exposing a unified API. This guide focuses on practical schema design and operations.

Executive summary

Define clear ownership and boundaries; avoid cross-subgraph coupling
Persisted queries and operation registry for safety/perf; cache where stable
Monitor resolver costs; implement DDoS and complexity limits

Subgraphs and composition

Entity references; keys; value types; composed schema pipelines

Gateway concerns

AuthN/Z; caching; persisted queries; complexity analysis; timeouts/retries

Deployment

Versioned subgraphs; canary; automated composition checks; contracts

Observability

Trace resolvers; cost budgets; rate limiting per client/operation

FAQ

Q: When to choose federation vs monolithic GraphQL?
A: Federation for large orgs with clear domain ownership; monolith for small teams to avoid overhead.

1) What Is Federation?

Split a single graph across subgraphs owned by domain teams
Compose at a router/gateway into a single supergraph

2) Core Components

- Router/Gateway (Apollo Router/Federation, GraphQL Mesh, Helix, Mercurius)
- Subgraphs (domain services) exposing GraphQL schemas with federation directives
- Schema registry and composition pipeline

3) Entities and Keys

# Example subgraph entity
type User @key(fields: "id") {
  id: ID!
  name: String
}

4) Reference Resolution

// __resolveReference in subgraph
// Ctx is the per-request context type; users.byId is assumed to be a batched lookup
export const User = {
  __resolveReference(ref: { id: string }, ctx: Ctx) {
    return ctx.users.byId(ref.id);
  }
};

5) Composition and Contract

Use schema registry (Apollo/GraphOS or open-source) to validate
Contracts hide fields/types for specific clients

6) Query Planning

Router splits query across subgraphs; stitches results
Optimize with @requires, @provides, and proper entity boundaries

7) Preventing N+1

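import DataLoader from "dataloader";
// Create one loader per request; batchGetUsers must return results in the same order as ids
const userLoader = new DataLoader((ids: readonly string[]) => batchGetUsers(ids));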
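DataLoader expects the batch function to resolve to an array with the same length and order as the keys it was given. The sketch below shows one way to meet that contract when a bulk lookup returns rows in arbitrary order. `UserRow` and `alignToKeys` are illustrative names, not part of the original example.

```typescript
// Hypothetical row shape; the real one comes from the users data store.
interface UserRow {
  id: string;
  name: string;
}

// Reorder rows from a bulk lookup so that position i corresponds to ids[i].
// A missing id becomes an Error, which DataLoader rejects for that key only.
function alignToKeys(
  ids: readonly string[],
  rows: readonly UserRow[],
): Array<UserRow | Error> {
  const byId = new Map<string, UserRow>();
  for (const row of rows) byId.set(row.id, row);
  return ids.map((id) => byId.get(id) ?? new Error(`User ${id} not found`));
}

// Example: the store returned rows out of order and without "u3".
const aligned = alignToKeys(
  ["u1", "u2", "u3"],
  [
    { id: "u2", name: "Bo" },
    { id: "u1", name: "Al" },
  ],
);
```

A batch function such as `batchGetUsers` could call a helper like this on the result of its bulk query before returning.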

8) Caching

- CDN for GET persisted queries
- Router cache for query plans and results (TTL + cache hints)
- Edge caches per viewer for personalization

9) APQ and Persisted Queries

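APQ (automatic persisted queries) sends a SHA-256 hash instead of the full query text, which shrinks payloads but still accepts any query; registered persisted queries lock operations down
Operation registry with safelist; block arbitrary queries in prod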
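The safelist idea can be sketched in a few lines: operations are registered ahead of time under the SHA-256 hex digest of their text, and production traffic may only reference known hashes. The `OperationSafelist` class, its in-memory map, and the rejection reason string are assumptions for illustration; a real registry would be backed by the schema registry or a shared store.

```typescript
import { createHash } from "node:crypto";

// SHA-256 hex digest of the exact operation text.
function operationHash(query: string): string {
  return createHash("sha256").update(query, "utf8").digest("hex");
}

type Decision =
  | { allowed: true; query: string }
  | { allowed: false; reason: string };

// Hypothetical in-memory registry: hash -> registered operation text.
class OperationSafelist {
  private readonly ops = new Map<string, string>();

  register(query: string): string {
    const hash = operationHash(query);
    this.ops.set(hash, query);
    return hash;
  }

  // Production mode: only hashes published ahead of time are executed.
  resolve(hash: string): Decision {
    const query = this.ops.get(hash);
    return query === undefined
      ? { allowed: false, reason: "NOT_IN_SAFELIST" }
      : { allowed: true, query };
  }
}

const safelist = new OperationSafelist();
const knownHash = safelist.register("query Me { me { id name } }");
const accepted = safelist.resolve(knownHash);
const rejected = safelist.resolve(operationHash("query Dump { users { email } }"));
```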

10) AuthN/Z

- JWT/OAuth at edge; propagate identity to subgraphs via headers or context
- Field-level auth: directive-based checks; schema-policy integration
- Multi-tenant scoping: orgId in context; enforce in resolvers

11) Rate Limiting and Abuse

- Edge/L7 limits per token/IP; router-level query cost budgets
- Subgraph local limits for hotspots

12) Complexity and Depth Limits

// cost map per type/field; block expensive operations; per-tenant budgets

13) Errors and Retries

- Distinguish user vs system errors; partial data with errors array
- Retry idempotent subgraph calls with backoff; circuit breakers on failures

14) Observability

- OpenTelemetry traces: edge → router → subgraphs; include operation names
- Metrics: p95 latency, error rate, cache hit, planner misses, subgraph fanout
- Logs: redacted variables, request IDs, client name/version
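Redacting variables before they reach logs might look like the sketch below. The deny-list in `SENSITIVE_KEYS` is a made-up example; a real deployment would more likely derive it from schema annotations or a central policy.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical deny-list of variable names, compared case-insensitively.
const SENSITIVE_KEYS = new Set(["password", "token", "email", "ssn", "cardnumber"]);

// Walk the variables object and replace sensitive values, keeping the shape intact
// so that logs still show which variables were supplied.
function redactVariables(value: Json): Json {
  if (Array.isArray(value)) return value.map((item) => redactVariables(item));
  if (value !== null && typeof value === "object") {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redactVariables(inner);
    }
    return out;
  }
  return value;
}

const logged = redactVariables({
  id: "u1",
  input: { email: "a@example.com", cardNumber: "4111", tags: ["x"] },
});
```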

15) Schema Governance

- PR checks: composition, breaking changes, deprecations
- Contract tests: consumer-driven; example queries validated

16) CI/CD Flow

- Subgraph schema lint → publish to registry → compose → router rollout
- Canary new composition; rollback on composition or runtime regressions

17) Router Config (Sketch)

# Illustrative keys only; real routers differ (Apollo Router, for example, reads a YAML file)
[supergraph]
subgraphs = ["users", "orders", "catalog"]

[cors]
origins = ["https://app.example.com"]

[telemetry]
otlp = { endpoint = "otel:4317" }

[caching]
plan_ttl = "5m"

18) Subgraph Examples

# Orders subgraph: key fields are always passed to the resolver, so `orders` needs no @requires
extend type User @key(fields: "id") {
  id: ID! @external
  orders: [Order]
}

19) Dataloaders at Subgraphs

export const resolvers = {
  Query: { user: (_, { id }, ctx) => ctx.users.byId(id) },
  User: { orders: (u, _, ctx) => ctx.orders.byUserIds.load(u.id) }
};

20) Edge Caching with CDNs

GET persisted queries only
Vary on auth/tenant headers if necessary; use cache hints

21) Security

- Disallow introspection in prod for public clients; allow for trusted admin
- Input validation; query cost ceilings; depth limits; timeouts
- PII minimization; field-level encryption where necessary

22) Versioning and Deprecations

Avoid versioning the graph; use deprecations and contracts
Remove after deprecation window with usage telemetry

23) Migration Playbooks

- REST → GraphQL: facade router; strangler pattern; move domains incrementally
- Monolith → subgraphs: extract entities one domain at a time; measure latency

24) Testing Strategy

- Unit: resolvers and policies
- Integration: router + subset of subgraphs with mock HTTP
- Contract: example operations validated against composed schema
- E2E: critical app flows with persisted queries

25) Failure Modes

- Subgraph outage: partial data; error isolation; fallback content
- Composition failure: block rollout; alert; fix schema conflicts
- Hot path blowups: query cost guard; cache; paginate

26) Templates and Repos

subgraphs/users
  schema.graphql
  src/
subgraphs/orders
router/
  supergraph.graphql (generated)
  router.toml

27) Mega FAQ (1–400)

Q: Should I start with federation?
A: No. Start with a single graph; federate when team and domain boundaries are clear.
Q: How many subgraphs?
A: As many as team ownership and latency allow; avoid excessive fanout.
Q: Is N+1 fixed in the router or the subgraph?
A: Fix it in the subgraph with dataloaders; the router should not hide subgraph inefficiency.
Q: Do I need a registry?
A: Yes, for safe composition, contracts, and usage insights.

...

JSON-LD

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "GraphQL Federation and API Gateway (2025)",
  "description": "Production-grade guide to GraphQL federation: routers, subgraphs, composition, caching, security, observability, and operations.",
  "datePublished": "2025-10-28",
  "dateModified": "2025-10-28",
  "author": {"@type":"Person","name":"Elysiate"}
}
</script>

CTA

Need a resilient, fast federated graph? We architect supergraphs, implement routers, and harden subgraphs end‑to‑end.

Appendix A — Supergraph Architecture

- Router as a stateless layer; horizontal scale; fast startup; hot reload of supergraph
- Subgraphs own domain models; strict boundaries; clear entity ownership
- Registry for composition and contracts; CI gates for safe rollout

Appendix B — Router Configuration Patterns

# router.toml (illustrative sketch; Apollo Router itself is configured in YAML, so treat these keys as placeholders)
[supergraph]
listen = "0.0.0.0:4000"

[cors]
origins = ["https://app.example.com"]
allow_credentials = true

[headers]
forward = ["authorization", "x-tenant-id", "x-request-id"]

[telemetry]
exporter = "otlp"
endpoint = "otel:4317"

[caching]
plan_cache_ttl = "10m"
result_cache_ttl = "30s"

[timeouts]
overall = "3s"
subgraph = "2.5s"

Appendix C — Composition Workflow and Registry

- Subgraph PR: schema lint → publish to registry as draft → composition check
- Contract variants per client → hide internal fields and federated joins
- Composition promotion: canary router loads new supergraph; observe KPIs

Appendix D — Caching and Hints

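# Cache control hints (Apollo Server-style @cacheControl directive)
# The directive applies to types and fields, not to the schema;
# set the default max age in the server's cache-control configuration instead
type Product @key(fields: "id") @cacheControl(maxAge: 60) {
  id: ID!
  name: String @cacheControl(maxAge: 300)
  price: Money @cacheControl(scope: PRIVATE)
}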

- Edge cache GET persisted queries; vary on viewer keys if needed
- Router result cache + per-field cache hints; subgraph-level HTTP caching

Appendix E — Operation Registry and Persisted Queries

- Safelist persisted operations; block ad-hoc queries in prod
- Client name/version required; roll out new ops via registry publish
- CDN stores GET /?hash=... with long TTL + revalidation

Appendix F — Query Cost and Depth Guards

type CostMap = Record<string, number>; // type.field → cost

function estimateCost(selectionSet: any, costMap: CostMap): number {
  // Walk AST; sum costs; account for list multipliers
  return 0; // impl omitted
}

// Reject if cost > tenantBudget or depth > maxDepth
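One possible shape for the omitted implementation is sketched below. It works on a simplified selection tree rather than the real GraphQL AST, and the `Selection` type, the default cost of 1, and the `listSize` multiplier are all assumptions made for illustration.

```typescript
type CostMap = Record<string, number>; // "Type.field" -> cost

// Simplified selection node (an assumption; a real router walks the GraphQL AST).
interface Selection {
  parentType: string;
  field: string;
  listSize?: number; // expected page size when the field returns a list
  children?: Selection[];
}

// Cost of a field = its own cost + listSize x (cost of its children).
// Fields missing from the cost map default to 1.
function estimateCost(selections: Selection[], costMap: CostMap): number {
  let total = 0;
  for (const sel of selections) {
    const own = costMap[`${sel.parentType}.${sel.field}`] ?? 1;
    const multiplier = sel.listSize ?? 1;
    total += own + multiplier * estimateCost(sel.children ?? [], costMap);
  }
  return total;
}

function maxDepth(selections: Selection[]): number {
  let deepest = 0;
  for (const sel of selections) {
    deepest = Math.max(deepest, 1 + maxDepth(sel.children ?? []));
  }
  return deepest;
}

// Reject if cost > tenantBudget or depth > depthLimit.
function admit(
  selections: Selection[],
  costMap: CostMap,
  tenantBudget: number,
  depthLimit: number,
): boolean {
  return estimateCost(selections, costMap) <= tenantBudget && maxDepth(selections) <= depthLimit;
}

// query { user { name orders(first: 20) { total } } }
const operation: Selection[] = [
  {
    parentType: "Query",
    field: "user",
    children: [
      { parentType: "User", field: "name" },
      {
        parentType: "User",
        field: "orders",
        listSize: 20,
        children: [{ parentType: "Order", field: "total" }],
      },
    ],
  },
];
const costs: CostMap = { "User.orders": 5, "Order.total": 2 };
const cost = estimateCost(operation, costs); // 1 + (1 + (5 + 20 * 2)) = 47
const depth = maxDepth(operation); // 3
```

The list multiplier is what makes a shallow query over a large page more expensive than a deep query over single objects, which is why depth limits alone are not enough.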

Appendix G — Auth Patterns

- Edge: verify JWT; attach claims to context; reject expired/invalid
- Router: propagate headers; optional field-level policies via directives
- Subgraph: enforce domain authorization; never rely solely on router checks

Appendix H — Multi-Tenancy

- tenantId in context; row-level filters in subgraphs
- Cache segmentation by tenant; rate limits per tenant and client
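A minimal sketch of both points, assuming a `RequestContext` populated at the edge and an `OrderRow` shape invented for the example: the subgraph filters rows by tenant regardless of what the router checked, and cache keys carry the tenant so entries cannot cross tenants.

```typescript
// Hypothetical context and row shapes for illustration.
interface RequestContext {
  tenantId: string;
  roles: string[];
}

interface OrderRow {
  id: string;
  tenantId: string;
  total: number;
}

// Row-level filter applied inside the subgraph.
function scopeToTenant<T extends { tenantId: string }>(
  rows: readonly T[],
  ctx: RequestContext,
): T[] {
  return rows.filter((row) => row.tenantId === ctx.tenantId);
}

// Encode tenant and key as a JSON pair so the two parts cannot be confused.
function tenantCacheKey(ctx: RequestContext, key: string): string {
  return JSON.stringify([ctx.tenantId, key]);
}

const ctxA: RequestContext = { tenantId: "acme", roles: ["member"] };
const allOrders: OrderRow[] = [
  { id: "o1", tenantId: "acme", total: 10 },
  { id: "o2", tenantId: "globex", total: 20 },
  { id: "o3", tenantId: "acme", total: 30 },
];
const visible = scopeToTenant(allOrders, ctxA);
const cacheKey = tenantCacheKey(ctxA, "orders:recent");
```

In a real subgraph the filter would normally be pushed into the database query; the in-memory version only shows where the tenant check belongs.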

Appendix I — Schema Contracts and Variants

- Base supergraph for internal; contract variants for mobile/web/partners
- Hide unstable fields; deprecation windows tracked via usage metrics

Appendix J — Consistency and Latency

- Eventual consistency across subgraphs; avoid cross-service transactions
- Use entity views; subscribe or poll for refresh; expose updatedAt

Appendix K — Subscriptions and Real-Time

# Router passes through websocket; subgraphs stream events
 type Subscription {
   orderUpdated(id: ID!): OrderEvent!
 }

- Prefer server push for high-value events; throttle rate; backpressure

Appendix L — Defer/Stream and Batching

# @defer applies to fragment spreads and inline fragments, not to operations or fields
query ProductPage {
  product(id: "p1") {
    id
    name
    ... @defer {
      reviews { body rating }
    }
  }
}

- Stream lists; defer expensive fields; reduce TTFB
- Batch subgraph calls with Dataloaders; coalesce per-request

Appendix M — Retries, Timeouts, and Circuit Breaking

- Subgraph HTTP timeouts < router overall timeout; retry idempotent GETs
- Circuit break noisy subgraphs; partial responses with error annotations
- SLOs: p95 < 200ms per subgraph; error rate < 1%

Templates

# Kubernetes deployment for router (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
spec:
  replicas: 4
  selector:
    matchLabels: { app: router }
  template:
    metadata:
      labels: { app: router }
    spec:
      containers:
        - name: router
          image: ghcr.io/org/router:1.0.0
          ports: [{ containerPort: 4000 }]
          env:
            - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: "http://otel:4317" }
            - { name: SUPERGRAPH_FILE, value: /etc/supergraph.graphql }

Appendix N — Federation v2 Directives and Patterns

# Common directives: @key, @requires, @provides, @shareable, @inaccessible
# Example
type Product @key(fields: "id") {
  id: ID!
  name: String @shareable
  seller: Seller
  weight: Float @external
  shippingCost: Money @requires(fields: "weight")
}

- Prefer @key on natural identifiers when stable; otherwise synthetic IDs
- Use @requires to pull @external fields from another subgraph that a local resolver needs as input; key fields never need it
- Limit @provides to clear, stable contracts; avoid tight coupling

Appendix O — Entity Design

- One owning subgraph per entity; others extend only
- Keep entities small; expose views for heavy aggregates
- Avoid cyclic ownership; model via references and views

Appendix P — Router Resilience

- Timeouts per subgraph; overall request time budget
- Circuit breakers on error spikes; partial responses with error nodes
- Backoff retries only for idempotent fetches; never for mutations
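The retry rule above can be sketched as follows. The attempt count, base delay, and cap are placeholder values, and treating every `query` as idempotent is an assumption that holds only if resolvers have no side effects.

```typescript
type OperationKind = "query" | "mutation" | "subscription";

// Only queries are retried, and only while attempts remain.
function isRetryable(kind: OperationKind, attemptsMade: number, maxAttempts: number): boolean {
  return kind === "query" && attemptsMade < maxAttempts;
}

// Exponential backoff with a cap: base, 2 x base, 4 x base, ...
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

async function fetchWithRetry<T>(
  kind: OperationKind,
  call: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 50,
  capMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // attempt + 1 calls have been made so far; mutations fail on the first error.
      if (!isRetryable(kind, attempt + 1, maxAttempts)) throw err;
      const delay = backoffDelayMs(attempt, baseMs, capMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Adding random jitter to the delay is a common refinement so that many router replicas do not retry against a recovering subgraph at the same instant.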

Appendix Q — Caching and CDN Strategy

- Persisted GET only at edge; Vary by auth/tenant
- Router result cache keyed by operation + variables + viewer scope
- Use cache hints; private scope for user data; short TTLs for hot paths
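A result cache key built from operation, variables, and viewer scope might be derived as in this sketch. The `CacheKeyInput` fields and the `|` separator are assumptions; the important properties are that variable order does not change the key and that PRIVATE responses are never stored without a viewer identity.

```typescript
// Serialize with sorted object keys so {a:1,b:2} and {b:2,a:1} produce the same text.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null";
}

type CacheScope = "PUBLIC" | "PRIVATE";

// Hypothetical input shape for illustration.
interface CacheKeyInput {
  tenantId: string;
  operationHash: string;
  variables: Record<string, unknown>;
  scope: CacheScope;
  viewerId?: string;
}

function resultCacheKey(input: CacheKeyInput): string {
  if (input.scope === "PRIVATE" && !input.viewerId) {
    throw new Error("PRIVATE responses need a viewer id");
  }
  const viewer = input.scope === "PRIVATE" ? input.viewerId : "public";
  return [input.tenantId, viewer, input.operationHash, stableStringify(input.variables)].join("|");
}

const keyA = resultCacheKey({
  tenantId: "acme", operationHash: "abc", scope: "PRIVATE", viewerId: "u1",
  variables: { first: 10, after: null },
});
const keyB = resultCacheKey({
  tenantId: "acme", operationHash: "abc", scope: "PRIVATE", viewerId: "u1",
  variables: { after: null, first: 10 },
});
const keyOtherViewer = resultCacheKey({
  tenantId: "acme", operationHash: "abc", scope: "PRIVATE", viewerId: "u2",
  variables: { first: 10, after: null },
});
```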

Appendix R — Persisted Queries Enforcement

- Operation registry with safelist; block unknown hashes in prod
- Deployment gates: router loads only approved op set per client version

Appendix S — Auth and Tenancy Patterns

- Edge verifies JWT; attach tenantId, roles
- Subgraphs enforce domain auth; row filters by tenantId
- Field-level directives for sensitive data; audit access

Appendix T — Subscriptions, Defer, and Stream

- Subscriptions for high-value events; throttle and backpressure
- Defer/Stream: reduce TTFB by streaming non-critical fields/lists
- Clients must handle incremental payloads robustly

Appendix U — Security

- Disable introspection for public clients; enforce query cost/depth
- Sanitize errors; redact variables in logs; input validation
- SSRF prevention in subgraphs; outbound egress allowlists

Appendix V — Observability Deep Dive

- Trace IDs from edge; router span names = operation:client
- Attributes: subgraph, planStep, costEstimate, cacheHit
- Metrics: planner_miss, fanout, subgraph_p95, error_rate, cache_hit_rate

Appendix W — CI/CD Workflows

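name: federation
on:
  pull_request:
  push:
    branches: [main]
jobs:
  check-subgraph:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint:schema && npm run test
  publish-schema:
    if: github.ref == 'refs/heads/main'
    needs: check-subgraph
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes rover is installed on the runner and APOLLO_KEY is provided as a secret
      - run: rover subgraph publish org@current --name users --schema ./schema.graphql
  compose-and-canary:
    if: github.ref == 'refs/heads/main'
    needs: publish-schema
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # supergraph.yaml lists each subgraph's schema source
      - run: rover supergraph compose --config ./supergraph.yaml > supergraph.graphql
      - run: ./deploy_router_canary.sh supergraph.graphql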

Appendix X — Testing Strategy Details

- Router integration: mock subgraphs via HTTP fixtures
- Subgraph integration: real DB containers; dataloaders; auth hooks
- Contract: example ops validated against composed graph in CI
- Load tests: persisted ops with realistic variable distributions

Appendix Y — Migration Playbooks

- Monolith → subgraphs: extract entity at a time; proxy unknown fields to monolith
- REST → graph: facade subgraph translating to REST; deprecate endpoints gradually

Appendix Z — Examples

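# @provides names fields of the returned type that this subgraph can resolve itself
type Order @key(fields: "id") {
  id: ID!
  total: Money
  customer: User @provides(fields: "name")
}

type User @key(fields: "id") {
  id: ID!
  name: String @external
}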

Appendix AA — Gateway Resiliency Patterns

- Timeouts per subgraph; hedged requests sparingly; circuit breakers
- Partial responses for non-critical fields; user-facing fallbacks
- Bulkhead isolation: limit concurrency per subgraph
- Brownout mode: omit expensive fields under load via @defer/@stream
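A circuit breaker for a single subgraph can be sketched as a small state machine. The threshold, cooldown, and injected clock are illustrative choices; this is not how any particular router implements it.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // While this returns false, callers skip the subgraph and return partial data.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state() === "half-open" || this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // open (or re-open) and restart the cooldown
    }
  }
}

// Walk through the states with a fake clock.
let clock = 0;
const breaker = new CircuitBreaker(3, 1000, () => clock);
breaker.recordFailure();
breaker.recordFailure();
breaker.recordFailure();
const afterThreeFailures = breaker.state(); // "open"
clock = 1000;
const afterCooldown = breaker.state(); // "half-open"
breaker.recordFailure();
const afterProbeFailure = breaker.state(); // "open" again
clock = 2000;
breaker.recordSuccess();
const afterProbeSuccess = breaker.state(); // "closed"
```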

Appendix AB — Request Shaping and Query Hints

- Encourage clients to request minimal sets; provide profiles (lite/full)
- Use named fragments to standardize shapes; registry validates usage

Appendix AC — CDN Integration

- Only persisted GET at edge; POST → router; vary headers: auth, tenant, locale
- Stale-while-revalidate for semi-static fields; purge on mutations

Appendix AD — Planner Optimization

- Coalesce entity fetches; prefer fewer hops; align @requires with data model
- Avoid cross-subgraph fanout on hot paths; denormalize via @provides where safe

Appendix AE — Error Taxonomy

- USER_INPUT, AUTH, PERMISSION, NOT_FOUND, RATE_LIMIT
- SYSTEM, TIMEOUT, UPSTREAM, PARTIAL_DATA, UNKNOWN
- Map consistently at router; reserve extensions for machine handling
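Mapping upstream failures onto this taxonomy could look like the sketch below. The HTTP-status and error-code mapping is an assumption for illustration; each organization will draw these lines slightly differently.

```typescript
type ErrorCode =
  | "USER_INPUT" | "AUTH" | "PERMISSION" | "NOT_FOUND" | "RATE_LIMIT"
  | "SYSTEM" | "TIMEOUT" | "UPSTREAM" | "PARTIAL_DATA" | "UNKNOWN";

// Hypothetical description of a failed subgraph call.
interface UpstreamFailure {
  httpStatus?: number;
  code?: string; // transport-level code such as ECONNABORTED
}

function classify(failure: UpstreamFailure): ErrorCode {
  if (failure.code === "ECONNABORTED" || failure.code === "ETIMEDOUT") return "TIMEOUT";
  switch (failure.httpStatus) {
    case 400: return "USER_INPUT";
    case 401: return "AUTH";
    case 403: return "PERMISSION";
    case 404: return "NOT_FOUND";
    case 429: return "RATE_LIMIT";
    case 502:
    case 503: return "UPSTREAM";
    case 504: return "TIMEOUT";
  }
  if (failure.httpStatus !== undefined && failure.httpStatus >= 500) return "SYSTEM";
  return "UNKNOWN";
}

// Machine-readable details go in extensions; the message stays human-oriented.
function toGraphQLError(
  message: string,
  failure: UpstreamFailure,
  subgraph: string,
  path: Array<string | number>,
) {
  return { message, path, extensions: { code: classify(failure), subgraph } };
}

const timeoutError = toGraphQLError("Upstream timed out", { code: "ECONNABORTED" }, "orders", ["order", "total"]);
```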

Appendix AF — Security and Privacy

- PII minimization; access logs redacted; privacy review for new fields
- Threat model: query abuse, caching leaks, introspection misuse

Appendix AG — Subgraph Boundaries

- Owners maintain single source of truth; avoid duplicate business logic
- Shared libraries for auth/context only; no cross-domain data coupling

Appendix AH — Contracts and Variants

- Variant per client/app version; hide unstable fields
- Deprecate with usage telemetry; remove after SLO window

Appendix AI — Operational Budgets

- Router p95 < 120ms; planner miss rate < 1%; cache hit > 60%
- Subgraph p95 < 200ms; error < 1%; fanout < 5 per op median

Appendix AJ — Multi-Region and Failover

- Global router with geo routing; local subgraphs per region
- Cross-region failover for reads; write affinity home region

Appendix AK — DataLoader Playbook

// Keyed by viewer scope when necessary to avoid cache bleed
const byId = new DataLoader(ids => batch(ids), { cacheKeyFn: toViewerScopedKey });

Appendix AL — Persisted Operations Rollout

- Phase 1: log-only unknown ops; Phase 2: warn; Phase 3: block
- Emergency allowlist for break-glass ops with expiry

Appendix AM — Complexity Budgeting

- Per-tenant budgets; VIP tiers; adjust by concurrency and historical usage
- Dynamic budgets during incidents to shed load safely

Appendix AN — Subscriptions Topology

- Shared websocket layer; subgraphs via event bus; backpressure and limits

Appendix AO — Defer/Stream UX

- Show skeleton for deferred parts; stabilize layout to avoid CLS

Appendix AP — Observability: PromQL

# Router error rate
sum(rate(router_requests_total{status=~"5.."}[5m])) / sum(rate(router_requests_total[5m]))

# Subgraph p95
histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph))

# Planner miss
sum(rate(router_planner_miss_total[5m]))

Appendix AQ — Grafana Dashboard (Sketch JSON)

{
  "title": "Supergraph Overview",
  "panels": [
    {"type":"stat","title":"Router p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
  ]
}

Appendix AR — Alert Library (YAML)

- alert: RouterHighErrorRate
  expr: (sum(rate(router_requests_total{status=~"5.."}[5m])))/(sum(rate(router_requests_total[5m]))+1e-9) > 0.02
  for: 10m
- alert: SubgraphLatencyP95High
  expr: histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph)) > 0.300
  for: 10m

Appendix AS — CI Gates

- composition: fail on breaking changes and invalid references
- contracts: block removal of contracted fields used by clients
- persisted ops: require registry publish before router deploy

Appendix AT — Example Repos

repos/
  users-subgraph/
  orders-subgraph/
  catalog-subgraph/
  supergraph-router/
  contracts/

Appendix AU — Case Study: Checkout

- Query fetches cart (catalog), user (users), prices/tax (pricing), inventory
- Defer reviews and recommendations
- Canary: drop recs under load; preserve checkout core path

Appendix AV — Error Handling Examples

// Router error mapping
if (err.code === 'ECONNABORTED') classify('TIMEOUT');

Appendix AW — Pagination and Connections

type Connection { edges: [Edge!]!, pageInfo: PageInfo! }
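A cursor-based connection can be sketched over an in-memory list, as below. The cursor wraps the last-seen `id` in base64 so clients treat it as opaque, and the array stands in for a keyset query of the form `WHERE id > :after ORDER BY id LIMIT :first + 1`. The `cursor:` prefix and helper names are assumptions.

```typescript
interface PageInfo { hasNextPage: boolean; endCursor: string | null }
interface Edge<T> { cursor: string; node: T }
interface Connection<T> { edges: Edge<T>[]; pageInfo: PageInfo }

const CURSOR_PREFIX = "cursor:";

function encodeCursor(id: string): string {
  return Buffer.from(CURSOR_PREFIX + id, "utf8").toString("base64");
}

function decodeCursor(cursor: string): string {
  const text = Buffer.from(cursor, "base64").toString("utf8");
  if (!text.startsWith(CURSOR_PREFIX)) throw new Error("Invalid cursor");
  return text.slice(CURSOR_PREFIX.length);
}

// `sorted` must already be ordered by id ascending.
function paginate<T extends { id: string }>(
  sorted: readonly T[],
  first: number,
  after?: string,
): Connection<T> {
  const afterId = after === undefined ? null : decodeCursor(after);
  const start = afterId === null ? 0 : sorted.findIndex((item) => item.id > afterId);
  const remaining = start === -1 ? [] : sorted.slice(start);
  const page = remaining.slice(0, first);
  const last = page[page.length - 1];
  return {
    edges: page.map((node) => ({ cursor: encodeCursor(node.id), node })),
    pageInfo: {
      hasNextPage: remaining.length > first,
      endCursor: last ? encodeCursor(last.id) : null,
    },
  };
}

const products = ["p1", "p2", "p3", "p4", "p5"].map((id) => ({ id }));
const page1 = paginate(products, 2);
const page2 = paginate(products, 2, page1.pageInfo.endCursor ?? undefined);
const page3 = paginate(products, 2, page2.pageInfo.endCursor ?? undefined);
```

Because the cursor encodes a position in the sort order rather than an offset, inserts and deletes earlier in the list do not shift later pages.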

Appendix AX — Federation with Mesh/Mercurius

- Mesh to unify REST/GraphQL/gRPC into graph; use as subgraph source
- Mercurius Federation v2 for Node-based subgraphs

Appendix AY — Security: SSRF, RCE, Injection

- Strict outbound allowlists; sanitize URLs; avoid dynamic eval; validate inputs

Appendix AZ — Runbooks

- Spike in 5xx: identify subgraph; circuit break; reduce query budgets; rollback
- Planner miss surge: warm caches; review op churn; deploy plan cache increase
- Cache stampede: introduce jitter TTL; precompute hot paths
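Jittered TTLs for the cache stampede case can be sketched in one small function. The 20% spread and the injectable random source are illustrative assumptions.

```typescript
// Spread expiry over [base x (1 - spread), base x (1 + spread)] so that entries
// written at the same moment do not all expire at the same moment.
function jitteredTtlSeconds(
  baseSeconds: number,
  spread: number,
  random: () => number = Math.random,
): number {
  const factor = 1 + spread * (2 * random() - 1);
  return Math.max(1, Math.round(baseSeconds * factor));
}

// With a 300 s base and 20% spread, TTLs fall between 240 s and 360 s.
const shortest = jitteredTtlSeconds(300, 0.2, () => 0);
const middle = jitteredTtlSeconds(300, 0.2, () => 0.5);
const sample = jitteredTtlSeconds(300, 0.2);
```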

Mega FAQ (401–1000)

Q: Do we encrypt at field level?
A: For highly sensitive fields; store ciphertext; decrypt at viewer edge.
Q: Should we block introspection?
A: For public clients, yes; allow for trusted admin behind auth.
Q: How to keep router hot?
A: Warm plan and result caches; pin CPU; avoid GC pauses; keep binaries slim.
Q: Why do we see N+1 at router?
A: It’s subgraph N+1 surfacing; fix with dataloaders and batch APIs.
Q: Pagination best practice?
A: Connections with opaque cursors; avoid offset for large datasets.
Q: Can we stream errors?
A: Use incremental delivery; attach errors to path; keep UX stable.
Q: How to phase out a subgraph?
A: Stop extensions; move ownership; mark deprecated; remove after usage=0.
Q: Query cost vs depth?
A: Use cost map; depth alone is insufficient.
Q: Multi-tenant guardrails?
A: Per-tenant budgets, rate limits, caches; audit access.
Final: prefer clarity, constrain costs, measure constantly.

Appendix BA — Client Guidance and Operation Hygiene

- Use named operations and fragments; include client name/version headers
- Prefer persisted queries; avoid ad-hoc POST in production
- Request minimal fields; defer non-critical sections; paginate large lists
- Consider offline caching and stale-while-revalidate patterns

Appendix BB — Router Horizontal Scaling

- Stateless router replicas behind L4; sticky only for websockets if needed
- Preload supergraph; watch registry for changes; SIGHUP or hot-reload
- Limit max concurrent per replica; tune threadpool/event-loop

Appendix BC — Supergraph Lifecycle

- Draft composition → canary → 25% → 100%; abort on error budget burn
- Rollback by pinning previous supergraph; invalidate router caches
- Track composition ID, commit SHAs, owner, change ticket

Appendix BD — Mutations and Side-Effects

- Keep mutations in owning subgraph; avoid multi-subgraph transactional semantics
- Emit domain events; eventual consistency for projections
- Idempotency keys for retries; return stable mutation payload shapes

Appendix BE — File Uploads and Binary Data

- Prefer signed URLs via REST; graph returns metadata and URLs
- Limit upload size; virus/malware scanning; audit trails

Appendix BF — Partial Data UX

- Render available fields with skeletons; show inline warnings for gaps
- Avoid blocking primary flows on optional subgraph outages

Appendix BG — Schema Style Guide

- Nouns for types, verbs for mutations; consistent pagination
- Use ISO-8601 timestamps; Money type with currency; IDs opaque
- Avoid leaking storage/DB fields; present domain-centric names

Appendix BH — Federation with Legacy Backends

- Subgraph as facade: translate to REST/gRPC/DB; cache hot lookups
- Stabilize latency with bulk endpoints; prefer batched loaders

Appendix BI — Error Extensions Contract

{
  "errors": [
    {
      "message": "Not authorized",
      "extensions": {
        "code": "PERMISSION",
        "subgraph": "orders",
        "requestId": "..."
      },
      "path": ["order", "total"]
    }
  ]
}

Appendix BJ — CORS and Edge Security

- Strict origins; credentials only when needed; same-site cookies preferred
- HSTS, CSP; block mixed content; validate referer/origin on sensitive ops

Appendix BK — Subgraph Read Models

- Pre-compute aggregates; materialized views; denormalize sparingly
- Keep entity ownership; expose read-friendly fields

Appendix BL — Safe Defaults

- Limit depth to 10; cost budget per op; request timeout 3s
- Disable schema introspection for public clients; enable for admin

Appendix BM — Canary Recipes

- Shadow traffic to new subgraph version; compare error/latency
- Gradual router rollout of new supergraph; per-client cohort testing

Appendix BN — Edge Authorization Tokens

- Short-lived JWTs; rotate keys; anti-replay; audience and scope validated

Appendix BO — Latency Budgets

- Router < 120ms p95; subgraph < 200ms p95 each; cumulative budget by path
- Automatic brownouts when over budget: defer/omit optional fields

Appendix BP — SDL Lint Rules

- Disallow null list elements; consistent naming; no dangling types
- Require descriptions on public fields; deprecations include rationale

Appendix BQ — Plan Cache Warmers

- Periodically execute top persisted queries to warm plan/result caches
- Bust on composition changes; record hit/miss metrics

Appendix BR — Multi-Tenant Abstractions

- Tenant-aware dataloaders; tenant-scoped caches; tenant quotas and limits

Appendix BS — Backpressure and Load Shedding

- Queue caps; 429 with Retry-After; degrade non-critical fields

Appendix BT — Secrets and Config

- Router and subgraphs via secure env vars or secrets; no secrets in SDL
- Rotate credentials regularly; audit access

Appendix BU — Infra Templates (Kubernetes)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

Appendix BV — Gateway Feature Flags

- Enable/disable features by client or cohort; test in prod safely

Appendix BW — Data Privacy Regions

- Route EU users to EU subgraphs; ensure data residency with policy

Appendix BX — Operator Runbooks

- Router CPU spike → check plan miss, request surge, cache stampede
- Subgraph 5xx spike → circuit break, page owner, roll back latest change

Appendix BY — Docs and Developer Experience

- Self-serve SDL portal; examples; persisted op explorer; performance tips

Appendix BZ — Cost Controls

- Cache hot reads; batch; avoid over-fetch; suppress heavy optional fields

Observability: Panel JSON (Sketch)

{
  "title": "Federation Health",
  "panels": [
    {"type":"stat","title":"Router p95"},
    {"type":"table","title":"Subgraph 5xx by svc"},
    {"type":"graph","title":"Plan cache hit"}
  ]
}

Policies (Pseudo-OPA)

package graph.policy

violation["no_latest_images"] {
  input.kind == "Deployment"
  c := input.spec.template.spec.containers[_]
  endswith(c.image, ":latest")
}

Templates: Persisted Ops Enforcement (Edge)

location /graphql {
  # Persisted operations arrive as GET; reject every other method at the edge
  if ($request_method != GET) { return 405; }
  proxy_pass http://router;
}

Case Study — Profile Page

- User (users), purchases (orders), recommendations (recs)
- Defer recs; cache user; stream purchases

Mega FAQ (1001–1200)

Q: Do we allow arbitrary queries in prod?
A: No. Persisted queries only for public clients; log/deny unknown.
Q: How big should a subgraph be?
A: Team-sized, domain-aligned; avoid too fine-grained fragmentation.
Q: Who owns entity keys?
A: The entity’s owning subgraph; others must not redefine keys.
Q: Can we have cross-DB transactions?
A: Avoid; use events and compensation.
Q: Should we expose IDs or slugs?
A: Opaque IDs for references; slugs as fields where useful.

Mega FAQ (1201–1400)

Q: Why are depth limits insufficient?
A: Low-depth fields can be expensive; use cost maps.
Q: Can the router rewrite queries?
A: Prefer not; only for technical normalization; never change semantics.
Q: Where to authorize?
A: At edge for identity, in subgraphs for domain constraints.
Q: Can we cache POST?
A: Not at CDN; router internal caches okay with safe keys.
Q: Should we enable introspection?
A: For trusted tools; disable for public paths.

Mega FAQ (1401–1600)

Q: How to sunset a field?
A: Deprecate, track usage, communicate, remove after window.
Q: When to use @provides?
A: When one subgraph can reliably supply a field owned by another under a clear contract.
Q: How to handle heavy lists?
A: Paginate, limit, and stream; precompute where possible.
Q: Why do plan cache misses happen?
A: New ops or variable shapes; warm caches; standardize client fragments.

Mega FAQ (1601–1800)

Q: Should we allow mutations across subgraphs?
A: Keep in single owner; orchestrate via events.
Q: Rate limit at edge or router?
A: Edge for coarse limits; router for per-op budgets.
Q: Can we run router at the edge?
A: Yes with wasm/native binaries; test cold starts and limits.
Final: keep graphs lean, plans hot, and costs bounded.

Appendix CA — Gateway Deployment Patterns

- Sidecar telemetry; init to fetch supergraph; read-only FS; non-root user
- Blue/green router rollout; health gates on p95 and error rate
- Multi-tenant router fleets for isolation and quota enforcement

Appendix CB — Supergraph Delivery

- Signed supergraph bundles; checksum verification; fallback to last-good
- Progressive rollout by region and client cohort

Appendix CC — Schema Lint Pack

- Require descriptions; ban ID-as-Int; enforce Money scalar usage
- Disallow nullable lists of non-nullable elements unless justified

Appendix CD — Router Warmup and Preload

- Preload top persisted ops; synthetic traffic warmers; plan cache hydrate

Appendix CE — Hot Paths Catalog

- Identify top 20 operations by volume; publish budgets and owners
- Quarterly review of hot path latency and cost

Appendix CF — Error Budgets and Freeze

- Burn >2x for 60m: freeze router and subgraph deploys; focus on recovery
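Burn rate is conventionally the observed error rate divided by the error budget (1 minus the SLO target). The sketch below evaluates the freeze rule over per-minute samples. The `MinuteSample` shape and the 99% SLO used in the example are assumptions chosen to match the "error rate < 1%" targets elsewhere in this guide.

```typescript
// Hypothetical per-minute aggregate exported by the router.
interface MinuteSample {
  requests: number;
  errors: number;
}

// Burn rate 1.0 means the budget is being spent exactly as fast as the SLO allows.
function burnRate(samples: readonly MinuteSample[], sloTarget: number): number {
  const requests = samples.reduce((sum, s) => sum + s.requests, 0);
  const errors = samples.reduce((sum, s) => sum + s.errors, 0);
  if (requests === 0) return 0;
  const errorBudget = 1 - sloTarget;
  return errors / requests / errorBudget;
}

// Freeze deploys when the burn rate over the last `windowMinutes` exceeds the threshold.
function shouldFreezeDeploys(
  samples: readonly MinuteSample[],
  sloTarget: number,
  windowMinutes = 60,
  threshold = 2,
): boolean {
  if (samples.length < windowMinutes) return false; // not enough evidence yet
  return burnRate(samples.slice(-windowMinutes), sloTarget) > threshold;
}

const unhealthyHour: MinuteSample[] = Array.from({ length: 60 }, () => ({ requests: 1000, errors: 30 }));
const healthyHour: MinuteSample[] = Array.from({ length: 60 }, () => ({ requests: 1000, errors: 5 }));
const unhealthyBurn = burnRate(unhealthyHour, 0.99); // roughly 3x
const freeze = shouldFreezeDeploys(unhealthyHour, 0.99);
const noFreeze = shouldFreezeDeploys(healthyHour, 0.99);
```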

Appendix CG — Redaction and Privacy Tests

- Unit tests: variables redacted; logs scrubbed; traces tagged without PII

Appendix CH — Client Contract Testing

- Validate example queries per client against composed graph; diff on changes

Appendix CI — Schema Change Classes

- Safe: new types/fields with defaults; new queries
- Risky: required input fields; enum removals; field removals
- Block: breaking changes that skip the deprecation window or target fields that still have nonzero usage

Appendix CJ — Router Rate Limiting

- Token bucket per client and tenant; budgets per operation class
- 429 with Retry-After; degrade optional fields under pressure
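A token bucket that also yields a Retry-After value can be sketched as follows. Capacity, refill rate, and the per-operation `cost` argument are illustrative; a real router would keep one bucket per tenant and client, typically in a shared store.

```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Returns 0 when admitted, otherwise the Retry-After value in whole seconds.
  tryConsume(cost = 1): number {
    const t = this.now();
    const elapsedSeconds = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = t;
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return 0;
    }
    return Math.ceil((cost - this.tokens) / this.refillPerSecond);
  }
}

// One bucket per (tenant, client) pair; the key format is an assumption.
const buckets = new Map<string, TokenBucket>();
function bucketFor(tenantId: string, clientName: string, now: () => number): TokenBucket {
  const key = JSON.stringify([tenantId, clientName]);
  let bucket = buckets.get(key);
  if (!bucket) {
    bucket = new TokenBucket(2, 1, now); // burst of 2, refills 1 token per second
    buckets.set(key, bucket);
  }
  return bucket;
}

let nowMs = 0;
const bucket = bucketFor("acme", "web", () => nowMs);
const first = bucket.tryConsume(); // admitted
const second = bucket.tryConsume(); // admitted
const third = bucket.tryConsume(); // rejected, retry after 1 s
nowMs = 1000;
const fourth = bucket.tryConsume(); // admitted after refill
```

Passing a larger `cost` for expensive operation classes lets one bucket enforce both request rate and query-cost budgets.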

Appendix CK — Subgraph SLOs and Ownership

- Each subgraph: latency/error SLOs; on-call; dashboards; runbooks

Appendix CL — Observability Fields

- route=operationName, clientName, variant, subgraph, planSteps, fanout

Appendix CM — Replay and Backfills

- Use idempotent loaders; replay cached persisted ops to warm downstream caches

Appendix CN — Edge Rules

- Block GraphQL playground in prod; CSP strict; CORS allowlist

Appendix CO — Capturing Usage

- Per-field usage; deprecation candidates; client impact analysis

Appendix CP — Threat Model Highlights

- Abuse via expensive queries; cache key leaks; SSRF through resolvers
- Mitigate with cost ceilings, redaction, outbound allowlists

Appendix CQ — Synthetic Monitors

- Canary persisted ops with thresholds; alert on regressions

Appendix CR — Chaos Scenarios

- Kill one subgraph; check brownout UX and partial data resilience
- Router rollout with bad bundle; ensure last-good fallback

Appendix CS — DR and Failover

- Router multi-region anycast; subgraphs active/active or active/passive
- Test failover quarterly with evidence

Appendix CT — Access Patterns

- Public vs private graphs; partner variants; admin ops separated

Appendix CU — Governance Dashboards

- Composition frequency; breaking change attempts; usage of deprecated fields

Appendix CV — Cost Dashboards

- Router CPU/mem per op; subgraph cost per k requests; cache ROI

Appendix CW — Developer Tooling

- CLI to generate persisted ops; schema diffs; usage reports per team

Appendix CX — Education

- Playbooks for defer/stream, pagination, caching, security

Appendix CY — Backward Compatibility Windows

- 60–90 days typical; faster with client feature flags and phased rollout

Appendix CZ — Final Principles

- Stable contracts, bounded costs, observable systems, and fast, safe delivery

Operations Runbooks (Extended)

Incident: Router 5xx spike
- Compare to client/version; check subgraph breakdown; circuit break worst
- Reduce budgets; enable brownout; roll back supergraph if composition changed

Incident: Cache stampede
- Increase TTL with jitter; warmers; protect backend with concurrency caps

Incident: Planner miss surge
- Identify new ops; enforce persisted; pre-warm; client guidance

Dashboards (Sketch JSON)

{
  "title": "GraphQL Router",
  "panels": [
    {"type":"stat","title":"p95","targets":[{"expr":"histogram_quantile(0.95,sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"graph","title":"Cache Hit","targets":[{"expr":"sum(rate(router_cache_hit_total[5m]))/sum(rate(router_cache_total[5m]))"}]},
    {"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
  ]
}

Mega FAQ (1801–2000)

Q: Can we auto-defer under load?
A: Yes. Brownout mode defers/omits optional fields based on budgets.
Q: Should we log variables?
A: Redacted only; never PII; sample with care.
Q: Why do partial data errors upset clients?
A: Clients often treat any error as a full failure; educate and standardize UI patterns; keep core flows intact.
Q: How to bound cost for partners?
A: Contracts + budgets + rate limits + persisted-only.

Mega FAQ (2001–2200)

Q: Do we pin digests for router image?
A: Yes for prod; signed; provenance verified.
Q: Multi-cloud supergraph?
A: Possible with registry sync and region-local subgraphs.
Q: How to find deprecation targets?
A: Field usage telemetry sorted by zero-usage period.
Q: Are mutations cacheable?
A: No; but mutation results can prime read caches.

Mega FAQ (2201–2400)

Q: What breaks composition most?
A: Key mismatches, conflicting type ownership, incompatible field nullability.
Q: How to test subscriptions at scale?
A: Synthetic publishers; backpressure; soak tests; fanout metrics.
Q: Should we expose enums?
A: Yes with caution; include UNKNOWN; version with care.
Final: keep graphs clean, fast, and secure; optimize for maintainability.

Appendix DA — Supergraph Change Windows

- Define weekly windows for high-risk composition changes
- Auto-pause canary during incidents; require explicit resume

Appendix DB — Client Upgrade Strategy

- Contract variants per app version; deprecate with telemetry
- Feature flags map to fields; phased rollout by cohort

Appendix DC — Router Sandbox Mode

- Dry-run new supergraph in parallel; compare plans and latencies
- Toggle per-client cohort; no user impact during validation

Appendix DD — Subgraph Health Contracts

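- Liveness: process is responsive (no deadlock); keep hard dependency checks out of liveness so a DB or cache outage does not trigger restart loops
- Readiness: DB connectivity, cache reachability, migrations applied, warm caches, dependencies healthy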

Appendix DE — Gateway Canary SLOs

- p95 latency within 10% of baseline; error rate delta < 0.5%
- Plan cache hit within 5% of baseline after 10 minutes
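
The thresholds above can be turned into an automated gate. The following is a minimal sketch, not a definitive implementation: the `RouterMetrics` shape and the `evaluateCanary` name are hypothetical. It assumes "error rate delta < 0.5%" and "within 5%" for plan cache hit both mean percentage points, and that metrics are sampled after the 10-minute warm-up.

```typescript
// Hypothetical metric snapshot; field names are assumptions for illustration.
interface RouterMetrics {
  p95Ms: number;        // p95 latency in milliseconds
  errorRate: number;    // fraction of failed requests, 0..1
  planCacheHit: number; // plan cache hit ratio, 0..1
}

interface CanaryVerdict {
  pass: boolean;
  reasons: string[];
}

// Compares canary metrics to the baseline using the SLO thresholds listed above.
function evaluateCanary(baseline: RouterMetrics, canary: RouterMetrics): CanaryVerdict {
  const reasons: string[] = [];
  if (canary.p95Ms > baseline.p95Ms * 1.1) {
    reasons.push("p95 latency more than 10% above baseline");
  }
  if (canary.errorRate - baseline.errorRate >= 0.005) {
    reasons.push("error rate delta at or above 0.5 points");
  }
  if (baseline.planCacheHit - canary.planCacheHit > 0.05) {
    reasons.push("plan cache hit more than 5 points below baseline");
  }
  return { pass: reasons.length === 0, reasons };
}
```

Returning the list of reasons rather than a bare boolean makes it easier to attach the verdict to a rollout record.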

Appendix DF — Traffic Shaping

- Shift by client, region, or op-class; protect hot paths first
- Brownout overrides for optional fields; defer under load

Appendix DG — Subgraph Outage Playbook

- Circuit break; mark fields as deferred/omitted; communicate status
- Warm fallback caches where possible; page on-call and track MTTR
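
The "circuit break" step could look roughly like the sketch below. The class and its parameters are hypothetical and not tied to any particular router; it assumes a simple consecutive-failure threshold and a fixed cooldown, with the clock injected so the behavior can be tested.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Returns true if a request to the subgraph may be attempted.
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // allow probe traffic once the cooldown has elapsed
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe, or too many consecutive failures, (re)opens the breaker.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

While `allow()` returns false, the router can mark the affected fields as deferred or omitted instead of calling the subgraph.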

Appendix DH — Writer Isolation

- Route writes to home region; read replicas elsewhere; expose version stamps
- Avoid multi-subgraph transactions; orchestrate via events

Appendix DI — Cost Attribution

- Attribute router CPU/mem per operation and client
- Attribute subgraph cost by op fanout and latency; report to owners

Appendix DJ — SDL Conventions

- Use @deprecated with reason; include removal date
- Avoid leaking storage keys; opaque IDs with node lookup when needed

Appendix DK — Query Shape Catalog

- Catalog top shapes; provide named fragments; enforce via lints

Appendix DL — Subgraph Release Trains

- Batch subgraph releases weekly; align with supergraph compositions
- Reduce composition churn and planner cache misses

Appendix DM — Router Resource Profiles

- CPU-bound vs IO-bound; adjust threads; pre-alloc arenas; GC tuning

Appendix DN — Persistence and Caches

- Redis/memcached for router result cache; LRU; per-tenant partitions

Appendix DO — Schema Diff Bots

- PR bot annotates diffs, breaking risks, usage impact, owners to review

Appendix DP — SLA and Budgets Dashboard

{
  "title": "Supergraph SLOs",
  "panels": [
    {"type":"stat","title":"Router SLO"},
    {"type":"stat","title":"Subgraph Error Budget Burn"}
  ]
}

Appendix DQ — Access Reviews

- Quarterly review of client access, variants, and scopes
- Revoke stale keys; rotate credentials

Appendix DR — Secrets and Key Management

- JWKS rotation; HSM/KMS for signing; key expiry policies

Appendix DS — Regionalization and Residency

- Split subgraphs per region; forbid cross-region PII via policy

Appendix DT — Playground and Tooling

- Internal dev-only; pre-auth; rate limits; operation registry integration

Appendix DU — Observability Annotations

- Tag spans with op class (hotpath|bulk|admin), client tier, cost estimate

Appendix DV — Rollback Evidence

- Store comparison charts for before/after; attach to incident and PR

Appendix DW — Partner Integrations

- Contract variants; SLA per partner; sandbox environments; sample data

Appendix DX — Education Path

- 101 GraphQL; 201 Federation; 301 Observability and Cost; 401 Security

Appendix DY — Data Classifications

- PUBLIC, INTERNAL, CONFIDENTIAL, SECRET; field annotations; policy gates
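
A policy gate over these levels can be sketched as an ordered comparison. The field names and the `canRead` helper below are hypothetical; in practice the annotations would likely come from SDL directives rather than a hard-coded map. Unannotated fields are treated as SECRET here, which is an assumption chosen so the gate fails closed.

```typescript
// Ordered from least to most sensitive, matching the levels listed above.
const LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "SECRET"] as const;
type Classification = (typeof LEVELS)[number];

// Hypothetical field annotations for illustration.
const fieldClassification = new Map<string, Classification>([
  ["Product.title", "PUBLIC"],
  ["User.name", "INTERNAL"],
  ["User.ssn", "SECRET"],
]);

// A client may read a field only if its clearance is at least the field's level.
function canRead(field: string, clearance: Classification): boolean {
  const level = fieldClassification.get(field) ?? "SECRET"; // fail closed
  return LEVELS.indexOf(clearance) >= LEVELS.indexOf(level);
}
```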

Appendix DZ — Final Operating Principles

- Contracts first; costs bounded; caches warm; rollouts safe; evidence always

Extended Templates

# Money scalar
scalar Money

# Node interface
interface Node { id: ID! }

# Router Deployment hardened
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true

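// Router middleware (pseudocode): ctx and estimateCost are illustrative, not a specific router API
function beforeResolve(ctx) {
  // Record the start time for this request
  ctx.vars.requestStart = Date.now();
  // Estimate cost up front so over-budget operations fail before any subgraph fetch
  ctx.vars.cost = estimateCost(ctx.operation);
  if (ctx.vars.cost > ctx.budget) throw new Error('COST_EXCEEDED');
}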
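
The middleware above calls `estimateCost` without defining it. One possible shape is a recursive walk that charges one unit per field and multiplies child cost by the expected list size. This is a sketch under stated assumptions: the `Selection` type is a hypothetical, simplified stand-in for a parsed operation, and a real router would walk the GraphQL AST and read list sizes from schema annotations or pagination arguments.

```typescript
// Hypothetical, simplified selection tree.
interface Selection {
  name: string;
  listSize?: number;      // assumed page size when the field returns a list
  children?: Selection[];
}

// Cost = 1 per field; a list field multiplies the cost of its children.
function estimateCost(sel: Selection): number {
  const childCost = (sel.children ?? []).reduce((sum, c) => sum + estimateCost(c), 0);
  return 1 + (sel.listSize ?? 1) * childCost;
}
```

Nested lists multiply, so a 50-item list containing a 10-item list costs far more than either alone, which is the behavior a budget check needs to catch.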

Mega FAQ (2401–2600)

Why do plan cache misses spike?
New ops or contracts invalidate cached plans; run warmers, reduce op churn, and guide clients toward shared fragments.
Can we proxy REST as a subgraph?
Yes via Mesh or facade; ensure batching and caching to avoid N+1.
Is GraphQL suitable for all endpoints?
Not binary uploads; keep those in REST; graph returns metadata.
How to detach a subgraph?
Move ownership; deprecate fields; drain traffic; remove from supergraph.

Mega FAQ (2601–2800)

Should public clients have separate router?
Prefer separate fleet with stricter policies and caching.
How to test composition failures?
Inject schema conflicts in staging; verify gates block deploys.
Are enums safe for partners?
Include UNKNOWN; evolve carefully; provide contracts per partner.
When to paginate vs window?
Paginate user lists; window analytics; stream where appropriate.

Mega FAQ (2801–3000)

Defer/stream impact on SEO?
Server-render critical content; hydrate incrementals; use placeholders.
Should we cap list sizes?
Yes; hard caps; client-friendly errors; links to next pages.
Is GraphQL cacheable at CDN?
Persisted GET only; vary keys strictly; purge on mutations.
Final: simplicity wins. Keep schemas clean, operations stable, and operational practices tight.

Mega FAQ (3001–3200)

Cost budgets per tier?
Yes: free, pro, enterprise with increasing caps and support.
Run router at edge?
Possible; ensure plan cache warm, memory caps, and cold-start tests.
Multi-tenant fairness?
Weighted queues, per-tenant budgets, strict isolation for heavy tenants.
Last word: observable, predictable, and efficient graphs at scale.

Appendix EA — Partner Sandbox and Throttles

- Separate variants and routers; synthetic data; strict budgets and persisted-only

Appendix EB — Budgeted Operations Catalog

- Define allowed ops per client tier; publish limits; auto-annotate violations

Appendix EC — Evidence Automation

- Attach dashboards, traces, composition diffs, and PRs to every rollout

Appendix ED — Education and Checklists

- Pre-merge: schema lint, composition ok, deprecations reviewed, cost within budget
- Pre-release: caches warm, canary plan, rollback steps, owners on-call

Appendix EE — Closing Notes

- Federate when teams are ready; keep contracts lean; measure and iterate

Final FAQ (3201–3400)

Should we allow inline fragments everywhere?
Yes, but encourage named fragments for reusability and cacheability.
Is it OK to stitch REST services into the graph?
Yes via Mesh/facades with batching; watch latency and error surfaces.
Can we expose GraphiQL in prod?
Not for public; internal behind auth and rate limits only.
When to split a subgraph?
When ownership or scaling diverges; avoid ping-ponging entities.
Final: stable supergraph, fast router, efficient subgraphs—own your slice well.

Quick Reference

- Persisted ops only for public
- Bound costs and depths
- Warm plan/result caches
- Observe p95, error, fanout
- Defer/stream non-critical fields

Troubleshooting Index

- High p95: check subgraph; cache hit; plan miss; hot paths
- Many 5xx: classify errors; circuit break; rollback recent changes
- Composition fail: key conflicts; nullability; directives mismatch

Additional FAQ (3401–3600)

How to handle client cache invalidation?
Return entity version fields; use cache policies; purge on mutation.
Fallback when a subgraph is slow?
Defer optional fields; set timeouts; render partial data with notices.
Do we expose node interface?
When helpful for universal fetches; ensure opaque IDs.
Final: contracts, costs, caches, and care.

Closing

Federation succeeds when domain ownership is clear, contracts are stable, and router/subgraphs are observable and resilient. Keep the supergraph lean, plan caches warm, and client operations disciplined.

Appendix EF — Gateway Cold Start Strategy

- Pre-bake router image with dependencies; lazy-load optional modules
- Warm plan/result caches after deploy using top persisted ops
- Keep supergraph bundle local with checksum verification

Appendix EG — Tenant Fairness and Quotas

- Weighted fair queue per tenant; enforce per-op and per-minute budgets
- Distinct queues for hot paths vs bulk analytics to avoid starvation
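
The per-minute budget half of this can be sketched with a fixed one-minute window per tenant. The sketch does not implement the weighted fair queue itself. The `TenantBudget` class, its cost units, and the choice to reject unknown tenants are all assumptions for illustration.

```typescript
class TenantBudget {
  private readonly windows = new Map<string, { start: number; spent: number }>();
  private readonly perMinute: Map<string, number>; // tenant -> cost units per minute
  private readonly now: () => number;

  constructor(perMinute: Map<string, number>, now: () => number = () => Date.now()) {
    this.perMinute = perMinute;
    this.now = now;
  }

  // Charges `cost` to the tenant's current window; returns false if it would exceed the budget.
  tryCharge(tenant: string, cost: number): boolean {
    const limit = this.perMinute.get(tenant);
    if (limit === undefined) return false; // unknown tenants are rejected
    const t = this.now();
    let w = this.windows.get(tenant);
    if (w === undefined || t - w.start >= 60000) {
      w = { start: t, spent: 0 }; // start a fresh one-minute window
      this.windows.set(tenant, w);
    }
    if (w.spent + cost > limit) return false;
    w.spent += cost;
    return true;
  }
}
```

A fixed window allows short bursts at window boundaries; a sliding window or token bucket would smooth that out at the cost of more state.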

Appendix EH — Error Budgets per Path

- Track error budgets per operation class; freeze risky schema changes on burn
- Route suspicious traffic to lower-cost variants during incidents

Appendix EI — Subgraph Cache Contracts

- Define cache invariants for read models; TTLs; invalidation hooks on mutations
- Expose version fields and lastUpdated timestamps for client-side reconciliation
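
On the client side, version fields and timestamps allow a simple "newest wins" merge. The sketch below assumes a monotonically increasing `version` per entity and ISO-8601 `lastUpdated` strings; the `Versioned` interface and `reconcile` helper are hypothetical names.

```typescript
interface Versioned {
  id: string;
  version: number;     // assumed to increase monotonically per entity
  lastUpdated: string; // ISO-8601 timestamp
}

// Keeps the cached copy unless the incoming one is strictly newer.
function reconcile<T extends Versioned>(cached: T | undefined, incoming: T): T {
  if (cached === undefined) return incoming;
  if (incoming.version > cached.version) return incoming;
  if (
    incoming.version === cached.version &&
    Date.parse(incoming.lastUpdated) > Date.parse(cached.lastUpdated)
  ) {
    return incoming;
  }
  return cached;
}
```

This guards against a slow response overwriting fresher data that arrived from a later mutation.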

Appendix EJ — Router Memory Hygiene

- Cap result size; stream large lists; enforce max response bytes
- Tune alloc arenas; monitor GC pauses; pre-size buffers for hot paths

Appendix EK — Partner Governance

- Separate partner variants with strict schemas and cost ceilings
- Contract SLAs, deprecation calendars, and emergency kill switches

Appendix EL — Security Headers and TLS

- HSTS, CSP (report-only then enforce), COOP/COEP for isolation
- mTLS to subgraphs; pin CA; rotate certs automatically

Appendix EM — Router Feature Flags

- Gradually enable planner optimizations, caching strategies, and brownout rules
- Flags are persisted and audited; roll back instantly on regression

Appendix EN — Cost Visibility for Teams

- Dashboard cost per field and operation; show top offenders per team
- Quarterly reviews to reduce over-fetch and normalize shapes

Appendix EO — Data Residency Enforcement

- Policy denies field access outside region; annotate SDL with residency class
- Router selects region-local subgraphs; fallback only for public data

Appendix EP — SDL Documentation Automation

- Generate docs per variant; include usage charts and deprecation timelines
- Link to persisted operations catalog and client examples

Appendix EQ — Blue/Green Subgraphs

- Run v1 and v2 side-by-side; router targets cohort; compare KPIs
- Drain old after success; archive metrics and composition snapshot

Appendix ER — Incident Drill Catalog

- Subgraph timeout storm; router cache miss spike; partner abuse burst
- Each drill has playbook, metrics targets, and evidence bundle template

Appendix ES — Access Tokens and JWKS

- Rotate signing keys; cache JWKS with short TTL; fail closed on mismatch
- Support key rollover with overlapping validity windows
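
Rollover with overlapping windows can be sketched as a lookup by `kid` plus a validity check. Note that the validity window fields are an assumption: a standard JWKS entry does not carry them, so they would have to come from your own key metadata. `SigningKey` and `selectKey` are hypothetical names, and signature verification itself is out of scope here.

```typescript
interface SigningKey {
  kid: string;
  notBefore: number; // epoch milliseconds (assumed metadata, not part of a standard JWK)
  notAfter: number;  // epoch milliseconds
}

// Returns the key for a token's `kid` if it is inside its validity window, else null (fail closed).
function selectKey(keys: SigningKey[], kid: string, now: number): SigningKey | null {
  const key = keys.find((k) => k.kid === kid);
  if (key === undefined) return null;
  if (now < key.notBefore || now > key.notAfter) return null;
  return key;
}
```

During the overlap both the old and new key resolve, so tokens signed just before rotation keep validating until the old window closes.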

Appendix ET — Plan Cache Engineering

- Keyed by signature and stable variable shapes; bucketized for memory caps
- Evict LRU; protect hot entries; export hit/miss and evictions
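
A minimal LRU with exported counters might look like the following. It relies on `Map` preserving insertion order, and it does not implement bucketing or hot-entry protection. `PlanCache` is a hypothetical name and the string key stands in for the operation signature.

```typescript
class PlanCache<V> {
  private readonly entries = new Map<string, V>(); // Map iterates in insertion order
  private readonly capacity: number;
  hits = 0;
  misses = 0;
  evictions = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) {
      this.misses += 1;
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    this.hits += 1;
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
        this.evictions += 1;
      }
    }
    this.entries.set(key, value);
  }
}
```

The three counters map directly onto the hit, miss, and eviction metrics the bullet asks to export.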

Appendix EU — Result Cache Engineering

- Respect cache-control hints; per-tenant segments; compression on
- Invalidate on mutation topics; jitter TTLs to avoid herds
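
TTL jitter is small enough to show in full. The sketch below assumes a symmetric jitter fraction around the base TTL; the function name and the 10% default are illustrative choices, and the random source is injectable for testing.

```typescript
// Spreads expirations across roughly [base * (1 - f), base * (1 + f)] so entries
// written at the same moment do not all expire together.
function jitteredTtlMs(
  baseMs: number,
  fraction: number = 0.1,
  rand: () => number = Math.random,
): number {
  const offset = (rand() * 2 - 1) * fraction; // in [-fraction, +fraction)
  return Math.round(baseMs * (1 + offset));
}
```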

Appendix EV — Subgraph Sandboxes

- Ephemeral envs for schema experiments; compose against staging supergraph
- Synthetic data sets; abuse testing; performance benchmarks

Appendix EW — Ops CLI

- Commands: diff-supergraph, warm-caches, dump-hotpaths, block-op, lift-freeze

Appendix EX — Education Tracks

- Clients: operation hygiene and caching; Subgraphs: data loaders and auth
- Operators: observability deep dive; Security: privacy and attack surfaces

Appendix EY — Privacy Reviews

- Checklist per new field: classification, retention, residency, purpose
- Automated scanning to flag risky names and patterns (e.g., ssn, email)
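
The automated scan can start as a name-based filter over new fields. The pattern list below is hypothetical beyond the `ssn` and `email` examples given above, and a name match only flags a field for human review; it does not classify it.

```typescript
// Hypothetical pattern list; extend it to match your own naming conventions.
const RISKY_PATTERNS: RegExp[] = [/ssn/i, /email/i, /passport/i, /birth/i, /phone/i];

// Returns the field names that match any risky pattern.
function flagRiskyFields(fieldNames: string[]): string[] {
  return fieldNames.filter((name) => RISKY_PATTERNS.some((p) => p.test(name)));
}
```

Running this in the schema diff step keeps the privacy checklist from depending on reviewers spotting sensitive names by eye.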

Appendix EZ — Final Practices

- Own domains; keep contracts minimal; budget costs; measure relentlessly

Mega FAQ (3601–4000)

Why do we still see timeouts with cached ops?
Subgraph latency dominates; cache helps router but not slow sources.
Should we allow multi-hop entity chains?
Limit; deep chains explode latency. Flatten with views or denormalization.
How to guard against schema sprawl?
Lint packs, ownership, review gates, and usage-based pruning.
Can clients pick variants dynamically?
Yes via headers; enforce allowlists; migrate gradually.
How to handle rogue clients?
Block at edge; revoke keys; quarantine ops; notify owners.
Router CPU spikes after deploy?
The plan cache is cold after a deploy; run warmers, reduce composition churn, and profile GC.
Are batched resolvers always better?
Usually, but watch peak memory; cap batch sizes; stream results.
Should we allow file uploads through graph?
Prefer signed URLs via REST; keep graph metadata-only.
Federation vs BFFs?
Federation centralizes contracts; BFFs can coexist as clients of the graph.
Final: clarity, constraints, and continuous care keep graphs healthy.