GraphQL Federation for Microservices and API Gateways (2025)

Oct 26, 2025
graphql, federation, api-gateway, microservices

Federation enables teams to own subgraphs while exposing a unified API. This guide focuses on practical schema design and operations.

Executive summary

  • Define clear ownership and boundaries; avoid cross-subgraph coupling
  • Persisted queries and operation registry for safety/perf; cache where stable
  • Monitor resolver costs; implement DDoS and complexity limits

Subgraphs and composition

  • Entity references; keys; value types; composed schema pipelines

Gateway concerns

  • AuthN/Z; caching; persisted queries; complexity analysis; timeouts/retries

Deployment

  • Versioned subgraphs; canary; automated composition checks; contracts

Observability

  • Trace resolvers; cost budgets; rate limiting per client/operation

FAQ

Q: When to choose federation vs monolithic GraphQL?
A: Federation for large orgs with clear domain ownership; monolith for small teams to avoid overhead.


1) What Is Federation?

  • Split a single graph across subgraphs owned by domain teams
  • Compose at a router/gateway into a single supergraph

2) Core Components

- Router/Gateway (Apollo Router or Apollo Gateway, GraphQL Mesh, Mercurius gateway)
- Subgraphs (domain services) exposing GraphQL schemas with federation directives
- Schema registry and composition pipeline

3) Entities and Keys

# Example subgraph entity
type User @key(fields: "id") {
  id: ID!
  name: String
}

4) Reference Resolution

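// __resolveReference in the subgraph that owns User.
// Ctx is this sketch's per-request context; users.byId is assumed to be a batched lookup.
interface Ctx {
  users: { byId(id: string): Promise<{ id: string; name?: string } | null> };
}

export const User = {
  // ref carries the key fields declared in @key (here: id)
  __resolveReference(ref: { id: string }, ctx: Ctx) {
    return ctx.users.byId(ref.id);
  }
};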
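
To see where __resolveReference fits: the router sends a subgraph a list of entity representations (objects carrying __typename plus the key fields), and the subgraph dispatches each one to the matching reference resolver. The following is a minimal sketch of that dispatch, assuming an in-memory store in place of ctx.users; Representation, resolveEntities, and userStore are illustrative names, not library APIs.

```typescript
// Sketch: dispatching entity representations to reference resolvers.
// All names here (Representation, resolveEntities, userStore) are illustrative.
type Representation = { __typename: string; [key: string]: unknown };
type ReferenceResolver = (ref: Representation) => unknown;

const userStore = new Map<string, { id: string; name: string }>([
  ["u1", { id: "u1", name: "Ada" }],
  ["u2", { id: "u2", name: "Lin" }],
]);

const referenceResolvers: Record<string, ReferenceResolver> = {
  User: (ref) => userStore.get(String(ref.id)) ?? null,
};

// Resolve each representation in order; unknown types or ids yield null.
function resolveEntities(representations: Representation[]): unknown[] {
  return representations.map((ref) => {
    const resolver = referenceResolvers[ref.__typename];
    return resolver ? resolver(ref) : null;
  });
}

const entities = resolveEntities([
  { __typename: "User", id: "u2" },
  { __typename: "User", id: "missing" },
]);
```

Results come back in the same order as the representations, which is what lets the router merge them into the parent response.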

5) Composition and Contract

  • Use schema registry (Apollo/GraphOS or open-source) to validate
  • Contracts hide fields/types for specific clients

6) Query Planning

  • Router splits query across subgraphs; stitches results
  • Optimize with @requires, @provides, and proper entity boundaries
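
As a rough mental model of query planning, the router groups requested fields by the subgraph that resolves them and issues one fetch per group. The toy planner below illustrates only that grouping step; fieldOwners is a hypothetical ownership map, and a real planner also orders fetches and resolves entity keys between hops.

```typescript
// Sketch: group requested fields by owning subgraph (toy planner).
// fieldOwners is a hypothetical ownership map, not a real router structure.
const fieldOwners: Record<string, string> = {
  "User.name": "users",
  "User.email": "users",
  "User.orders": "orders",
  "Order.total": "orders",
};

type Fetch = { subgraph: string; fields: string[] };

function planFetches(requested: string[]): Fetch[] {
  const bySubgraph = new Map<string, string[]>();
  for (const field of requested) {
    const owner = fieldOwners[field];
    if (!owner) throw new Error(`No subgraph resolves ${field}`);
    const list = bySubgraph.get(owner) ?? [];
    list.push(field);
    bySubgraph.set(owner, list);
  }
  // One fetch per subgraph, in first-seen order.
  return Array.from(bySubgraph, ([subgraph, fields]) => ({ subgraph, fields }));
}

const plan = planFetches(["User.name", "User.orders", "User.email"]);
```

Fewer groups means fewer network hops, which is why entity boundaries that keep hot-path fields in one subgraph tend to plan well.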

7) Preventing N+1

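// DataLoader pattern: create one loader per request (not a shared global) so cached values never leak between users
import DataLoader from 'dataloader';
// batchGetUsers must return results in the same order as ids
const userLoader = new DataLoader((ids: readonly string[]) => batchGetUsers(ids));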
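
To show why batching removes N+1, here is a hand-rolled stand-in for the DataLoader library (an assumption made so the sketch is self-contained; it omits DataLoader's per-key caching). Loads requested in the same tick are coalesced into a single batch call.

```typescript
// Sketch: a tiny batching loader. Not the DataLoader library; no caching.
type BatchFn<K, V> = (keys: K[]) => Promise<V[]>;
type Pending<K, V> = {
  key: K;
  resolve: (value: V) => void;
  reject: (error: unknown) => void;
};

function createLoader<K, V>(batchFn: BatchFn<K, V>) {
  let queue: Pending<K, V>[] = [];

  async function flush(): Promise<void> {
    const batch = queue;
    queue = [];
    try {
      // batchFn must return values in the same order as the keys it receives.
      const values = await batchFn(batch.map((item) => item.key));
      batch.forEach((item, i) => item.resolve(values[i] as V));
    } catch (error) {
      batch.forEach((item) => item.reject(error));
    }
  }

  return {
    load(key: K): Promise<V> {
      return new Promise<V>((resolve, reject) => {
        queue.push({ key, resolve, reject });
        // The first enqueue in a tick schedules exactly one flush.
        if (queue.length === 1) void Promise.resolve().then(flush);
      });
    },
  };
}

let batchCalls = 0; // counts round trips to the (fake) data source
const tinyUserLoader = createLoader<string, { id: string }>(async (ids) => {
  batchCalls += 1;
  return ids.map((id) => ({ id }));
});
```

Two resolvers calling tinyUserLoader.load in the same tick trigger one call to the batch function instead of two.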

8) Caching

- CDN for GET persisted queries
- Router cache for query plans and results (TTL + cache hints)
- Edge caches per viewer for personalization

9) APQ and Persisted Queries

  • APQ reduces payload; persisted queries lock down operations
  • Operation registry with safelist; block arbitrary queries in prod

10) AuthN/Z

- JWT/OAuth at edge; propagate identity to subgraphs via headers or context
- Field-level auth: directive-based checks; schema-policy integration
- Multi-tenant scoping: orgId in context; enforce in resolvers

11) Rate Limiting and Abuse

- Edge/L7 limits per token/IP; router-level query cost budgets
- Subgraph local limits for hotspots

12) Complexity and Depth Limits

- Cost map per type/field; reject operations that exceed the budget; per-tenant budgets
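
A depth guard is the simpler half of this section. The sketch below computes depth over a simplified selection tree (an assumption for brevity; a real guard would walk the parsed GraphQL document) and rejects queries over a limit.

```typescript
// Sketch: depth check over a simplified selection tree (not a real GraphQL AST).
type Selection = { name: string; children?: Selection[] };

function depthOf(selection: Selection): number {
  const children = selection.children ?? [];
  if (children.length === 0) return 1;
  return 1 + Math.max(...children.map((child) => depthOf(child)));
}

function assertWithinDepth(root: Selection, maxDepth: number): void {
  const depth = depthOf(root);
  if (depth > maxDepth) {
    throw new Error(`Query depth ${depth} exceeds limit ${maxDepth}`);
  }
}

// user -> orders -> items -> sku
const sampleQuery: Selection = {
  name: "user",
  children: [
    { name: "orders", children: [{ name: "items", children: [{ name: "sku" }] }] },
  ],
};
const sampleDepth = depthOf(sampleQuery);
```

Depth alone does not capture list fan-out, so it is best paired with a cost estimate.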

13) Errors and Retries

- Distinguish user vs system errors; partial data with errors array
- Retry idempotent subgraph calls with backoff; circuit breakers on failures
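
One way to express "retry idempotent calls with backoff" is capped exponential backoff with full jitter, guarded so that mutations are never replayed. The defaults below (100 ms base, 2 s cap, 3 attempts) are illustrative assumptions, not recommendations from any particular router.

```typescript
// Sketch: capped exponential backoff with full jitter, plus a retry guard.
type OperationKind = "query" | "mutation";

function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 2000,
  rand: () => number = Math.random, // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling); // full jitter: uniform in [0, ceiling)
}

function shouldRetry(kind: OperationKind, attempt: number, maxAttempts = 3): boolean {
  // Only idempotent reads are retried; mutations are never replayed automatically.
  return kind === "query" && attempt < maxAttempts;
}
```

Jitter spreads retries out so that many clients failing at once do not hit the recovering subgraph in lockstep.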

14) Observability

- OpenTelemetry traces: edge → router → subgraphs; include operation names
- Metrics: p95 latency, error rate, cache hit, planner misses, subgraph fanout
- Logs: redacted variables, request IDs, client name/version

15) Schema Governance

- PR checks: composition, breaking changes, deprecations
- Contract tests: consumer-driven; example queries validated

16) CI/CD Flow

- Subgraph schema lint → publish to registry → compose → router rollout
- Canary new composition; rollback on composition or runtime regressions

17) Router Config (Sketch)

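# Illustrative pseudo-config only: Apollo Router itself is configured in YAML (router.yaml) and its key names differ
[supergraph]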
subgraphs = ["users", "orders", "catalog"]

[cors]
origins = ["https://app.example.com"]

[telemetry]
otlp = { endpoint = "otel:4317" }

[caching]
plan_ttl = "5m"

18) Subgraph Examples

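# Orders subgraph: adds orders to the User entity owned by the users subgraph
extend type User @key(fields: "id") {
  id: ID! @external
  # Key fields are always available to the resolver, so no @requires is needed
  orders: [Order]
}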

19) Dataloaders at Subgraphs

export const resolvers = {
  Query: { user: (_, { id }, ctx) => ctx.users.byId(id) },
  User: { orders: (u, _, ctx) => ctx.orders.byUserIds.load(u.id) }
};

20) Edge Caching with CDNs

  • GET persisted queries only
  • Vary on auth/tenant headers if necessary; use cache hints

21) Security

- Disallow introspection in prod for public clients; allow for trusted admin
- Input validation; query cost ceilings; depth limits; timeouts
- PII minimization; field-level encryption where necessary

22) Versioning and Deprecations

  • Avoid versioning the graph; use deprecations and contracts
  • Remove after deprecation window with usage telemetry
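
The removal rule can be written down as a small check: a deprecated field is removable only once the deprecation window has elapsed and telemetry shows no recent usage. FieldUsage and the 90-day default are illustrative assumptions.

```typescript
// Sketch: decide whether a deprecated field can be removed.
type FieldUsage = { field: string; deprecatedAt: Date; requestsLast30d: number };

const DAY_MS = 24 * 60 * 60 * 1000;

function canRemove(usage: FieldUsage, now: Date, windowDays = 90): boolean {
  const elapsedDays = (now.getTime() - usage.deprecatedAt.getTime()) / DAY_MS;
  // Both conditions must hold: window elapsed and zero recent requests.
  return elapsedDays >= windowDays && usage.requestsLast30d === 0;
}
```

Running a check like this in CI turns "remove after the window" from a convention into a gate.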

23) Migration Playbooks

- REST → GraphQL: facade router; strangler pattern; move domains incrementally
- Monolith → subgraphs: extract entities one domain at a time; measure latency

24) Testing Strategy

- Unit: resolvers and policies
- Integration: router + subset of subgraphs with mock HTTP
- Contract: example operations validated against composed schema
- E2E: critical app flows with persisted queries

25) Failure Modes

- Subgraph outage: partial data; error isolation; fallback content
- Composition failure: block rollout; alert; fix schema conflicts
- Hot path blowups: query cost guard; cache; paginate

26) Templates and Repos

subgraphs/users
  schema.graphql
  src/
subgraphs/orders
router/
  supergraph.graphql (generated)
  router.toml

27) Mega FAQ (1–400)

  1. Should I start with federation?
    No—start with a single graph; federate when teams/domain boundaries are clear.

  2. How many subgraphs?
    As many as team ownership and latency allow—avoid excessive fanout.

  3. N+1 in router or subgraph?
    Fix in subgraph with dataloaders; router should not hide subgraph inefficiency.

  4. Do I need a registry?
    Yes for safe composition, contracts, and usage insights.

...


JSON-LD

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "GraphQL Federation and API Gateway (2025)",
  "description": "Production-grade guide to GraphQL federation: routers, subgraphs, composition, caching, security, observability, and operations.",
  "datePublished": "2025-10-28",
  "dateModified": "2025-10-28",
  "author": {"@type":"Person","name":"Elysiate"}
}
</script>




Appendix A — Supergraph Architecture

- Router as a stateless layer; horizontal scale; fast startup; hot reload of supergraph
- Subgraphs own domain models; strict boundaries; clear entity ownership
- Registry for composition and contracts; CI gates for safe rollout

Appendix B — Router Configuration Patterns

# Router config sketch (illustrative pseudo-config; Apollo Router itself is configured in YAML with different key names)
[supergraph]
listen = "0.0.0.0:4000"

[cors]
origins = ["https://app.example.com"]
allow_credentials = true

[headers]
forward = ["authorization", "x-tenant-id", "x-request-id"]

[telemetry]
exporter = "otlp"
endpoint = "otel:4317"

[caching]
plan_cache_ttl = "10m"
result_cache_ttl = "30s"

[timeouts]
overall = "3s"
subgraph = "2.5s"

Appendix C — Composition Workflow and Registry

- Subgraph PR: schema lint → publish to registry as draft → composition check
- Contract variants per client → hide internal fields and federated joins
- Composition promotion: canary router loads new supergraph; observe KPIs

Appendix D — Caching and Hints

# Cache control hints
# Note: @cacheControl applies to types and fields, not to the schema itself; the
# default max-age (e.g. 60s) is set in the server's cache-control configuration.

type Product @key(fields: "id") {
  id: ID!
  name: String @cacheControl(maxAge: 300)
  price: Money @cacheControl(scope: PRIVATE)
}

- Edge cache GET persisted queries; vary on viewer keys if needed
- Router result cache + per-field cache hints; subgraph-level HTTP caching

Appendix E — Operation Registry and Persisted Queries

- Safelist persisted operations; block ad-hoc queries in prod
- Client name/version required; roll out new ops via registry publish
- CDN stores GET /?hash=... with long TTL + revalidation

Appendix F — Query Cost and Depth Guards

type CostMap = Record<string, number>; // type.field → cost

function estimateCost(selectionSet: any, costMap: CostMap): number {
  // Walk AST; sum costs; account for list multipliers
  return 0; // impl omitted
}

// Reject if cost > tenantBudget or depth > maxDepth
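
The stub above leaves the walk unimplemented. One possible shape for it, assuming a simplified selection tree instead of the real GraphQL AST and an assumed listSize per list field, is sketched below; estimateSelectionCost and withinBudget are illustrative names.

```typescript
// Sketch: cost estimation with list multipliers over a simplified selection tree.
// Cost keys are "Type.field"; listSize is the assumed page size of a list field.
type CostSelection = {
  parentType: string;
  field: string;
  listSize?: number;
  children?: CostSelection[];
};

function estimateSelectionCost(
  sel: CostSelection,
  costs: Record<string, number>,
  defaultCost = 1,
): number {
  const own = costs[`${sel.parentType}.${sel.field}`] ?? defaultCost;
  const childCost = (sel.children ?? []).reduce(
    (sum, child) => sum + estimateSelectionCost(child, costs, defaultCost),
    0,
  );
  // Children of a list field are paid once per expected element.
  return own + (sel.listSize ?? 1) * childCost;
}

function withinBudget(sel: CostSelection, costs: Record<string, number>, tenantBudget: number): boolean {
  return estimateSelectionCost(sel, costs) <= tenantBudget;
}

const sampleCosts: Record<string, number> = { "Query.user": 1, "User.orders": 5, "Order.total": 2 };
const sampleSelection: CostSelection = {
  parentType: "Query",
  field: "user",
  children: [
    { parentType: "User", field: "name" },
    {
      parentType: "User",
      field: "orders",
      listSize: 20,
      children: [{ parentType: "Order", field: "total" }],
    },
  ],
};
const sampleCost = estimateSelectionCost(sampleSelection, sampleCosts); // 1 + (1 + (5 + 20 * 2)) = 47
```

The list multiplier is what makes a shallow query over a large list more expensive than a deep query over single objects.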

Appendix G — Auth Patterns

- Edge: verify JWT; attach claims to context; reject expired/invalid
- Router: propagate headers; optional field-level policies via directives
- Subgraph: enforce domain authorization; never rely solely on router checks

Appendix H — Multi-Tenancy

- tenantId in context; row-level filters in subgraphs
- Cache segmentation by tenant; rate limits per tenant and client

Appendix I — Schema Contracts and Variants

- Base supergraph for internal; contract variants for mobile/web/partners
- Hide unstable fields; deprecation windows tracked via usage metrics

Appendix J — Consistency and Latency

- Eventual consistency across subgraphs; avoid cross-service transactions
- Use entity views; subscribe or poll for refresh; expose updatedAt

Appendix K — Subscriptions and Real-Time

# Router passes through websocket; subgraphs stream events
type Subscription {
  orderUpdated(id: ID!): OrderEvent!
}

- Prefer server push for high-value events; throttle rate; backpressure

Appendix L — Defer/Stream and Batching

# @defer is valid on fragment spreads and inline fragments, not on operations or fields
query ProductPage {
  product(id: "p1") {
    id
    name
    ... @defer {
      reviews { body rating }
    }
  }
}

- Stream lists; defer expensive fields; reduce TTFB
- Batch subgraph calls with Dataloaders; coalesce per-request

Appendix M — Retries, Timeouts, and Circuit Breaking

- Subgraph HTTP timeouts < router overall timeout; retry only idempotent operations (queries), never mutations
- Circuit break noisy subgraphs; partial responses with error annotations
- SLOs: p95 < 200ms per subgraph; error rate < 1%

Templates

# Kubernetes deployment for router (sketch; image and env names are placeholders)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: router
spec:
  replicas: 4
  selector:
    matchLabels: { app: router }
  template:
    metadata:
      labels: { app: router }
    spec:
      containers:
        - name: router
          image: ghcr.io/org/router:1.0.0
          ports: [{ containerPort: 4000 }]
          env:
            - { name: OTEL_EXPORTER_OTLP_ENDPOINT, value: "http://otel:4317" }
            - { name: SUPERGRAPH_FILE, value: /etc/supergraph.graphql }

Appendix N — Federation v2 Directives and Patterns

# Common directives: @key, @requires, @provides, @shareable, @inaccessible
# Example (sellerId is an illustrative field owned by another subgraph)
type Product @key(fields: "id") {
  id: ID!
  name: String @shareable
  sellerId: ID @external
  seller: Seller @requires(fields: "sellerId")
}

- Prefer @key on natural identifiers when stable; otherwise synthetic IDs
- Use @requires to pull fields owned by another subgraph that a local resolver needs as input
- Limit @provides to clear, stable contracts; avoid tight coupling

Appendix O — Entity Design

- One owning subgraph per entity; others extend only
- Keep entities small; expose views for heavy aggregates
- Avoid cyclic ownership; model via references and views

Appendix P — Router Resilience

- Timeouts per subgraph; overall request time budget
- Circuit breakers on error spikes; partial responses with error nodes
- Backoff retries only for idempotent fetches; never for mutations
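
A circuit breaker for a subgraph can be modelled as a three-state machine (closed, open, half-open). The sketch below is one minimal form with an injectable clock; the threshold and cooldown values a caller passes in are assumptions to tune per subgraph.

```typescript
// Sketch: a minimal circuit breaker with an injectable clock.
type BreakerState = "closed" | "open" | "half-open";

function createBreaker(
  failureThreshold: number,
  cooldownMs: number,
  now: () => number = Date.now,
) {
  let failures = 0;
  let openedAt = 0;
  let state: BreakerState = "closed";

  return {
    // Returns true if a call may be attempted right now.
    allow(): boolean {
      if (state === "open" && now() - openedAt >= cooldownMs) state = "half-open";
      return state !== "open";
    },
    recordSuccess(): void {
      failures = 0;
      state = "closed";
    },
    recordFailure(): void {
      failures += 1;
      // A failed probe in half-open state re-opens immediately.
      if (state === "half-open" || failures >= failureThreshold) {
        state = "open";
        openedAt = now();
      }
    },
    currentState: (): BreakerState => state,
  };
}
```

While the breaker is open, the router can return partial data with an error node for the affected fields instead of waiting on a timeout.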

Appendix Q — Caching and CDN Strategy

- Persisted GET only at edge; Vary by auth/tenant
- Router result cache keyed by operation + variables + viewer scope
- Use cache hints; private scope for user data; short TTLs for hot paths
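
A result-cache key built from operation, variables, and viewer scope might look like the sketch below. The key layout and the ViewerScope shape are illustrative assumptions; the important properties are that equivalent variables produce the same key and that private data never shares a key across viewers or tenants.

```typescript
// Sketch: result-cache key from operation + variables + viewer scope.
type ViewerScope =
  | { kind: "public" }
  | { kind: "private"; viewerId: string; tenantId: string };

// Serialize with sorted object keys so equivalent variables produce the same string.
function stableStringify(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${body.join(",")}}`;
  }
  return JSON.stringify(value);
}

function resultCacheKey(
  operationId: string,
  variables: Record<string, unknown>,
  scope: ViewerScope,
): string {
  const scopePart =
    scope.kind === "public"
      ? "public"
      : `tenant:${scope.tenantId}:viewer:${scope.viewerId}`;
  return [operationId, stableStringify(variables), scopePart].join("|");
}
```

Keying on a persisted operation ID rather than raw query text keeps keys short and stable across client formatting differences.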

Appendix R — Persisted Queries Enforcement

- Operation registry with safelist; block unknown hashes in prod
- Deployment gates: router loads only approved op set per client version

Appendix S — Auth and Tenancy Patterns

- Edge verifies JWT; attach tenantId, roles
- Subgraphs enforce domain auth; row filters by tenantId
- Field-level directives for sensitive data; audit access

Appendix T — Subscriptions, Defer, and Stream

- Subscriptions for high-value events; throttle and backpressure
- Defer/Stream: reduce TTFB by streaming non-critical fields/lists
- Clients must handle incremental payloads robustly

Appendix U — Security

- Disable introspection for public clients; enforce query cost/depth
- Sanitize errors; redact variables in logs; input validation
- SSRF prevention in subgraphs; outbound egress allowlists

Appendix V — Observability Deep Dive

- Trace IDs from edge; router span names = operation:client
- Attributes: subgraph, planStep, costEstimate, cacheHit
- Metrics: planner_miss, fanout, subgraph_p95, error_rate, cache_hit_rate

Appendix W — CI/CD Workflows

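name: federation
on:
  pull_request:
  push:
    branches: [main]
jobs:
  check-subgraph:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run lint:schema && npm run test
  publish-schema:
    if: github.ref == 'refs/heads/main'
    needs: check-subgraph
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes rover is installed and APOLLO_KEY is provided via repository secrets
      - run: npx rover subgraph publish org@current --name users --schema ./schema.graphql
  compose-and-canary:
    if: github.ref == 'refs/heads/main'
    needs: publish-schema
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # supergraph.yaml (assumed to exist) lists the subgraphs to compose
      - run: npx rover supergraph compose --config ./supergraph.yaml > supergraph.graphql
      - run: ./deploy_router_canary.sh supergraph.graphql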

Appendix X — Testing Strategy Details

- Router integration: mock subgraphs via HTTP fixtures
- Subgraph integration: real DB containers; dataloaders; auth hooks
- Contract: example ops validated against composed graph in CI
- Load tests: persisted ops with realistic variable distributions

Appendix Y — Migration Playbooks

- Monolith → subgraphs: extract entity at a time; proxy unknown fields to monolith
- REST → graph: facade subgraph translating to REST; deprecate endpoints gradually

Appendix Z — Examples

# total is computed in this subgraph from currency, which another subgraph owns
extend type Order @key(fields: "id") {
  id: ID! @external
  currency: String @external
  total: Money @requires(fields: "currency")
}

Appendix AA — Gateway Resiliency Patterns

- Timeouts per subgraph; hedged requests sparingly; circuit breakers
- Partial responses for non-critical fields; user-facing fallbacks
- Bulkhead isolation: limit concurrency per subgraph
- Brownout mode: omit expensive fields under load via @defer/@stream

Appendix AB — Request Shaping and Query Hints

- Encourage clients to request minimal sets; provide profiles (lite/full)
- Use named fragments to standardize shapes; registry validates usage

Appendix AC — CDN Integration

- Only persisted GET at edge; POST → router; vary headers: auth, tenant, locale
- Stale-while-revalidate for semi-static fields; purge on mutations

Appendix AD — Planner Optimization

- Coalesce entity fetches; prefer fewer hops; align @requires with data model
- Avoid cross-subgraph fanout on hot paths; denormalize via @provides where safe

Appendix AE — Error Taxonomy

- USER_INPUT, AUTH, PERMISSION, NOT_FOUND, RATE_LIMIT
- SYSTEM, TIMEOUT, UPSTREAM, PARTIAL_DATA, UNKNOWN
- Map consistently at router; reserve extensions for machine handling
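
A consistent mapping at the router could be as small as the function below. The input shape (HTTP status plus a Node-style error code) and the specific status-to-code choices are illustrative assumptions; only the code names come from the taxonomy above.

```typescript
// Sketch: map raw failures onto the taxonomy above.
type ErrorCode =
  | "USER_INPUT" | "AUTH" | "PERMISSION" | "NOT_FOUND" | "RATE_LIMIT"
  | "SYSTEM" | "TIMEOUT" | "UPSTREAM" | "PARTIAL_DATA" | "UNKNOWN";

type RawFailure = { status?: number; code?: string };

function classifyFailure(failure: RawFailure): ErrorCode {
  if (failure.code === "ECONNABORTED" || failure.code === "ETIMEDOUT") return "TIMEOUT";
  switch (failure.status) {
    case 400: return "USER_INPUT";
    case 401: return "AUTH";
    case 403: return "PERMISSION";
    case 404: return "NOT_FOUND";
    case 429: return "RATE_LIMIT";
    case 502:
    case 503: return "UPSTREAM";
    case 504: return "TIMEOUT";
  }
  if (failure.status !== undefined && failure.status >= 500) return "SYSTEM";
  return "UNKNOWN";
}

// Machine-readable details go in extensions; message stays human-readable.
function toGraphQLError(message: string, failure: RawFailure, subgraph: string) {
  return { message, extensions: { code: classifyFailure(failure), subgraph } };
}
```

Clients can then branch on extensions.code rather than parsing message strings.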

Appendix AF — Security and Privacy

- PII minimization; access logs redacted; privacy review for new fields
- Threat model: query abuse, caching leaks, introspection misuse

Appendix AG — Subgraph Boundaries

- Owners maintain single source of truth; avoid duplicate business logic
- Shared libraries for auth/context only; no cross-domain data coupling

Appendix AH — Contracts and Variants

- Variant per client/app version; hide unstable fields
- Deprecate with usage telemetry; remove after the deprecation window

Appendix AI — Operational Budgets

- Router p95 < 120ms; planner miss rate < 1%; cache hit > 60%
- Subgraph p95 < 200ms; error < 1%; fanout < 5 per op median

Appendix AJ — Multi-Region and Failover

- Global router with geo routing; local subgraphs per region
- Cross-region failover for reads; write affinity home region

Appendix AK — DataLoader Playbook

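// Key by viewer scope when necessary to avoid cache bleed; toViewerScopedKey and batch are app-provided helpers
import DataLoader from 'dataloader';
const byId = new DataLoader((ids: readonly string[]) => batch(ids), { cacheKeyFn: toViewerScopedKey });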

Appendix AL — Persisted Operations Rollout

- Phase 1: log-only unknown ops; Phase 2: warn; Phase 3: block
- Emergency allowlist for break-glass ops with expiry
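
The three phases and the break-glass allowlist can be captured in one decision function. The operation IDs, mode names, and note strings below are illustrative assumptions.

```typescript
// Sketch: phased enforcement of a persisted-operation safelist.
type EnforcementMode = "log" | "warn" | "block";
type Decision = { allowed: boolean; note?: string };

const safelist = new Set<string>(["op_getCart_v3", "op_checkout_v7"]);
// Break-glass entries map an operation id to an expiry timestamp in ms.
const emergencyAllow = new Map<string, number>();

function decide(operationId: string, mode: EnforcementMode, nowMs: number): Decision {
  if (safelist.has(operationId)) return { allowed: true };
  const expiry = emergencyAllow.get(operationId);
  if (expiry !== undefined && nowMs < expiry) {
    return { allowed: true, note: "emergency allowlist" };
  }
  if (mode === "block") return { allowed: false, note: "unknown operation" };
  // Phases 1 and 2 let the request through but leave a trail.
  return {
    allowed: true,
    note: mode === "warn" ? "unknown operation (warning)" : "unknown operation (logged)",
  };
}
```

Because emergency entries carry an expiry, a break-glass exception cannot silently become permanent.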

Appendix AM — Complexity Budgeting

- Per-tenant budgets; VIP tiers; adjust by concurrency and historical usage
- Dynamic budgets during incidents to shed load safely

Appendix AN — Subscriptions Topology

- Shared websocket layer; subgraphs via event bus; backpressure and limits

Appendix AO — Defer/Stream UX

- Show skeleton for deferred parts; stabilize layout to avoid CLS

Appendix AP — Observability: PromQL

# Router error rate
sum(rate(router_requests_total{status=~"5.."}[5m])) / sum(rate(router_requests_total[5m]))

# Subgraph p95
histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph))

# Planner miss
sum(rate(router_planner_miss_total[5m]))

Appendix AQ — Grafana Dashboard (Sketch JSON)

{
  "title": "Supergraph Overview",
  "panels": [
    {"type":"stat","title":"Router p95","targets":[{"expr":"histogram_quantile(0.95, sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
  ]
}

Appendix AR — Alert Library (YAML)

- alert: RouterHighErrorRate
  expr: (sum(rate(router_requests_total{status=~"5.."}[5m])))/(sum(rate(router_requests_total[5m]))+1e-9) > 0.02
  for: 10m
- alert: SubgraphLatencyP95High
  expr: histogram_quantile(0.95, sum(rate(subgraph_request_duration_seconds_bucket[5m])) by (le, subgraph)) > 0.300
  for: 10m

Appendix AS — CI Gates

- composition: fail on breaking changes and invalid references
- contracts: block removal of contracted fields used by clients
- persisted ops: require registry publish before router deploy

Appendix AT — Example Repos

repos/
  users-subgraph/
  orders-subgraph/
  catalog-subgraph/
  supergraph-router/
  contracts/

Appendix AU — Case Study: Checkout

- Query fetches cart (catalog), user (users), prices/tax (pricing), inventory
- Defer reviews and recommendations
- Brownout: drop recs under load; preserve checkout core path

Appendix AV — Error Handling Examples

// Router error mapping
if (err.code === 'ECONNABORTED') classify('TIMEOUT');

Appendix AW — Pagination and Connections

type Connection { edges: [Edge!]!, pageInfo: PageInfo! }
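
To make the connection shape concrete, here is a sketch of cursor pagination over an in-memory list. Cursors here are hex-encoded positions purely for illustration; real cursors usually encode a stable sort key so they survive inserts and deletes.

```typescript
// Sketch: connection-style pagination over an in-memory list.
type PageInfo = { hasNextPage: boolean; endCursor: string | null };
type Edge<T> = { node: T; cursor: string };
type Connection<T> = { edges: Edge<T>[]; pageInfo: PageInfo };

// Clients should treat cursors as opaque strings.
const encodeCursor = (index: number): string => `c${index.toString(16)}`;
const decodeCursor = (cursor: string): number => parseInt(cursor.slice(1), 16);

function paginate<T>(items: T[], first: number, after?: string): Connection<T> {
  const start = after ? decodeCursor(after) + 1 : 0;
  const slice = items.slice(start, start + first);
  const edges = slice.map((node, i) => ({ node, cursor: encodeCursor(start + i) }));
  return {
    edges,
    pageInfo: {
      hasNextPage: start + slice.length < items.length,
      endCursor: edges.length > 0 ? edges[edges.length - 1].cursor : null,
    },
  };
}
```

A client pages forward by passing the previous page's endCursor as after until hasNextPage is false.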

Appendix AX — Federation with Mesh/Mercurius

- Mesh to unify REST/GraphQL/gRPC into graph; use as subgraph source
- Mercurius federation plugins for Node-based subgraphs; confirm which federation spec version they support before relying on v2 directives

Appendix AY — Security: SSRF, RCE, Injection

- Strict outbound allowlists; sanitize URLs; avoid dynamic eval; validate inputs

Appendix AZ — Runbooks

- Spike in 5xx: identify subgraph; circuit break; reduce query budgets; rollback
- Planner miss surge: warm caches; review op churn; increase plan cache size
- Cache stampede: introduce jitter TTL; precompute hot paths

Mega FAQ (continued)

  1. Do we encrypt at field level?
    For highly sensitive fields; store ciphertext; decrypt at viewer edge.

  2. Should we block introspection?
    For public clients, yes; allow for trusted admin behind auth.

  3. How to keep router hot?
    Warm plan and result caches; pin CPU; avoid GC pauses; keep binaries slim.

  4. Why do we see N+1 at router?
    It’s subgraph N+1 surfacing; fix with dataloaders and batch APIs.

  5. Pagination best practice?
    Connections with opaque cursors; avoid offset for large datasets.

  6. Can we stream errors?
    Use incremental delivery; attach errors to path; keep UX stable.

  7. How to phase out a subgraph?
    Stop extensions; move ownership; mark deprecated; remove after usage=0.

  8. Query cost vs depth?
    Use cost map; depth alone is insufficient.

  9. Multi-tenant guardrails?
    Per-tenant budgets, rate limits, caches; audit access.

  10. Final: prefer clarity, constrain costs, measure constantly.


Appendix BA — Client Guidance and Operation Hygiene

- Use named operations and fragments; include client name/version headers
- Prefer persisted queries; avoid ad-hoc POST in production
- Request minimal fields; defer non-critical sections; paginate large lists
- Consider offline caching and stale-while-revalidate patterns

Appendix BB — Router Horizontal Scaling

- Stateless router replicas behind L4; sticky only for websockets if needed
- Preload supergraph; watch registry for changes; SIGHUP or hot-reload
- Limit max concurrent per replica; tune threadpool/event-loop

Appendix BC — Supergraph Lifecycle

- Draft composition → canary → 25% → 100%; abort on error budget burn
- Rollback by pinning previous supergraph; invalidate router caches
- Track composition ID, commit SHAs, owner, change ticket

Appendix BD — Mutations and Side-Effects

- Keep mutations in owning subgraph; avoid multi-subgraph transactional semantics
- Emit domain events; eventual consistency for projections
- Idempotency keys for retries; return stable mutation payload shapes
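
Idempotency keys make mutation retries safe: a replay with the same key returns the original payload instead of repeating the side effect. The sketch below uses an in-memory Map as the key store, which is an assumption for brevity; a real deployment would need a shared store with expiry.

```typescript
// Sketch: idempotent mutation handling with an in-memory key store.
type OrderResult = { orderId: string; status: "created" };

const completedByKey = new Map<string, OrderResult>();
let nextOrderNumber = 1;

function createOrder(idempotencyKey: string): OrderResult {
  const previous = completedByKey.get(idempotencyKey);
  // Retry with a known key: return the original payload, no new side effect.
  if (previous) return previous;
  const result: OrderResult = { orderId: `order-${nextOrderNumber++}`, status: "created" };
  completedByKey.set(idempotencyKey, result);
  return result;
}
```

Returning the same payload shape on replay is what lets clients treat a retried mutation exactly like the first attempt.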

Appendix BE — File Uploads and Binary Data

- Prefer signed URLs via REST; graph returns metadata and URLs
- Limit upload size; virus/malware scanning; audit trails

Appendix BF — Partial Data UX

- Render available fields with skeletons; show inline warnings for gaps
- Avoid blocking primary flows on optional subgraph outages

Appendix BG — Schema Style Guide

- Nouns for types, verbs for mutations; consistent pagination
- Use ISO-8601 timestamps; Money type with currency; IDs opaque
- Avoid leaking storage/DB fields; present domain-centric names

Appendix BH — Federation with Legacy Backends

- Subgraph as facade: translate to REST/gRPC/DB; cache hot lookups
- Stabilize latency with bulk endpoints; prefer batched loaders

Appendix BI — Error Extensions Contract

{
  "errors": [
    {
      "message": "Not authorized",
      "extensions": {
        "code": "PERMISSION",
        "subgraph": "orders",
        "requestId": "..."
      },
      "path": ["order", "total"]
    }
  ]
}

Appendix BJ — CORS and Edge Security

- Strict origins; credentials only when needed; same-site cookies preferred
- HSTS, CSP; block mixed content; validate referer/origin on sensitive ops

Appendix BK — Subgraph Read Models

- Pre-compute aggregates; materialized views; denormalize sparingly
- Keep entity ownership; expose read-friendly fields

Appendix BL — Safe Defaults

- Limit depth to 10; cost budget per op; request timeout 3s
- Disable schema introspection for public clients; enable for admin

Appendix BM — Canary Recipes

- Shadow traffic to new subgraph version; compare error/latency
- Gradual router rollout of new supergraph; per-client cohort testing

Appendix BN — Edge Authorization Tokens

- Short-lived JWTs; rotate keys; anti-replay; audience and scope validated

Appendix BO — Latency Budgets

- Router < 120ms p95; subgraph < 200ms p95 each; cumulative budget by path
- Automatic brownouts when over budget: defer/omit optional fields

Appendix BP — SDL Lint Rules

- Disallow null list elements; consistent naming; no dangling types
- Require descriptions on public fields; deprecations include rationale

Appendix BQ — Plan Cache Warmers

- Periodically execute top persisted queries to warm plan/result caches
- Bust on composition changes; record hit/miss metrics

Appendix BR — Multi-Tenant Abstractions

- Tenant-aware dataloaders; tenant-scoped caches; tenant quotas and limits

Appendix BS — Backpressure and Load Shedding

- Queue caps; 429 with Retry-After; degrade non-critical fields

Appendix BT — Secrets and Config

- Router and subgraphs via secure env vars or secrets; no secrets in SDL
- Rotate credentials regularly; audit access

Appendix BU — Infra Templates (Kubernetes)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  minReplicas: 4
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }

Appendix BV — Gateway Feature Flags

- Enable/disable features by client or cohort; test in prod safely

Appendix BW — Data Privacy Regions

- Route EU users to EU subgraphs; ensure data residency with policy

Appendix BX — Operator Runbooks

- Router CPU spike → check plan miss, request surge, cache stampede
- Subgraph 5xx spike → circuit break, page owner, roll back latest change

Appendix BY — Docs and Developer Experience

- Self-serve SDL portal; examples; persisted op explorer; performance tips

Appendix BZ — Cost Controls

- Cache hot reads; batch; avoid over-fetch; suppress heavy optional fields

Observability: Panel JSON (Sketch)

{
  "title": "Federation Health",
  "panels": [
    {"type":"stat","title":"Router p95"},
    {"type":"table","title":"Subgraph 5xx by svc"},
    {"type":"graph","title":"Plan cache hit"}
  ]
}

Policies (Pseudo-OPA)

package graph.policy

violation["no_latest_images"] {
  input.kind == "Deployment"
  some c
  c := input.spec.template.spec.containers[_]
  endswith(c.image, ":latest")
}

Templates: Persisted Ops Enforcement (Edge)

location /graphql {
  # Persisted queries arrive as GET; reject every other method
  if ($request_method != GET) { return 405; }
  proxy_pass http://router;
}

Case Study — Profile Page

- User (users), purchases (orders), recommendations (recs)
- Defer recs; cache user; stream purchases

Mega FAQ (continued)

  1. Do we allow arbitrary queries in prod?
    No—persisted queries only for public clients; log/deny unknown.

  2. How big should a subgraph be?
    Team-sized, domain-aligned; avoid too fine-grained fragmentation.

  3. Who owns entity keys?
    The entity’s owning subgraph; others must not redefine keys.

  4. Can we have cross-DB transactions?
    Avoid; use events and compensation.

  5. Should we expose IDs or slugs?
    Opaque IDs for references; slugs as fields where useful.


Mega FAQ (continued)

  1. Why are depth limits insufficient?
    Low-depth fields can be expensive; use cost maps.

  2. Can router rewrite queries?
    Prefer not; only for technical normalization; never change semantics.

  3. Where to authorize?
    At edge for identity, in subgraphs for domain constraints.

  4. Can we cache POST?
    Not at CDN; router internal caches okay with safe keys.

  5. Should we enable introspection?
    For trusted tools; disable for public paths.


Mega FAQ (continued)

  1. How to sunset a field?
    Deprecate, track usage, communicate, remove after window.

  2. When to use @provides?
    When one subgraph can reliably supply a field owned by another with clear contract.

  3. Handle heavy lists?
    Paginate, limit, and stream; precompute where possible.

  4. Why do plan cache misses occur?
    New ops or variable shapes; warm caches; standardize client fragments.


Mega FAQ (continued)

  1. Should we allow mutations across subgraphs?
    Keep in single owner; orchestrate via events.

  2. Rate limit at edge or router?
    Edge for coarse limits; router for per-op budgets.

  3. Can we run router at the edge?
    Yes with wasm/native binaries; test cold starts and limits.

  4. Final: keep graphs lean, plans hot, and costs bounded.


Appendix CA — Gateway Deployment Patterns

- Sidecar telemetry; init to fetch supergraph; read-only FS; non-root user
- Blue/green router rollout; health gates on p95 and error rate
- Multi-tenant router fleets for isolation and quota enforcement

Appendix CB — Supergraph Delivery

- Signed supergraph bundles; checksum verification; fallback to last-good
- Progressive rollout by region and client cohort

Appendix CC — Schema Lint Pack

- Require descriptions; ban ID-as-Int; enforce Money scalar usage
- Disallow nullable lists of non-nullable elements unless justified

Appendix CD — Router Warmup and Preload

- Preload top persisted ops; synthetic traffic warmers; plan cache hydrate

Appendix CE — Hot Paths Catalog

- Identify top 20 operations by volume; publish budgets and owners
- Quarterly review of hot path latency and cost

Appendix CF — Error Budgets and Freeze

- Error budget burn rate > 2x for 60 minutes: freeze router and subgraph deploys; focus on recovery

Appendix CG — Redaction and Privacy Tests

- Unit tests: variables redacted; logs scrubbed; traces tagged without PII

Appendix CH — Client Contract Testing

- Validate example queries per client against composed graph; diff on changes

Appendix CI — Schema Change Classes

- Safe: new types; new optional fields and arguments (with defaults); new queries
- Risky: required input fields; enum removals; field removals
- Block: breaking changes unless the deprecation window has elapsed and usage is zero

Appendix CJ — Router Rate Limiting

- Token bucket per client and tenant; budgets per operation class
- 429 with Retry-After; degrade optional fields under pressure
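
A token bucket keyed by tenant and client can be sketched as follows. The clock is injectable for testing, and the capacity and refill rate a caller passes in are assumptions to tune; the retryAfterSec value is what would go into the Retry-After header on a 429.

```typescript
// Sketch: token bucket keyed by client and tenant, with an injectable clock.
type Bucket = { tokens: number; updatedAt: number };

function createLimiter(
  capacity: number,
  refillPerSec: number,
  now: () => number = Date.now,
) {
  const buckets = new Map<string, Bucket>();

  return function tryConsume(
    clientId: string,
    tenantId: string,
    cost = 1,
  ): { ok: boolean; retryAfterSec: number } {
    const key = `${tenantId}:${clientId}`;
    const t = now();
    const bucket = buckets.get(key) ?? { tokens: capacity, updatedAt: t };
    // Refill proportionally to elapsed time, capped at capacity.
    const refill = ((t - bucket.updatedAt) / 1000) * refillPerSec;
    bucket.tokens = Math.min(capacity, bucket.tokens + refill);
    bucket.updatedAt = t;
    buckets.set(key, bucket);

    if (bucket.tokens >= cost) {
      bucket.tokens -= cost;
      return { ok: true, retryAfterSec: 0 };
    }
    // Seconds until enough tokens accumulate for this request.
    return { ok: false, retryAfterSec: Math.ceil((cost - bucket.tokens) / refillPerSec) };
  };
}
```

Passing a per-operation cost instead of 1 lets expensive operation classes drain the bucket faster than cheap ones.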

Appendix CK — Subgraph SLOs and Ownership

- Each subgraph: latency/error SLOs; on-call; dashboards; runbooks

Appendix CL — Observability Fields

- route=operationName, clientName, variant, subgraph, planSteps, fanout

Appendix CM — Replay and Backfills

- Use idempotent loaders; replay cached persisted ops to warm downstream caches

Appendix CN — Edge Rules

- Block GraphQL playground in prod; CSP strict; CORS allowlist

Appendix CO — Capturing Usage

- Per-field usage; deprecation candidates; client impact analysis

Appendix CP — Threat Model Highlights

- Abuse via expensive queries; cache key leaks; SSRF through resolvers
- Mitigate with cost ceilings, redaction, outbound allowlists

Appendix CQ — Synthetic Monitors

- Canary persisted ops with thresholds; alert on regressions

Appendix CR — Chaos Scenarios

- Kill one subgraph; check brownout UX and partial data resilience
- Router rollout with bad bundle; ensure last-good fallback

Appendix CS — DR and Failover

- Router multi-region anycast; subgraphs active/active or active/passive
- Test failover quarterly with evidence

Appendix CT — Access Patterns

- Public vs private graphs; partner variants; admin ops separated

Appendix CU — Governance Dashboards

- Composition frequency; breaking change attempts; usage of deprecated fields

Appendix CV — Cost Dashboards

- Router CPU/mem per op; subgraph cost per k requests; cache ROI

Appendix CW — Developer Tooling

- CLI to generate persisted ops; schema diffs; usage reports per team

Appendix CX — Education

- Playbooks for defer/stream, pagination, caching, security

Appendix CY — Backward Compatibility Windows

- 60–90 days typical; faster with client feature flags and phased rollout

Appendix CZ — Final Principles

- Stable contracts, bounded costs, observable systems, and fast, safe delivery

Operations Runbooks (Extended)

Incident: Router 5xx spike
- Compare to client/version; check subgraph breakdown; circuit break worst
- Reduce budgets; enable brownout; roll back supergraph if composition changed

Incident: Cache stampede
- Increase TTL with jitter; warmers; protect backend with concurrency caps

Incident: Planner miss surge
- Identify new ops; enforce persisted; pre-warm; client guidance

Dashboards (Sketch JSON)

{
  "title": "GraphQL Router",
  "panels": [
    {"type":"stat","title":"p95","targets":[{"expr":"histogram_quantile(0.95,sum(rate(router_request_duration_seconds_bucket[5m])) by (le))"}]},
    {"type":"graph","title":"Cache Hit","targets":[{"expr":"sum(rate(router_cache_hit_total[5m]))/sum(rate(router_cache_total[5m]))"}]},
    {"type":"table","title":"Subgraph Errors","targets":[{"expr":"sum(rate(subgraph_requests_total{status=~'5..'}[5m])) by (subgraph)"}]}
  ]
}

Mega FAQ (continued)

  1. Can we auto-defer under load?
    Yes—brownout mode defers/omits optional fields based on budgets.

  2. Should we log variables?
    Redacted only; never PII; sample with care.

  3. Why do partial-data errors upset clients?
    Many clients treat any error as a total failure; educate teams, standardize UI patterns, and keep core flows intact.

  4. How to bound cost for partners?
    Contracts + budgets + rate limits + persisted-only.


Mega FAQ (continued)

  1. Do we pin digests for router image?
    Yes for prod; signed; provenance verified.

  2. Multi-cloud supergraph?
    Possible with registry sync and region-local subgraphs.

  3. How to find deprecation targets?
    Field usage telemetry sorted by zero-usage period.

  4. Are mutations cacheable?
    No; but mutation results can prime read caches.


Mega FAQ (continued)

  1. What breaks composition most?
    Key mismatches, conflicting type ownership, incompatible field nullability.

  2. How to test subscriptions at scale?
    Synthetic publishers; backpressure; soak tests; fanout metrics.

  3. Should we expose enums?
    Yes with caution; include UNKNOWN; version with care.

  4. Final: keep graphs clean, fast, and secure—optimize for maintainability.


Appendix DA — Supergraph Change Windows

- Define weekly windows for high-risk composition changes
- Auto-pause canary during incidents; require explicit resume

Appendix DB — Client Upgrade Strategy

- Contract variants per app version; deprecate with telemetry
- Feature flags map to fields; phased rollout by cohort

Appendix DC — Router Sandbox Mode

- Dry-run new supergraph in parallel; compare plans and latencies
- Toggle per-client cohort; no user impact during validation

Appendix DD — Subgraph Health Contracts

- Liveness: DB connectivity, cache reachability
- Readiness: migrations applied, warm caches, dependencies healthy

Appendix DE — Gateway Canary SLOs

- p95 latency within 10% of baseline; error rate delta < 0.5%
- Plan cache hit within 5% of baseline after 10 minutes
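
The canary thresholds above could be checked by a small gate like the following sketch. It assumes the error-rate and plan-cache thresholds are absolute deltas (percentage points) and that the plan-cache check only applies once the canary has run for 10 minutes; `GatewaySample` and `canaryViolations` are hypothetical names.

```typescript
// Hypothetical shape for metrics sampled from baseline and canary fleets.
interface GatewaySample {
  p95Ms: number;
  errorRate: number;        // fraction of requests, 0..1
  planCacheHitRate: number; // fraction of plans served from cache, 0..1
}

/** Return the names of violated SLOs; an empty list means the canary may proceed. */
function canaryViolations(
  baseline: GatewaySample,
  canary: GatewaySample,
  minutesSinceStart: number,
): string[] {
  const violations: string[] = [];
  // p95 latency must stay within 10% of baseline.
  if (canary.p95Ms > baseline.p95Ms * 1.1) violations.push("p95-latency");
  // Error rate delta must stay below 0.5 percentage points (assumed absolute).
  if (canary.errorRate - baseline.errorRate >= 0.005) violations.push("error-rate");
  // Plan cache needs warm-up time, so only judge it after 10 minutes.
  if (
    minutesSinceStart >= 10 &&
    baseline.planCacheHitRate - canary.planCacheHitRate > 0.05
  ) {
    violations.push("plan-cache-hit");
  }
  return violations;
}
```

A rollout controller could call this on each evaluation tick and pause or roll back when the list is non-empty.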

Appendix DF — Traffic Shaping

- Shift by client, region, or op-class; protect hot paths first
- Brownout overrides for optional fields; defer under load

Appendix DG — Subgraph Outage Playbook

- Circuit break; mark fields as deferred/omitted; communicate status
- Warm fallback caches where possible; page on-call and track MTTR
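
As an illustration of the circuit-break step, here is a small breaker with closed, open, and half-open states. It is a sketch under simple assumptions (consecutive-failure threshold, fixed cooldown, injectable clock); the `CircuitBreaker` class is hypothetical and not a router API.

```typescript
type BreakerState = "closed" | "open" | "half-open";

// Hypothetical breaker: opens after N consecutive failures, then allows
// trial traffic again once the cooldown has elapsed.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private cooldownMs: number;
  private now: () => number;

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  /** True if a request to the subgraph may be attempted. */
  allowRequest(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown over: let trial requests through
    }
    return this.state !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures++;
    // A failed trial request, or too many consecutive failures, opens the circuit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}
```

While `allowRequest()` returns false, the router can serve the affected fields from a fallback cache or mark them as omitted rather than waiting on timeouts.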

Appendix DH — Writer Isolation

- Route writes to home region; read replicas elsewhere; expose version stamps
- Avoid multi-subgraph transactions; orchestrate via events

Appendix DI — Cost Attribution

- Attribute router CPU/mem per operation and client
- Attribute subgraph cost by op fanout and latency; report to owners

Appendix DJ — SDL Conventions

- Use @deprecated with reason; include removal date
- Avoid leaking storage keys; opaque IDs with node lookup when needed

Appendix DK — Query Shape Catalog

- Catalog top shapes; provide named fragments; enforce via lints

Appendix DL — Subgraph Release Trains

- Batch subgraph releases weekly; align with supergraph compositions
- Reduce composition churn and planner cache misses

Appendix DM — Router Resource Profiles

- CPU-bound vs IO-bound; adjust threads; pre-alloc arenas; GC tuning

Appendix DN — Persistence and Caches

- Redis/memcached for router result cache; LRU; per-tenant partitions

Appendix DO — Schema Diff Bots

- PR bot annotates diffs, breaking risks, usage impact, owners to review

Appendix DP — SLA and Budgets Dashboard

{
  "title": "Supergraph SLOs",
  "panels": [
    {"type":"stat","title":"Router SLO"},
    {"type":"stat","title":"Subgraph Error Budget Burn"}
  ]
}

Appendix DQ — Access Reviews

- Quarterly review of client access, variants, and scopes
- Revoke stale keys; rotate credentials

Appendix DR — Secrets and Key Management

- JWKS rotation; HSM/KMS for signing; key expiry policies

Appendix DS — Regionalization and Residency

- Split subgraphs per region; forbid cross-region PII via policy

Appendix DT — Playground and Tooling

- Internal dev-only; pre-auth; rate limits; operation registry integration

Appendix DU — Observability Annotations

- Tag spans with op class (hotpath|bulk|admin), client tier, cost estimate

Appendix DV — Rollback Evidence

- Store comparison charts for before/after; attach to incident and PR

Appendix DW — Partner Integrations

- Contract variants; SLA per partner; sandbox environments; sample data

Appendix DX — Education Path

- 101 GraphQL; 201 Federation; 301 Observability and Cost; 401 Security

Appendix DY — Data Classifications

- PUBLIC, INTERNAL, CONFIDENTIAL, SECRET; field annotations; policy gates

Appendix DZ — Final Operating Principles

- Contracts first; costs bounded; caches warm; rollouts safe; evidence always

Extended Templates

# Money scalar
scalar Money

# Node interface
interface Node { id: ID! }

# Router Deployment hardened
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true

// Router middleware pseudo
function beforeResolve(ctx) {
  ctx.vars.requestStart = Date.now();
  ctx.vars.cost = estimateCost(ctx.operation);
  if (ctx.vars.cost > ctx.budget) throw new Error('COST_EXCEEDED');
}
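
The middleware above leaves `estimateCost` undefined. One possible shape, sketched here over a simplified selection tree rather than a real GraphQL AST, charges one unit per field and multiplies by the expected list size of each enclosing list. `SelectionField`, `listSize`, and `assertWithinBudget` are assumptions made for this example.

```typescript
// Simplified, hypothetical stand-in for a parsed selection set.
interface SelectionField {
  name: string;
  listSize?: number;            // expected page size for list fields (e.g. `first`)
  children?: SelectionField[];  // sub-selection, if any
}

/** Each field costs 1 per parent instance; list fields multiply their children. */
function estimateCost(fields: SelectionField[], multiplier: number = 1): number {
  let total = 0;
  for (const field of fields) {
    total += multiplier;
    if (field.children) {
      total += estimateCost(field.children, multiplier * (field.listSize ?? 1));
    }
  }
  return total;
}

/** Throw the same error code as the middleware when the estimate exceeds the budget. */
function assertWithinBudget(fields: SelectionField[], budget: number): number {
  const cost = estimateCost(fields);
  if (cost > budget) throw new Error("COST_EXCEEDED");
  return cost;
}

// Roughly: { user { name orders(first: 10) { id total } } }
const exampleOperation: SelectionField[] = [
  {
    name: "user",
    children: [
      { name: "name" },
      { name: "orders", listSize: 10, children: [{ name: "id" }, { name: "total" }] },
    ],
  },
];
```

For `exampleOperation` the estimate is 23: three fields at multiplier 1 plus two fields at multiplier 10. Real estimators usually also weight fields by resolver cost, which this sketch omits.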

Mega FAQ (2401–2600)

  1. Why do plan cache misses spike?
    Usually new ops or contract variants. Run warmers, reduce op churn, and guide clients toward shared fragments.

  2. Can we proxy REST as a subgraph?
    Yes via Mesh or facade; ensure batching and caching to avoid N+1.

  3. Is GraphQL suitable for all endpoints?
    Not binary uploads; keep those in REST; graph returns metadata.

  4. How to detach a subgraph?
    Move ownership; deprecate fields; drain traffic; remove from supergraph.


Mega FAQ (2601–2800)

  1. Should public clients have separate router?
    Prefer separate fleet with stricter policies and caching.

  2. How to test composition failures?
    Inject schema conflicts in staging; verify gates block deploys.

  3. Are enums safe for partners?
    Include UNKNOWN; evolve carefully; provide contracts per partner.

  4. When to paginate vs. window?
    Paginate user lists; window analytics; stream where appropriate.


Mega FAQ (2801–3000)

  1. Defer/stream impact on SEO?
    Server-render critical content; hydrate incrementals; use placeholders.

  2. Should we cap list sizes?
    Yes; hard caps; client-friendly errors; links to next pages.

  3. Is GraphQL cacheable at CDN?
    Persisted GET only; vary keys strictly; purge on mutations.

  4. Final: simplicity wins. Keep schemas clean, operations stable, and day-to-day ops tight.


Mega FAQ (3001–3200)

  1. Cost budgets per tier?
    Yes: free, pro, enterprise with increasing caps and support.

  2. Run router at edge?
    Possible; ensure plan cache warm, memory caps, and cold-start tests.

  3. Multi-tenant fairness?
    Weighted queues, per-tenant budgets, strict isolation for heavy tenants.

  4. Last word: observable, predictable, and efficient graphs at scale.


Appendix EA — Partner Sandbox and Throttles

- Separate variants and routers; synthetic data; strict budgets and persisted-only

Appendix EB — Budgeted Operations Catalog

- Define allowed ops per client tier; publish limits; auto-annotate violations

Appendix EC — Evidence Automation

- Attach dashboards, traces, composition diffs, and PRs to every rollout

Appendix ED — Education and Checklists

- Pre-merge: schema lint, composition ok, deprecations reviewed, cost within budget
- Pre-release: caches warm, canary plan, rollback steps, owners on-call

Appendix EE — Closing Notes

- Federate when teams are ready; keep contracts lean; measure and iterate

Final FAQ (3201–3400)

  1. Should we allow inline fragments everywhere?
    Yes, but encourage named fragments for reusability and cacheability.

  2. Is stitching REST APIs into the schema OK?
    Yes via Mesh/facades with batching; watch latency and error surfaces.

  3. Can we expose GraphiQL in prod?
    Not for public; internal behind auth and rate limits only.

  4. When to split a subgraph?
    When ownership or scaling diverges; avoid ping-ponging entities.

  5. Final: stable supergraph, fast router, efficient subgraphs—own your slice well.


Quick Reference

- Persisted ops only for public
- Bound costs and depths
- Warm plan/result caches
- Observe p95, error, fanout
- Defer/stream non-critical fields

Troubleshooting Index

- High p95: check subgraph; cache hit; plan miss; hot paths
- Many 5xx: classify errors; circuit break; rollback recent changes
- Composition fail: key conflicts; nullability; directives mismatch

Additional FAQ (3401–3600)

  1. How to handle client cache invalidation?
    Return entity version fields; use cache policies; purge on mutation.

  2. Fallback when a subgraph is slow?
    Defer optional fields; set timeouts; render partial data with notices.

  3. Do we expose node interface?
    When helpful for universal fetches; ensure opaque IDs.

  4. Final: contracts, costs, caches, and care.


Closing

Federation succeeds when domain ownership is clear, contracts are stable, and router/subgraphs are observable and resilient. Keep the supergraph lean, plan caches warm, and client operations disciplined.


Appendix EF — Gateway Cold Start Strategy

- Pre-bake router image with dependencies; lazy-load optional modules
- Warm plan/result caches after deploy using top persisted ops
- Keep supergraph bundle local with checksum verification

Appendix EG — Tenant Fairness and Quotas

- Weighted fair queue per tenant; enforce per-op and per-minute budgets
- Distinct queues for hot paths vs bulk analytics to avoid starvation
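
A weighted fair queue can be sketched with virtual finish times: each request is stamped with `start + cost / weight`, and the smallest stamp is served first. This is a simplified illustration (linear scan, in-memory, no per-minute budgets); `WeightedFairQueue` and its methods are hypothetical names.

```typescript
interface QueuedItem<T> {
  tenant: string;
  item: T;
  finish: number; // virtual finish time used for ordering
}

// Hypothetical weighted fair queue; unknown tenants get weight 1.
class WeightedFairQueue<T> {
  private items: QueuedItem<T>[] = [];
  private lastFinish: Record<string, number> = {};
  private virtualTime = 0;
  private weights: Record<string, number>;

  constructor(weights: Record<string, number>) {
    this.weights = weights;
  }

  enqueue(tenant: string, item: T, cost: number = 1): void {
    const weight = this.weights[tenant] ?? 1;
    // A tenant's next request starts after its previous one "finishes".
    const start = Math.max(this.virtualTime, this.lastFinish[tenant] ?? 0);
    const finish = start + cost / weight;
    this.lastFinish[tenant] = finish;
    this.items.push({ tenant, item, finish });
  }

  dequeue(): { tenant: string; item: T } | undefined {
    if (this.items.length === 0) return undefined;
    let best = 0;
    // Strict "<" keeps arrival order among equal finish times.
    for (let i = 1; i < this.items.length; i++) {
      if (this.items[i].finish < this.items[best].finish) best = i;
    }
    const [next] = this.items.splice(best, 1);
    this.virtualTime = next.finish;
    return { tenant: next.tenant, item: next.item };
  }
}
```

With weights `{ gold: 2, free: 1 }` and equal-cost requests, the gold tenant is served about twice as often while the free tenant still makes progress. Separate instances could back the hot-path and bulk-analytics queues.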

Appendix EH — Error Budgets per Path

- Track error budgets per operation class; freeze risky schema changes on burn
- Route suspicious traffic to lower-cost variants during incidents

Appendix EI — Subgraph Cache Contracts

- Define cache invariants for read models; TTLs; invalidation hooks on mutations
- Expose version fields and lastUpdated timestamps for client-side reconciliation

Appendix EJ — Router Memory Hygiene

- Cap result size; stream large lists; enforce max response bytes
- Tune alloc arenas; monitor GC pauses; pre-size buffers for hot paths

Appendix EK — Partner Governance

- Separate partner variants with strict schemas and cost ceilings
- Contract SLAs, deprecation calendars, and emergency kill switches

Appendix EL — Security Headers and TLS

- HSTS, CSP (report-only then enforce), COOP/COEP for isolation
- mTLS to subgraphs; pin CA; rotate certs automatically

Appendix EM — Router Feature Flags

- Gradually enable planner optimizations, caching strategies, and brownout rules
- Flags are persisted and audited; roll back instantly on regression

Appendix EN — Cost Visibility for Teams

- Dashboard cost per field and operation; show top offenders per team
- Quarterly reviews to reduce over-fetch and normalize shapes

Appendix EO — Data Residency Enforcement

- Policy denies field access outside region; annotate SDL with residency class
- Router selects region-local subgraphs; fallback only for public data
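
The routing rule above can be expressed as a small decision function. This sketch assumes two residency classes and a list of currently healthy regions; `ResidencyClass` and `pickSubgraphRegion` are hypothetical names, and a real policy engine would carry more context.

```typescript
type ResidencyClass = "public" | "restricted";

/**
 * Choose the region whose subgraph should serve a field, or null to deny.
 * Region-local is always preferred; cross-region fallback is for public data only.
 */
function pickSubgraphRegion(
  residency: ResidencyClass,
  routerRegion: string,
  healthyRegions: string[],
): string | null {
  if (healthyRegions.indexOf(routerRegion) !== -1) return routerRegion;
  if (residency === "public" && healthyRegions.length > 0) return healthyRegions[0];
  return null; // restricted data never leaves its region
}
```

A null result would map to an omitted field or a policy error, depending on how the contract defines residency failures.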

Appendix EP — SDL Documentation Automation

- Generate docs per variant; include usage charts and deprecation timelines
- Link to persisted operations catalog and client examples

Appendix EQ — Blue/Green Subgraphs

- Run v1 and v2 side-by-side; router targets cohort; compare KPIs
- Drain old after success; archive metrics and composition snapshot
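
Cohort targeting is often done with a stable hash of the client ID so the same client always lands on the same version. The sketch below uses an FNV-1a style hash over UTF-16 code units (a simplification); `cohortBucket` and `pickSubgraphVersion` are hypothetical names.

```typescript
/** Map a client ID to a stable bucket in 0..99. */
function cohortBucket(clientId: string): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < clientId.length; i++) {
    hash ^= clientId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV 32-bit prime, kept in 32 bits
  }
  return (hash >>> 0) % 100;
}

/** Send roughly v2Percent of clients to v2; everyone else stays on v1. */
function pickSubgraphVersion(clientId: string, v2Percent: number): "v1" | "v2" {
  return cohortBucket(clientId) < v2Percent ? "v2" : "v1";
}
```

Raising `v2Percent` only moves clients from v1 to v2, never back, which keeps KPI comparisons between cohorts stable during the rollout.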

Appendix ER — Incident Drill Catalog

- Subgraph timeout storm; router cache miss spike; partner abuse burst
- Each drill has playbook, metrics targets, and evidence bundle template

Appendix ES — Access Tokens and JWKS

- Rotate signing keys; cache JWKS with short TTL; fail closed on mismatch
- Support key rollover with overlapping validity windows
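
Key rollover with overlapping windows might look like the following selection step, run before signature verification. The `notBefore`/`notAfter` fields are validity metadata assumed to be tracked by the service (they are not standard JWK members), and `selectVerificationKey` is a hypothetical helper; signature checking itself is out of scope here.

```typescript
// Hypothetical key record; times are epoch milliseconds.
interface SigningKey {
  kid: string;
  notBefore: number;
  notAfter: number;
}

/** Pick the key named by the token's `kid`, failing closed on any mismatch. */
function selectVerificationKey(
  keys: SigningKey[],
  kid: string | undefined,
  nowMs: number,
): SigningKey {
  if (!kid) throw new Error("MISSING_KID");
  const match = keys.find((k) => k.kid === kid);
  if (!match) throw new Error("UNKNOWN_KID");
  if (nowMs < match.notBefore || nowMs > match.notAfter) {
    throw new Error("KEY_NOT_VALID");
  }
  return match;
}
```

During rollover both the old and new key appear in the set with overlapping windows, so tokens signed just before the switch still verify until the old key expires.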

Appendix ET — Plan Cache Engineering

- Keyed by signature and stable variable shapes; bucketized for memory caps
- Evict LRU; protect hot entries; export hit/miss and evictions
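
A minimal LRU plan cache with exported counters can be built on `Map`, which iterates in insertion order. This sketch assumes a capacity of at least 1 and leaves out hot-entry protection and memory bucketing; `PlanCache` is a hypothetical class, not a router API.

```typescript
// Hypothetical LRU cache keyed by operation signature.
class PlanCache<V> {
  hits = 0;
  misses = 0;
  evictions = 0;
  private entries = new Map<string, V>();
  private capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity; // assumed >= 1
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) {
      this.misses++;
      return undefined;
    }
    const value = this.entries.get(key) as V;
    // Re-insert to mark this entry as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    this.hits++;
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.capacity) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
        this.evictions++;
      }
    }
    this.entries.set(key, value);
  }
}
```

The three counters map directly onto the hit, miss, and eviction metrics mentioned above and can be scraped or logged on an interval.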

Appendix EU — Result Cache Engineering

- Respect cache-control hints; per-tenant segments; compression on
- Invalidate on mutation topics; jitter TTLs to avoid herds

Appendix EV — Subgraph Sandboxes

- Ephemeral envs for schema experiments; compose against staging supergraph
- Synthetic data sets; abuse testing; performance benchmarks

Appendix EW — Ops CLI

- Commands: diff-supergraph, warm-caches, dump-hotpaths, block-op, lift-freeze

Appendix EX — Education Tracks

- Clients: operation hygiene and caching; Subgraphs: data loaders and auth
- Operators: observability deep dive; Security: privacy and attack surfaces

Appendix EY — Privacy Reviews

- Checklist per new field: classification, retention, residency, purpose
- Automated scanning to flag risky names and patterns (e.g., ssn, email)
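
The automated scan could start as a simple name matcher run in CI over new schema fields. Only `ssn` and `email` come from the checklist above; the other patterns and the function name `flagRiskyFields` are illustrative assumptions.

```typescript
// Patterns are illustrative; extend them to match your own classification policy.
const RISKY_NAME_PATTERNS: RegExp[] = [/ssn/i, /email/i, /passport/i, /birth/i];

/** Return the field names that match any risky pattern, in input order. */
function flagRiskyFields(fieldNames: string[]): string[] {
  return fieldNames.filter((name) =>
    RISKY_NAME_PATTERNS.some((pattern) => pattern.test(name)),
  );
}
```

A match should trigger a review rather than block the merge outright, since name matching produces false positives and misses fields with innocuous names.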

Appendix EZ — Final Practices

- Own domains; keep contracts minimal; budget costs; measure relentlessly

Mega FAQ (3601–4000)

  1. Why do we still see timeouts with cached ops?
    Subgraph latency dominates; cache helps router but not slow sources.

  2. Should we allow multi-hop entity chains?
    Limit; deep chains explode latency. Flatten with views or denormalization.

  3. How to guard against schema sprawl?
    Lint packs, ownership, review gates, and usage-based pruning.

  4. Can clients pick variants dynamically?
    Yes via headers; enforce allowlists; migrate gradually.

  5. How to handle rogue clients?
    Block at edge; revoke keys; quarantine ops; notify owners.

  6. Router CPU spikes after deploy?
    Likely a cold plan cache. Run warmers, reduce composition churn, and profile GC.

  7. Are batched resolvers always better?
    Usually, but watch peak memory; cap batch sizes; stream results.

  8. Should we allow file uploads through graph?
    Prefer signed URLs via REST; keep graph metadata-only.

  9. Federation vs BFFs?
    Federation centralizes contracts; BFFs can coexist as clients of the graph.

  10. Final: clarity, constraints, and continuous care keep graphs healthy.
