Cloud Migration Strategies: Lift, Shift, and Refactor (2025)

Oct 26, 2025

Tags: cloud, migration, landing-zone, refactor

Migrations succeed with disciplined assessment, wave plans, and strong landing zones. This guide provides a pragmatic, risk-aware approach.

Executive summary

Inventory apps; score complexity and business criticality; create waves
Establish secure landing zones; network, identity, logging, guardrails
Mix strategies: rehost (fast), replatform (balanced), refactor (targeted)

Assessment

7Rs mapping; dependency analysis; readiness scorecards; exit risks

Landing zones

Baseline identity, network, policy, logging, cost controls; IaC-first

Migration factory

Repeatable pipelines; data sync/CDC; dress rehearsals; cutover playbooks

Refactor patterns

Strangler Fig; extract to managed services; event-driven decoupling

Success metrics

Cutover time, defect/leakage rates, performance, cost delta vs baseline

FAQ

Q: Rehost vs refactor?
A: Rehost for speed and simple stacks; refactor when benefits outweigh cost—often for critical, long-lived systems.

Executive Playbook

This guide provides a battle-tested migration blueprint: assessment → wave planning → landing zones → migration factory → cutovers → hardening → observability → cost and compliance.

1) Assessment and Portfolio Rationalization

1.1 Application Inventory Template

- App ID: APP-001
- Name: Payments API
- Owner: FinTech Platform
- Business Criticality: High
- RTO/RPO: 30m / 15m
- Dependencies: PostgreSQL, Redis, Kafka, S3
- Data Sensitivity: PII
- Peak TPS: 2,500
- Latency SLO: P95 < 150ms
- Compliance: PCI-DSS
- Current Footprint: 20 VMs, 4TB DB
- Migration Strategy: Replatform (managed DB, containers)

1.2 7Rs Mapping

- Rehost (Lift-and-Shift): Simple stateless services, legacy apps with minimal change window
- Replatform: Move to managed databases, containerize, adopt managed load balancers
- Refactor/Re-architect: Event-driven, microservices, serverless
- Repurchase: SaaS replacement for custom tools
- Retire: Decommission unused/low-value systems
- Retain: Keep on-prem for now due to constraints
- Relocate: Move VMware workloads as-is to VMware Cloud on AWS or Azure VMware Solution

1.3 Readiness Scorecard (Example)

app_id,area,score,notes
APP-001,Architecture,4,"12-factor partial, externalized config"
APP-001,Data,3,"Large DB, needs CDC"
APP-001,Ops,2,"Manual deploys, no IaC"
APP-001,Security,4,"Good baseline, secrets need rotation"
APP-001,Compliance,3,PCI scope; need evidence pipeline
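
One way to turn the scorecard into a single readiness number per application is to average the area scores. The sketch below assumes the notes column is quoted whenever it contains commas. The unweighted mean and the helper names (`readiness_by_app`, `weakest_area`) are illustrative choices, not part of any standard.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Sample rows mirroring the scorecard; notes are quoted because they contain commas.
SCORECARD = '''app_id,area,score,notes
APP-001,Architecture,4,"12-factor partial, externalized config"
APP-001,Data,3,"Large DB, needs CDC"
APP-001,Ops,2,"Manual deploys, no IaC"
APP-001,Security,4,"Good baseline, secrets need rotation"
APP-001,Compliance,3,PCI scope; need evidence pipeline
'''


def readiness_by_app(csv_text: str) -> dict[str, float]:
    """Return the unweighted mean score per app_id."""
    scores = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        scores[row["app_id"]].append(int(row["score"]))
    return {app: round(mean(vals), 2) for app, vals in scores.items()}


def weakest_area(csv_text: str, app_id: str) -> str:
    """Return the lowest-scoring area for one app (first one wins on ties)."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text)) if r["app_id"] == app_id]
    return min(rows, key=lambda r: int(r["score"]))["area"]
```

The weakest area is a reasonable place to focus remediation before the application is assigned to a wave.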

2) Wave Planning and Risk Controls

2.1 Wave Plan Structure

- Wave 0 (Foundations): Landing zones, baseline services, shared networking
- Wave 1 (Low Risk): Internal tools, non-critical read services
- Wave 2 (Medium Risk): External APIs behind feature flags
- Wave 3 (High Risk): Payments, auth, customer-facing apps

2.2 Cutover Windows and Freeze Policy

- Code Freeze: 48h before cutover for Wave 3
- Change Window: Saturday 00:00–06:00 UTC
- Rollback Plan: Switch traffic back to the old environment; fail the database back to the original primary

2.3 Go/No-Go Checklist (Excerpt)

- Health checks green across new environment
- Monitoring and alerts configured and tested
- Runbook printed; stakeholders on call
- Backups validated; restore tested in staging

3) Landing Zones (AWS / Azure / GCP)

3.1 Core Principles

- Identity-first: SSO, least privilege, role-based access
- Network segmentation: hub-spoke, private subnets, egress controls
- Logging/Monitoring: centralized, immutable storage
- Guardrails: policies as code, pre-commit and admission controls
- IaC: everything codified and reviewed

3.2 AWS Landing Zone (Terraform)

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  name    = "lz-vpc"
  cidr    = "10.0.0.0/16"
  azs     = ["us-east-1a","us-east-1b","us-east-1c"]
  private_subnets = ["10.0.1.0/24","10.0.2.0/24","10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24","10.0.102.0/24","10.0.103.0/24"]
  enable_nat_gateway = true
}

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  cluster_name    = "lz-eks"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnets
}

resource "aws_cloudtrail" "this" {
  name                          = "lz-trail"
  s3_bucket_name                = aws_s3_bucket.logs.bucket
  include_global_service_events = true
  is_multi_region_trail         = true
}

resource "aws_s3_bucket" "logs" { bucket = "lz-org-logs" }
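
Before applying a network layout like the one above, it can help to check that every subnet sits inside the VPC range and that no two subnets overlap. This is a minimal sketch using Python's standard `ipaddress` module; the function name `check_subnets` is hypothetical and the CIDRs are copied from the Terraform example.

```python
import ipaddress


def check_subnets(vpc_cidr: str, subnets: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = [ipaddress.ip_network(s) for s in subnets]
    problems = []
    # Every subnet must be contained in the VPC range.
    for net in nets:
        if not net.subnet_of(vpc):
            problems.append(f"{net} is outside {vpc}")
    # No pair of subnets may share addresses.
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                problems.append(f"{a} overlaps {b}")
    return problems


PRIVATE = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
PUBLIC = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
problems = check_subnets("10.0.0.0/16", PRIVATE + PUBLIC)
```

The same check can be run across landing zones (for example 10.0.0.0/16, 10.10.0.0/16, and 10.20.0.0/16) to catch overlaps before peering or hybrid connectivity is set up.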

3.3 Azure Landing Zone (Bicep)

param location string = 'eastus'
resource vnet 'Microsoft.Network/virtualNetworks@2022-11-01' = {
  name: 'lz-vnet'
  location: location
  properties: {
    addressSpace: { addressPrefixes: ['10.10.0.0/16'] }
    subnets: [
      { name: 'private-a', properties: { addressPrefix: '10.10.1.0/24' } }
      { name: 'public-a', properties: { addressPrefix: '10.10.101.0/24' } }
    ]
  }
}

3.4 GCP Landing Zone (Terraform)

module "vpc" {
  source  = "terraform-google-modules/network/google"
  version = "~> 7.0"
  project_id   = var.project_id
  network_name = "lz-vpc"
  subnets = [{ subnet_name = "private-a", subnet_ip = "10.20.1.0/24", subnet_region = "us-central1" }]
}

resource "google_logging_project_sink" "this" {
  name        = "lz-sink"
  destination = "storage.googleapis.com/${google_storage_bucket.logs.name}"
}

resource "google_storage_bucket" "logs" {
  name     = "lz-org-logs"
  location = "US"
}

4) Networking Patterns

- Hub-Spoke: centralized egress/ingress; spokes for workloads
- Private Link/Endpoints: connect to managed services privately
- Transit Gateway/Virtual WAN/Cloud Router: interconnect
- DNS Strategy: split-horizon, conditional forwarders
- Zero Trust: identity-aware proxies and policy

graph LR
  OnPrem((On-Prem))--IPSec-->Hub[Hub VPC/VNet]
  Hub--PrivateLink-->SaaS
  Hub--Peering-->Spoke1
  Hub--Peering-->Spoke2

5) Identity and Access Management

- SSO via IdP (Microsoft Entra ID, formerly Azure AD, or Okta); SCIM for provisioning
- Roles: platform-admin, app-operator, auditor
- Workload Identity: IRSA (AWS), Workload Identity (GKE), Managed Identity (Azure)
- Secrets: KMS/KeyVault/Cloud KMS; short-lived tokens

6) Logging, Monitoring, and Tracing

- Centralize logs (CloudWatch/Log Analytics/Cloud Logging) with retention policies
- Metrics: Prometheus-compatible exporters; SLO dashboards
- Tracing: OpenTelemetry; propagate trace-context end-to-end

# OTEL collector snippet
receivers:
  otlp:
    protocols: { http: {}, grpc: {} }
exporters:
  otlphttp: { endpoint: http://tempo:4318 }
service:
  pipelines:
    traces: { receivers: [otlp], exporters: [otlphttp] }

7) Guardrails and Policies as Code

# .ai-policy.yml (example)
blocked_tools:
  - public_s3_write
allowed_regions:
  - us-east-1
  - us-west-2

package elysiate.guardrails

violation[msg] {
  input.resource.type == "aws_s3_bucket"
  input.resource.acl == "public-read"
  msg := sprintf("Public S3 bucket forbidden: %s", [input.resource.name])
}

8) Migration Factory Pipeline

graph TD
  A[Intake] --> B[Assess/7Rs]
  B --> C[Landing Zone Ready]
  C --> D[Environment Provisioning]
  D --> E[Data Migration]
  E --> F[App Cutover]
  F --> G[Hardening/Optimization]
  G --> H[Operate/Improve]

# GitHub Actions skeleton
name: migration-factory
on:
  workflow_dispatch:
jobs:
  provision:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
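  data-migration:
    needs: provision
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/cdc_sync.sh
  cutover:
    needs: data-migration
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/cutover_dns.sh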

9) Data Migration: Strategies and Tools

9.1 CDC and Replication

# Debezium + Kafka + Target
connect-standalone worker.properties debezium-postgres.properties

- Validate: row counts, checksums, referential integrity
- Dual-writes during transition; idempotency keys
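
Idempotency keys are what make dual-writes and CDC replays safe: applying the same event twice must leave the target unchanged. The sketch below illustrates the idea with an in-memory store; `IdempotentStore` and `dual_write` are hypothetical names, and a real system would typically enforce the key with a unique constraint in the database.

```python
class IdempotentStore:
    """In-memory stand-in for a target table keyed by idempotency key."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}
        self.applied = 0
        self.skipped = 0

    def apply(self, idempotency_key: str, payload: dict) -> bool:
        """Apply a write once; replays with the same key are ignored."""
        if idempotency_key in self.rows:
            self.skipped += 1
            return False
        self.rows[idempotency_key] = payload
        self.applied += 1
        return True


def dual_write(old: IdempotentStore, new: IdempotentStore, key: str, payload: dict) -> None:
    """Send the same keyed write to both environments."""
    old.apply(key, payload)
    new.apply(key, payload)


old_env, new_env = IdempotentStore(), IdempotentStore()
dual_write(old_env, new_env, "order-42", {"status": "paid"})
dual_write(old_env, new_env, "order-42", {"status": "paid"})  # replayed event
```

After the replay, both stores hold one row and report one skipped write, which is the property to verify before relying on re-runs during reconciliation.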

9.2 Bulk Transfer

aws s3 sync s3://onprem-backups s3://cloud-landing
azcopy copy 'https://src.blob.core.windows.net' 'https://dst.blob.core.windows.net' --recursive

9.3 Databases

- PostgreSQL: Logical replication, pg_dump/restore, DMS/Database Migration Service
- MySQL: GTID-based replication; pt-table-checksum
- SQL Server: Always On availability groups, DMS
- MongoDB: Oplog tailing, Atlas Live Migration

10) Cutover Patterns

- Big Bang: single window; fastest but riskiest
- Phased: route subsets of users or endpoints gradually
- Read-Only Freeze: freeze writes, final sync, switch
- Shadow Traffic: mirror reads to new environment for soak testing

# DNS cutover with health check
aws route53 change-resource-record-sets --hosted-zone-id ZZZ --change-batch file://dns.json
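
The command above reads its changes from `dns.json`. As a rough sketch, the file for a weighted cutover could be generated like this. The record names and targets are placeholders, `weighted_change_batch` is a hypothetical helper, and health checks (attached per record via `HealthCheckId`) are left out for brevity.

```python
import json


def weighted_change_batch(name: str, targets: dict[str, tuple[str, int]], ttl: int = 60) -> dict:
    """Build a Route 53 change batch that UPSERTs one weighted CNAME per target.

    targets maps a set identifier to (dns_name, weight); Route 53 weights are 0-255.
    """
    changes = []
    for set_id, (value, weight) in targets.items():
        if not 0 <= weight <= 255:
            raise ValueError(f"weight for {set_id} must be between 0 and 255")
        changes.append({
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": value}],
            },
        })
    return {"Comment": "weighted cutover", "Changes": changes}


# Placeholder hostnames: send roughly 10% of resolutions to the new environment.
batch = weighted_change_batch(
    "api.example.com.",
    {"old": ("legacy.example.com", 90), "new": ("cloud.example.com", 10)},
)
dns_json = json.dumps(batch, indent=2)  # write this string to dns.json
```

A short TTL keeps rollback fast, since reverting is the same call with the weights swapped.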

11) Testing and Validation

- Functional: smoke tests, key flows
- Performance: load tests vs baseline SLOs
- Resilience: chaos tests (kill nodes, AZ failovers)
- Security: penetration tests; vuln scans; secrets checks
- Compliance: evidence generation and traceability

12) Security and Compliance

- Encryption: data at rest and in transit everywhere
- Key Management: KMS/KeyVault/Cloud KMS; rotation
- Identity: least privilege, short-lived creds, break-glass procedures
- Supply Chain: SBOM, SLSA attestations, signed images (Sigstore)
- Logging: tamper-evident, write-once storage for audits

13) Cost and FinOps

- Right-size instances; use autoscaling and spot/savings plans
- Storage lifecycle policies; compress logs; cold storage
- Tagging: cost allocation per app/team/environment
- Budget alerts and anomaly detection

service,baseline_usd,post_migration_usd,delta
compute,12000,8500,-3500
storage,4000,3100,-900
egress,1500,2000,+500
managed_db,0,2200,+2200
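
The per-service deltas only tell part of the story; the net effect is the sum. A small sketch that recomputes the table's deltas and the overall change (the figures are the illustrative ones from the table, and `cost_summary` is a hypothetical helper):

```python
import csv
import io

COSTS = '''service,baseline_usd,post_migration_usd
compute,12000,8500
storage,4000,3100
egress,1500,2000
managed_db,0,2200
'''


def cost_summary(csv_text: str) -> dict:
    """Total the baseline and post-migration columns and derive the deltas."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    baseline = sum(int(r["baseline_usd"]) for r in rows)
    post = sum(int(r["post_migration_usd"]) for r in rows)
    by_service = {
        r["service"]: int(r["post_migration_usd"]) - int(r["baseline_usd"]) for r in rows
    }
    return {
        "baseline": baseline,
        "post": post,
        "delta": post - baseline,
        "delta_pct": round(100 * (post - baseline) / baseline, 1),
        "by_service": by_service,
    }


summary = cost_summary(COSTS)
```

With these numbers the compute and storage savings outweigh the new egress and managed database charges, so the net monthly change is negative.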

14) Observability and SLOs

- Availability SLOs per service; error budgets
- Golden signals: latency, errors, saturation, traffic
- Runbooks with clear mitigations and paging policies

# Error ratio over 5m; divide by (1 - SLO target) to get the error budget burn rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
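
The query yields an error ratio. To express it as a burn rate, divide it by the error ratio the SLO allows. The sketch below shows that arithmetic; the function names are hypothetical and the 30-day window is an assumed default.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Burn rate = observed error ratio / allowed error ratio (1 - SLO)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction such as 0.999")
    return error_ratio / (1 - slo)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long the budget for the window lasts if the current rate holds."""
    return float("inf") if rate <= 0 else window_days / rate


# A 0.2% error ratio against a 99.9% SLO burns budget at about twice the sustainable pace.
rate = burn_rate(0.002, 0.999)
```

A burn rate of 1 means the budget lasts exactly the SLO window; at a rate of 2 it is gone in half that time.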

15) Operations Runbooks (Examples)

15.1 Rollback

- Trigger: elevated errors after cutover
- Actions: flip DNS/traffic manager to previous target; unfreeze old DB; revert IaC drift
- Validate: metrics back to baseline; announce status

15.2 Latency Spike

- Check: network path (VPC peering, NAT, security groups)
- Mitigate: increase instance size or replicas; enable HTTP keep-alive; warm caches

15.3 Data Drift Detected

- Halt writes; capture diffs; re-run CDC; reconcile by idempotency
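
Capturing diffs usually means comparing rows by primary key and classifying what is missing, extra, or changed. The sketch below works on two in-memory snapshots and treats the source as authoritative, which is an assumption; `diff_by_key` and `reconcile` are hypothetical names.

```python
def diff_by_key(source: dict, target: dict) -> dict:
    """Compare two {primary_key: row} snapshots and classify the drift."""
    missing = sorted(k for k in source if k not in target)
    extra = sorted(k for k in target if k not in source)
    changed = sorted(k for k in source if k in target and source[k] != target[k])
    return {"missing_in_target": missing, "extra_in_target": extra, "changed": changed}


def reconcile(source: dict, target: dict) -> dict:
    """Make target match source; re-running it on a reconciled target changes nothing."""
    drift = diff_by_key(source, target)
    for key in drift["missing_in_target"] + drift["changed"]:
        target[key] = source[key]
    for key in drift["extra_in_target"]:
        del target[key]
    return drift


src = {1: {"status": "paid"}, 2: {"status": "new"}}
dst = {1: {"status": "pending"}, 3: {"status": "stale"}}
first_pass = reconcile(src, dst)
second_pass = reconcile(src, dst)  # nothing left to fix
```

Keeping the first-pass drift report is useful evidence for the post-incident review.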

16) Reference Architectures

graph TD
  Users --> CDN
  CDN --> WAF
  WAF --> ALB[Load Balancer]
  ALB --> App[Containers/Functions]
  App --> DB[Managed DB]
  App --> Cache[Redis/Memcached]
  App --> Queue[SQS/PubSub/Service Bus]
  Logs --> SIEM

17) Templates and Checklists

- Pre-Migration Checklist: backups, IaC plans, alerts, dashboards, comms plan
- Cutover Checklist: approvals, go/no-go, rollback steps, paging rosters
- Post-Migration: validate SLOs, decommission old, cost checks

Extended FAQ

How do I choose between rehost and refactor?
Start with business drivers and constraints. Rehost to move quickly; refactor when long-term benefits justify investment.
What is a landing zone?
A pre-configured, secure foundation: identity, network, logging, guardrails, and baseline services delivered as code.
How do I minimize downtime?
Use CDC for DBs, dual-run services, shadow traffic, and plan a short read-only window for final sync.
What about data gravity?
Stage bulk data first; run CDC to capture deltas; prioritize large datasets early.
How do I manage credentials?
Use cloud-native secrets managers, short-lived tokens, and workload identity.
What are common migration risks?
Underestimated dependencies, lack of observability, insufficient testing, and missing rollback plans.
Do I need a service mesh?
Only if required by traffic policies, multi-cluster, or advanced observability; start simple.
Should I adopt containers or serverless?
Depends on workload; containers for steady-state, serverless for bursty/async.
How do I keep costs under control?
Right-size, autoscale, use reserved or spot capacity, apply lifecycle policies and tagging consistently, and monitor for cost anomalies.
What is the best cutover strategy?
Phased cutovers with traffic splitting and fast rollback; big bang only when low risk.
How do I handle PCI/HIPAA/GDPR?
Map controls to cloud services, automate evidence, and ensure data residency.
What is a migration factory?
A standardized set of processes and pipelines that repeat the migration steps reliably across apps.
Should I multi-cloud?
Only with clear ROI; otherwise, single-cloud depth wins initially.
How do I test performance?
Load testing and synthetic probes; compare against baseline SLOs; tune before cutover.
How do I roll back safely?
Plan DNS and routing rollback, maintain DB primary, and ensure step-by-step reversal is documented.

18) Provider-Specific Migration Patterns

18.1 AWS

- Compute: EC2 → ASG; ECS/EKS for containers; Lambda for async
- Data: RDS/Aurora; DMS for migration; ElastiCache; MSK/Kinesis
- Networking: VPC Lattice, PrivateLink, TGW; Route 53 weighted routing
- Security: IAM roles, KMS, CloudTrail, Config, GuardDuty, Security Hub
- Observability: CloudWatch + OTEL; X-Ray; managed Prometheus/Grafana

# Example: ECS Fargate service
resource "aws_ecs_cluster" "main" { name = "migrate-ecs" }
resource "aws_ecs_task_definition" "api" {
  family                   = "api"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 512
  memory                   = 1024
  container_definitions    = jsonencode([{ name = "api", image = var.image, portMappings = [{ containerPort = 8080 }] }])
}
resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.api.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }
}

18.2 Azure

- Compute: VMSS; AKS; Functions
- Data: Azure SQL/Database for PostgreSQL/MySQL; Cosmos DB
- Networking: Virtual WAN, Private Endpoints, App Gateway + Front Door
- Security: Defender for Cloud, Key Vault, Sentinel
- Observability: Azure Monitor, Log Analytics, Application Insights

resource aks 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'migrate-aks'
  location: resourceGroup().location
  // identity is a top-level property of the cluster, not part of properties
  identity: { type: 'SystemAssigned' }
  properties: {
    dnsPrefix: 'migrate-aks'
    agentPoolProfiles: [ { name: 'nodepool1', count: 3, vmSize: 'Standard_D4s_v5', mode: 'System' } ]
  }
}

18.3 GCP

- Compute: GCE MIGs; GKE; Cloud Run
- Data: Cloud SQL/AlloyDB; BigQuery; Memorystore; Pub/Sub
- Networking: VPC Service Controls, Private Service Connect, Cloud Armor, Cloud CDN
- Security: IAM Conditions, Cloud KMS, Security Command Center, Cloud Audit Logs
- Observability: Cloud Monitoring + OTEL; Cloud Trace; Managed Prometheus

# GKE deployment example
apiVersion: apps/v1
kind: Deployment
metadata: { name: api }
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata: { labels: { app: api } }
    spec:
      containers:
        - name: api
          image: gcr.io/myproj/api:latest
          ports: [{ containerPort: 8080 }]

19) Modernization Playbooks

- Containerization: Dockerfiles, base images, multi-stage builds, SBOMs
- CI/CD: trunk-based dev, canary releases, progressive delivery
- Config: externalize to env/secret stores; 12-factor alignment
- Data: managed databases, read replicas, caching, search services
- Interfaces: strangler fig to carve out services; API gateways

FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM gcr.io/distroless/nodejs20
WORKDIR /app
COPY --from=build /app/dist ./dist
CMD ["dist/server.js"]

20) Traffic Management and Release Strategies

- Weighted Routing: gradually shift % traffic to new stack
- Blue/Green: two identical environments; instant switch
- Canary: small subset first; auto-rollback on SLO breach
- Feature Flags: decouple deploy from release; per-segment toggles
- Shadow: mirror reads; compare responses; ensure idempotence

{
  "routing_policy": "weighted",
  "weights": { "blue": 0.8, "green": 0.2 },
  "rollback_thresholds": { "p95_ms": 200, "error_rate": 0.01 }
}
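
A configuration like this only helps if something evaluates it. The sketch below shows one possible controller step: compare observed metrics with the rollback thresholds, then either send all traffic back to blue or shift one more step toward green. The function names and the 0.2 step size are assumptions for illustration.

```python
import json

CONFIG = json.loads('''{
  "routing_policy": "weighted",
  "weights": { "blue": 0.8, "green": 0.2 },
  "rollback_thresholds": { "p95_ms": 200, "error_rate": 0.01 }
}''')


def breached(config: dict, observed: dict) -> list[str]:
    """Return the names of thresholds that the observed metrics exceed."""
    limits = config["rollback_thresholds"]
    return [name for name, limit in limits.items() if observed.get(name, 0) > limit]


def next_weights(config: dict, observed: dict, step: float = 0.2) -> dict:
    """Send all traffic back to blue on a breach; otherwise shift one step to green."""
    if breached(config, observed):
        return {"blue": 1.0, "green": 0.0}
    green = min(1.0, round(config["weights"]["green"] + step, 2))
    return {"blue": round(1.0 - green, 2), "green": green}


healthy = next_weights(CONFIG, {"p95_ms": 150, "error_rate": 0.001})
unhealthy = next_weights(CONFIG, {"p95_ms": 250, "error_rate": 0.001})
```

Treating a missing metric as zero keeps the sketch short; a production controller would more likely refuse to advance when a metric is unavailable.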

21) Database Cutovers and Patterns

- Dual-Write: write to old+new; reconcile; cut when consistent
- Read-Replica Promote: promote cloud replica to primary
- Read-Only Freeze: stop writes; final sync; switch
- Logical Decoupling: queue writes during cutover; drain after

-- Example verification queries
SELECT count(*) FROM orders_old WHERE updated_at >= now() - interval '1 hour';
SELECT count(*) FROM orders_new WHERE updated_at >= now() - interval '1 hour';

# Checksum compare (example)
pt-table-checksum --replicate=percona.checksums --databases app --tables orders

22) Case Studies (Condensed)

- FinTech Payments: phased cutover, RDS Multi-AZ, cost -28%, P95 -35%
- Media Streaming: multi-CDN, CloudFront + Cloudflare; egress optimized; cache hit ratio 92%
- Retail: AKS + Cosmos DB; blue/green; incidents -40%, deploys +3x

23) Service Catalog and Baselines

- Web App Baseline: LB + ASG/GKE/AKS + Redis + RDS/Cloud SQL; WAF + CDN
- API Baseline: Gateway (APIGW/APIM/API Gateway), JWT auth, rate limiting
- Data Baseline: managed DB, backup/restore, PITR, DR runbook
- Observability Baseline: metrics, logs, traces, SLOs, alerts

24) Blue/Green + Feature Flags Example

# Argo Rollouts blue/green strategy (excerpt from a Rollout spec)
strategy:
  blueGreen:
    activeService: api-svc
    previewService: api-svc-preview
    autoPromotionEnabled: false

// Feature flag guard
if (flags.isEnabled('newCheckout')) {
  return newCheckout()
} else {
  return oldCheckout()
}
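
A flag guard becomes a release tool once the flag can be enabled for a percentage of users. One common approach, sketched here, hashes the flag name and user ID into a stable bucket so each user gets the same answer on every request. The helper names are hypothetical and no particular flag service is implied.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of users, consistently per user."""
    return bucket(flag, user_id) < rollout_percent


# Raising the percentage only adds users; nobody who had the flag loses it.
enabled_at_10 = {u for u in range(200) if is_enabled("newCheckout", str(u), 10)}
enabled_at_50 = {u for u in range(200) if is_enabled("newCheckout", str(u), 50)}
```

Including the flag name in the hash keeps rollouts of different flags from always hitting the same users first.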

25) Infrastructure Examples

# S3 with lifecycle and encryption
resource "aws_s3_bucket" "data" { bucket = "migrate-data" }
resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default { sse_algorithm = "aws:kms" }
  }
}
resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    id     = "logs"
    status = "Enabled"
    filter {}
    transition {
      days          = 30
      storage_class = "GLACIER"
    }
  }
}

# Azure storage lifecycle
policy:
  rules:
    - enabled: true
      name: logs
      type: Lifecycle
      definition:
        actions:
          baseBlob:
            tierToArchive: { daysAfterModificationGreaterThan: 30 }

26) Extended FAQ (continued)

How do I handle stateful services?
Use managed data stores; for self-managed, ensure quorum and HA in the cloud; plan data migration first.
What’s the best way to test before cutover?
Shadow traffic and synthetic probes; compare metrics and responses.
How do I ensure security parity or better?
Baseline with CIS benchmarks, policies as code, and automated evidence.
Can I move mainframe workloads?
Possible via emulation or repurchase; consider business case and risk.
How to handle rate limits during migration?
Throttle upstreams; set circuit breakers; buffer spikes with queues.
What about edge caching?
Enable CDN with versioned assets; use stale-while-revalidate (SWR) for semi-dynamic pages.
Multi-region from day 1?
Start single-region; add replication and DR once stable.
When to pick serverless?
Event-driven, bursty workloads; beware cold starts and limits.
Observability first or later?
First—without it, risk and MTTR explode.
How do I enforce least privilege?
Role templates, permission sets, and periodic access reviews.
Do I need a platform team?
For scale, yes; they own landing zones, guardrails, and templates.
Handle data residency?
Deploy to region of data origin; enforce geofencing and policies.
What’s the fastest rollback?
DNS switchback + data freeze rollback plan.
Will costs spike?
Usually temporary during dual-run; plan budgets and sunset old infra quickly.
How to ensure performance improvements?
Modernize: managed caches, autoscale, right-size, optimized network paths.
Are IaC changes risky?
Use PR reviews, policy checks, and staged applies.
Vendor lock-in mitigation?
Abstraction layers where justified; open standards; data export paths.
Blue/green for DBs?
Harder; treat as upgrade with replicas and promotion.
Secrets during migration?
Rotate on cutover; use managers and short TTLs.
What if I miss a dependency?
Fallback plan: proxy calls back to old environment until updated.
Hybrid networking complexity?
Tame via hub-spoke and unified policy; automate routing.
How do I validate compliance?
Map controls to services; generate evidence; continuous audits.
Container image security?
SBOMs, scanning, signed images, admission policies.
Data egress costs?
Minimize cross-region/zone chatter; cache; keep data co-located.
Dedicated interconnect?
Use Direct Connect/ExpressRoute/Interconnect when latency/bandwidth needs are high.
Can I keep some systems on-prem?
Yes—hybrid is common; prioritize high-ROI moves.
What if cutover fails?
Rollback, root cause, rehearse again; adjust runbook.
How to manage schemas during dual-write?
Freeze schema; version events; idempotency keys.
Do I need service mesh?
Only for complex traffic policies; otherwise keep simple.
Best logging retention?
Meets compliance; tier to cold storage after 30–90 days.
Change management?
RFCs, CAB for high-risk changes, and clear go/no-go process.
Contracts with SaaS?
Verify rate limits, data residency, and SLAs.
Why use gateways?
Auth, rate limiting, observability, and routing control.
DR tests cadence?
Quarterly for critical systems.
Cost showback?
Tags and dashboards per product team; align accountability.
How to prevent config drift?
IaC as the only change path; drift detection.
Precompute for performance?
Yes—caching, read replicas, materialized views.
Are monoliths okay?
Yes if well-factored; migrate first, split later if needed.
Test data masking?
Mask PII in lower envs; synthetic data for edge cases.
Cross-cloud comparisons?
Assess managed equivalents and ops overhead.
Blue/green cost?
Higher temporarily; plan budgets.
How to involve security?
From day 0; define policies and sign-offs.
How to enforce SLOs?
SLO dashboards and alert policies before traffic.
Alert fatigue?
Tune thresholds; add burn-rate alerts; on-call rotations.
Who approves cutover?
App owner + platform lead + security on-call.
API compatibility?
Contract tests and consumer-driven contracts.
Session affinity?
Avoid when possible; else ensure sticky sessions during transition.
DB locks during final sync?
Plan for read-only window; minimize duration.
Performance regressions?
Compare SLO before/after; rollback if needed.
CDN strategy?
Versioned assets; cache policies; WAF at edge.
Gradual migrations?
Route one endpoint/team at a time; measure.
Load testing targets?
Exceed peak by 20–50%.
Endpoint deprecation?
Sunset plan; 410 responses; communication.
Resource quotas?
Protect shared clusters; avoid noisy neighbors.
Change windows?
Off-peak; global user base complicates.
Observability budget?
Target 2–5% of infrastructure spend.
What’s the MVP for LZ?
Identity, VPC, logging, baseline policies, CI/CD pipeline.
Gartner vs real-world?
Take frameworks; adapt pragmatically.
Agent sprawl?
Consolidate with OTEL collector and unified agents.
Final principle?
Migrate safely, measure relentlessly, modernize incrementally.

27) Organization, People, and Process

27.1 RACI Matrix (Example)

activity,responsible,accountable,consulted,informed
landing_zone,platform,cto,security,all
app_assessment,app_owner,product,platform,security
network_design,platform,platform_lead,security,all
compliance_mapping,security,ciso,legal,product
cutover,app_owner,product,platform,all

27.2 Stakeholder Map

- Product: scope, prioritization, go/no-go
- Platform: landing zones, guardrails, IaC
- Security: policies, evidence, sign-off
- SRE: observability, SLOs, incident response
- App Teams: service migrations, testing

27.3 Change Calendar

- Freeze periods: fiscal close, retail peak
- Windows: regional differences for global customers
- CAB: high-risk approvals 48h prior

28) Security Controls as Code

28.1 AWS (CloudFormation Guard / Config)

# cfn-guard rule snippet
rule s3_no_public {
  Resources.*[ Type == "AWS::S3::Bucket" ] {
    Properties.PublicAccessBlockConfiguration.BlockPublicAcls == true
    Properties.PublicAccessBlockConfiguration.BlockPublicPolicy == true
  }
}

{
  "ConfigRuleName": "encrypted-volumes",
  "Source": { "Owner": "AWS", "SourceIdentifier": "ENCRYPTED_VOLUMES" }
}

28.2 Azure Policy

{
  "properties": {
    "displayName": "Storage accounts should restrict network access",
    "policyRule": {
      "if": {
        "allOf": [
          { "field": "type", "equals": "Microsoft.Storage/storageAccounts" },
          { "field": "Microsoft.Storage/storageAccounts/networkAcls.defaultAction", "notEquals": "Deny" }
        ]
      },
      "then": { "effect": "audit" }
    }
  }
}

28.3 GCP Org Policy

constraint: constraints/iam.allowedPolicyMemberDomains
listPolicy:
  allowedValues: ["C0xxxxxxx"]  # placeholder: this constraint expects a Google Workspace customer ID, not a domain name

28.4 OPA/Rego for Terraform Plan Checks

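package tf.security

# Input is the JSON produced by `terraform show -json` for a saved plan.
# Fires when the Owner tag is absent; an empty-string tag needs its own rule.
violation[msg] {
  r := input.resource_changes[_]
  r.type == "aws_instance"
  not r.change.after.tags.Owner
  msg := sprintf("Missing Owner tag on %s", [r.address])
}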
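
For teams that want to prototype a tagging check before adopting OPA, the same idea can be sketched in Python against the plan JSON (`terraform show -json`), where planned values live under `resource_changes[].change.after`. The function name is hypothetical, and this version also treats blank tag values as missing.

```python
def missing_owner_tags(plan: dict) -> list[str]:
    """Return addresses of aws_instance resources whose planned state lacks an Owner tag."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:  # resource is being destroyed; nothing to tag
            continue
        tags = after.get("tags") or {}  # tags is null when none are set
        if not str(tags.get("Owner", "")).strip():
            offenders.append(rc["address"])
    return offenders


# Hand-written sample in the shape of a plan document (not real plan output).
SAMPLE_PLAN = {
    "resource_changes": [
        {"address": "aws_instance.ok", "type": "aws_instance",
         "change": {"after": {"tags": {"Owner": "fintech-platform"}}}},
        {"address": "aws_instance.untagged", "type": "aws_instance",
         "change": {"after": {"tags": None}}},
        {"address": "aws_instance.blank", "type": "aws_instance",
         "change": {"after": {"tags": {"Owner": " "}}}},
        {"address": "aws_instance.gone", "type": "aws_instance",
         "change": {"after": None}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {"tags": None}}},
    ]
}
offenders = missing_owner_tags(SAMPLE_PLAN)
```

In CI, a non-empty result would fail the job before `terraform apply` runs.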

29) Data Migration Runbooks

29.1 PostgreSQL Logical Replication

-- On source
CREATE PUBLICATION app_pub FOR TABLE orders, users;
-- On target
CREATE SUBSCRIPTION app_sub CONNECTION 'host=src port=5432 dbname=app user=repl password=***' PUBLICATION app_pub;

- Validate lag: pg_stat_subscription
- Cutover: read-only window, disable sub, promote target

29.2 MySQL

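-- Legacy syntax shown; MySQL 8.0.23+ also accepts CHANGE REPLICATION SOURCE TO ... and START REPLICA
CHANGE MASTER TO MASTER_HOST='src', MASTER_USER='repl', MASTER_PASSWORD='***', MASTER_AUTO_POSITION=1;
START SLAVE;

- Verify: SHOW SLAVE STATUS; Seconds_Behind_Master (SHOW REPLICA STATUS; Seconds_Behind_Source on MySQL 8.0.22+)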

29.3 SQL Server

-- Always On secondary in cloud; failover planned window

29.4 MongoDB

mongodump --uri "$SRC" --archive | mongorestore --uri "$DST" --archive

29.5 Verification

# Row counts
python verify_counts.py --src $SRC --dst $DST --tables orders,users
# Checksums
python verify_checksums.py --src $SRC --dst $DST --table orders --pk id
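
The scripts named above are not shown, so the following is only a sketch of what a checksum comparison might do: hash each row, then combine the hashes in a way that does not depend on row order, so source and target can be scanned without a matching ORDER BY. It assumes both sides return rows as tuples with the same Python types; `row_digest` and `table_fingerprint` are hypothetical names.

```python
import hashlib


def row_digest(row: tuple) -> int:
    """Hash one row to a 64-bit integer; repr() keeps None and '' distinct."""
    digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def table_fingerprint(rows) -> tuple[int, int]:
    """Return (row_count, checksum).

    The checksum is a sum modulo 2**64, so it ignores row order but still
    changes when a row is duplicated.
    """
    count, checksum = 0, 0
    for row in rows:
        count += 1
        checksum = (checksum + row_digest(row)) % 2**64
    return count, checksum


source_rows = [(1, "paid"), (2, "new")]
target_rows = [(2, "new"), (1, "paid")]  # same data, different scan order
match = table_fingerprint(source_rows) == table_fingerprint(target_rows)
```

On large tables the same comparison is usually run per primary-key range so a mismatch can be narrowed down to a small chunk.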

30) Cutover Runbooks (Detailed)

Pre-Cutover
- Final readiness review; freeze changes; backup snapshot; alert stakeholders

Cutover Steps
- Switch read traffic 10% → 25% → 50% → 100%
- Monitor SLOs (p95 latency, error rate)
- Promote DB replica or unfreeze writes

Rollback
- Repoint DNS to old; unfreeze old DB; postmortem and fix
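
The staged traffic switch can be expressed as a loop that advances only while the SLO check passes. In this sketch `healthy` is a caller-supplied function (for example, one that queries p95 latency and error rate after a soak period); it and `run_ramp` are hypothetical names.

```python
from typing import Callable


def run_ramp(steps: list[int], healthy: Callable[[int], bool]) -> tuple[int, list[int]]:
    """Walk the traffic percentages in order; stop at the first unhealthy step.

    Returns (final_percent, percentages_attempted). A final value of 0 means
    the ramp was rolled back to the old environment.
    """
    attempted = []
    for pct in steps:
        attempted.append(pct)
        if not healthy(pct):
            return 0, attempted
    return (steps[-1] if steps else 0), attempted


STEPS = [10, 25, 50, 100]
completed = run_ramp(STEPS, lambda pct: True)
rolled_back = run_ramp(STEPS, lambda pct: pct < 50)  # simulated failure at 50%
```

Returning the attempted steps makes it easy to record in the cutover log how far the ramp got before a rollback.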

31) Observability Kits

31.1 Prometheus Rules

groups:
  - name: migration
    rules:
      - record: service:latency_p95
        expr: histogram_quantile(0.95, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.02
        for: 10m
        labels: { severity: critical }

31.2 Alertmanager Routes

route:
  group_by: ['service']
  receiver: 'oncall'
  routes:
    - match: { severity: critical }
      receiver: 'pager'
receivers:
  - name: pager
    pagerduty_configs: [{ routing_key: '${PD_KEY}' }]  # substitute at deploy time; Alertmanager does not expand environment variables
  - name: oncall
    slack_configs: [{ channel: '#oncall', send_resolved: true }]

31.3 Grafana Dashboard (Excerpt)

{
  "title": "Migration Overview",
  "panels": [
    {"type":"stat","title":"P95 Latency","targets":[{"expr":"service:latency_p95"}]},
    {"type":"timeseries","title":"Error Rate","targets":[{"expr":"sum(rate(http_requests_total{status=~'5..'}[5m]))/sum(rate(http_requests_total[5m]))"}]}
  ]
}

32) Cost Modeling and Budgets

component,unit,count,price_usd,monthly_usd
EKS_nodes,node_hours,2190,0.12,262.8
RDS_aurora,instance_hours (2 x db.r6g.large),1440,0.35,504
S3_storage,GB,2000,0.023,46
CloudFront_egress,TB,6,85,510

- Budgets: set alerts at 50/80/100%
- Anomaly detection: investigate day-over-day spikes

33) Risk Register Template

id,description,likelihood,impact,owner,mitigation
R1,Data loss during DB cutover,Low,High,DBA,Backups + verified restores
R2,DNS propagation delays,Med,Med,SRE,Low TTL + health checks
R3,Cost overrun during dual-run,High,Med,FinOps,Time-box dual-run + sunset plan
R4,Security gaps in LZ,Low,High,Security,Policies as code + audit
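
To decide which risks to review first, a register like this is often ranked by likelihood times impact. The 1 to 3 scale in the sketch below is an assumed mapping for the Low/Med/High labels, and `ranked` is a hypothetical helper.

```python
import csv
import io

LEVEL = {"Low": 1, "Med": 2, "High": 3}  # assumed numeric scale

REGISTER = '''id,description,likelihood,impact,owner,mitigation
R1,Data loss during DB cutover,Low,High,DBA,Backups + verified restores
R2,DNS propagation delays,Med,Med,SRE,Low TTL + health checks
R3,Cost overrun during dual-run,High,Med,FinOps,Time-box dual-run + sunset plan
R4,Security gaps in LZ,Low,High,Security,Policies as code + audit
'''


def ranked(csv_text: str) -> list[dict]:
    """Return the risks sorted by likelihood x impact, highest first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["score"] = LEVEL[row["likelihood"]] * LEVEL[row["impact"]]
    # sorted() is stable, so ties keep their order from the register.
    return sorted(rows, key=lambda r: r["score"], reverse=True)


top_risk = ranked(REGISTER)[0]
```

A multiplicative score understates low-likelihood, high-impact risks such as data loss, so many teams also review every High-impact entry regardless of rank.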

34) Communication Plan Templates

- T-7 days: stakeholder email with plan and windows
- T-24 hours: reminder with runbook link
- T-0: change started in channel; live updates
- T+1 hour: status summary; any issues and mitigations
- T+1 day: postmortem summary if needed

35) Training and Enablement

- App team workshops: IaC, observability, runbooks
- Security briefings: secrets, policies, evidence
- Ops exercises: failover, rollback, chaos drills

36) Post-Migration Optimization Checklist

- Right-size instances; enable autoscaling
- Enable CDN, caching layers, and DB read replicas
- Archive logs; tune lifecycle; compress
- Review SLOs and alerts; adjust thresholds
- Decommission old infra to end dual-run costs

37) Service Catalog YAML (Excerpt)

services:
  payments-api:
    tier: 1
    owner: fintech-platform
    slos:
      latency_p95_ms: 150
      availability_30d: 99.9
    dependencies: [postgresql, redis, kafka]
    runbooks: [runbooks/payments-latency.md, runbooks/payments-errors.md]

38) Compliance Mapping Matrices

control,cloud_service,implementation,evidence
encryption_at_rest,KMS,CMK enforced,Screenshot + policy
logging,CloudTrail,Multi-region + immutable bucket,Trail config + bucket policy
separation_of_duties,IAM,RBAC + break-glass,Access reviews

39) SRE SLO Document Template

Service: Checkout API
SLOs:
- Availability: 99.95% monthly
- Latency: P95 < 150ms
- Error Rate: < 1%
Error Budget Policies:
- Burn rate > 2x over 1h → roll back latest changes
- Burn rate > 5x over 10m → immediate rollback + freeze
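
The template's numbers translate directly into code. The sketch below converts an availability target into allowed downtime and maps burn rates to the two policy actions; the function names and the 30-day month are assumptions.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return days * 24 * 60 * (1 - availability_pct / 100)


def policy_action(burn_10m: float, burn_1h: float) -> str:
    """Map burn rates to the template's actions; the fast window is checked first."""
    if burn_10m > 5:
        return "rollback_and_freeze"
    if burn_1h > 2:
        return "rollback"
    return "none"


# 99.95% over 30 days leaves roughly 21.6 minutes of downtime.
budget_minutes = allowed_downtime_minutes(99.95)
```

Checking the short window first means a sharp spike triggers the stronger response even when the hourly average still looks acceptable.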

40) Blueprints

40.1 Rehost → Replatform → Refactor Journey

- Phase 1: Rehost to VM with managed networking and logging
- Phase 2: Replatform to containers and managed DB
- Phase 3: Refactor critical workflows to event-driven, add caching and queues

40.2 Strangler Fig Pattern

- Place gateway in front; route legacy vs new paths
- Migrate endpoints incrementally; decommission legacy

Extended FAQ (continued)

How do we handle identity federation?
Use SAML/OIDC with the cloud provider; map roles to groups and enforce least privilege.
What’s the first IaC to write?
Landing zone primitives: identity, VPC, logging buckets, baseline policies.
How to model multiple environments?
Use separate accounts/subscriptions/projects with shared org policies.
How to manage secrets?
Cloud-native secrets managers and short-lived tokens; rotate on cutover.
How to measure migration success?
SLO adherence, cost delta, deploy frequency, MTTR, and incident rates.
How to reduce DNS risks?
Low TTL, health-checked records, staged rollouts, and fast rollback entries.
Are lift-and-shift VMs okay?
Yes as a stepping stone; modernize after stability.
Should we dual-run?
Yes for high risk; define strict time-box and budget.
How to prevent config drift?
IaC as single source; drift detection jobs; block manual changes.
What metrics for go/no-go?
Health checks, error rate, p95 latency, and synthetic journey success.
How to migrate Redis?
Create managed replica, sync, then point apps; or rebuild cache post-cutover.
Are blue/green and canary both needed?
Pick based on risk; canary for incremental, blue/green for quick switch.
How to plan storage classes?
Lifecycle policies: hot, warm, cold; align with access patterns.
Should we centralize DNS?
Yes under a single authoritative system; automate changes.
How to test chaos safely?
Start in staging; define blast radius and rollback conditions.
How to reduce cloud egress?
Keep compute close to data; cache; compress; minimize cross-region.
Are managed DBs always better?
Usually; offload ops; validate limits and costs.
How to secure pipelines?
OIDC to cloud, signed artifacts, least-privilege deploy roles.
How to handle shared services?
Hub teams manage; clear SLOs and quotas to avoid contention.
What’s the secret to cutovers?
Preparation, observability, and a rehearsed rollback.
How to estimate migration time?
Pilot one service, extrapolate, then adjust with real data.
What if we can’t freeze writes?
Dual-write with idempotency and conflict resolution.
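
One way to picture dual-write with idempotency is the in-memory sketch below. It assumes every write carries a unique operation ID and a monotonically increasing version, and it resolves conflicts by keeping the highest version; `IdempotentStore` and `dual_write` are hypothetical stand-ins for the real datastores.

```python
class IdempotentStore:
    """In-memory stand-in for one side of a dual-write (old or new datastore)."""

    def __init__(self):
        self.rows = {}          # key -> (version, value)
        self.applied = set()    # operation IDs already processed

    def apply(self, op_id: str, key: str, version: int, value: str) -> bool:
        """Apply a write once; return True only if the row changed."""
        if op_id in self.applied:
            return False        # replayed operation: idempotent no-op
        self.applied.add(op_id)
        current = self.rows.get(key)
        if current is not None and current[0] >= version:
            return False        # conflict: keep the newer (or equal) version
        self.rows[key] = (version, value)
        return True


def dual_write(old: IdempotentStore, new: IdempotentStore,
               op_id: str, key: str, version: int, value: str):
    """Send the same operation to both stores; safe to retry on partial failure."""
    return old.apply(op_id, key, version, value), new.apply(op_id, key, version, value)
```

Because replays and stale versions are no-ops, a retry after a partial failure converges both stores to the same rows.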
How to deal with chatty east-west traffic?
Co-locate services; use efficient protocols; compress payloads.
Are service meshes required?
No—use only if traffic policies or mTLS needs justify complexity.
Should we use platform teams?
Yes to enable app teams with templates and guardrails.
Observability vs logging?
Both; traces show flow, logs give context, metrics drive alerts.
How to enforce tagging?
Policy checks in CI and admission controls in clusters.
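
A CI-side tag check can be as small as the sketch below. The required tag names and the `resources` mapping (resource name to tags, for example parsed from an IaC plan) are assumptions for illustration.

```python
REQUIRED_TAGS = frozenset({"owner", "cost-center", "environment"})  # assumed policy


def missing_tags(resources: dict[str, dict[str, str]],
                 required: frozenset[str] = REQUIRED_TAGS) -> dict[str, list[str]]:
    """Map each non-compliant resource name to its sorted list of missing tags."""
    violations = {}
    for name, tags in resources.items():
        # Empty or whitespace-only values count as missing.
        present = {key for key, value in tags.items() if value.strip()}
        gaps = sorted(required - present)
        if gaps:
            violations[name] = gaps
    return violations
```

A pipeline step would fail the build whenever the returned mapping is non-empty.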
Who owns SLOs?
App teams define and own; platform provides tooling.
Data sovereignty?
Keep data in-region; enforce via org policies.
Cost anomalies?
Daily spend dashboards; alert on spikes and on shrinking budget runway.
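
Spike and runway alerts can be approximated with simple statistics over daily spend. The sketch below assumes a trailing window of daily totals and a 3-sigma spike rule; both are illustrative choices, not a provider feature.

```python
from statistics import mean, stdev


def is_spend_anomaly(history: list[float], today: float, k: float = 3.0) -> bool:
    """Flag today's spend if it exceeds the trailing mean by k sample std devs."""
    if len(history) < 2:
        return False  # not enough data to estimate variation
    return today > mean(history) + k * stdev(history)


def runway_days(budget_remaining: float, history: list[float]) -> float:
    """Days until the remaining budget is spent at the trailing average rate."""
    daily = mean(history) if history else 0.0
    return float("inf") if daily <= 0 else budget_remaining / daily
```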
Vendor SLAs?
Read and test; plan for failure and fallbacks.
What about DR?
Backups + restore drills; consider warm standby for tier-1.
Are queues necessary?
Not always, but they absorb spikes and decouple services; most migrations benefit.
Multi-account setup?
Yes for blast radius and billing separation.
How to handle schema mismatches?
Version schemas; use adapters; migrate gradually.
Are feature flags safe?
Yes when well-governed; avoid long-lived flags.
Should we pin regions?
Yes; avoid accidental multi-region costs.
SOC 2 evidence?
Automate; show change approvals, scans, and monitoring.
Security exceptions?
Time-bound with owners; tracked in register.
Cloud quotas?
Increase ahead of cutover; monitor failures due to limits.
Threat modeling?
Yes for internet-facing workloads; validate mitigations.
Audit readiness?
Run a dry audit; fix gaps before real one.
Pen tests?
Schedule post-cutover; remediate promptly.
Zero trust?
Adopt identity-aware proxies and policy enforcement.
WAF necessity?
Recommended; block known-bad traffic and layer 7 attacks.
DDoS protection?
Provider services plus edge rate limiting and caching.
How to manage IP allowlists?
Prefer identity; if required, automate updates.
Do we need HSMs?
KMS backed by HSM meets most needs; dedicated HSM for strict regs.
License portability?
Clarify contracts; consider marketplace images.
Time sync issues?
Rely on provider NTP; monitor clock drift.
How to replatform cron jobs?
Serverless schedulers or containerized jobs.
WebSockets in cloud?
Use a managed WebSocket service, or tune ALB/NLB idle timeouts and keep-alives.
Latency budgeting?
Allocate per-hop budgets; measure.
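
Per-hop budgeting reduces to two checks: the budgets must fit inside the end-to-end SLO, and each measured hop must fit inside its budget. The hop names and numbers in this sketch are hypothetical.

```python
def over_budget_hops(budgets_ms: dict[str, float],
                     measured_ms: dict[str, float],
                     slo_ms: float) -> list[str]:
    """Return hops whose measured latency exceeds their budget, worst first."""
    total = sum(budgets_ms.values())
    if total > slo_ms:
        raise ValueError(f"budgets sum to {total} ms, above the {slo_ms} ms SLO")
    overruns = {
        hop: measured_ms[hop] - budget
        for hop, budget in budgets_ms.items()
        if measured_ms.get(hop, 0.0) > budget
    }
    return sorted(overruns, key=overruns.get, reverse=True)
```

Hops without a measurement are treated as within budget here; a stricter variant could report them as unknown.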
API throttling?
Gateways with consistent policies; backpressure.
Data archival?
Move to cold storage with retrieval SLAs.
Secrets sprawl?
Centralize and rotate; remove from code and images.
Image registries?
Use provider registries; sign and scan images.
What about Terraform state?
Remote state with locking and encryption.
Shared VPCs?
Useful for central control; clear ownership boundaries.
Bearer tokens vs mTLS?
JWT for authN; mTLS for mutual trust; sometimes both.
Runtime policy?
Admission controls, PSP replacements such as Pod Security Admission, and eBPF-based runtime security.
Deleting old infra?
Planned decommission; confirm traffic zero; archive data.
Testing data privacy?
Mask and subset; synthetic data where possible.
Mobile clients impact?
DNS/TLS caching; plan gradual rollout.
What if canary fails?
Automatic rollback; analyze, fix, retry later.
Blue/green DB pitfalls?
Data divergence; prefer replica promotion.
Log explosion?
Sample; set retention; aggregate.
Static site hosting?
Object storage + CDN with OAC/private access.
App secrets in env vars?
Short-lived only; prefer injected files.
SLA vs SLO?
SLO internal; SLA contractual; align but don’t conflate.
Incident templates?
Standardize timelines and communications.
Is VPN enough?
Use private endpoints and proper segmentation.
Network ACLs vs SGs?
Prefer SGs; ACLs for coarse-grained control.
How to track dependencies?
SBOMs, service catalog, and runtime tracing graphs.
Rollback DB schema?
Backward-compatible changes; feature flags for behavior.
Pre-warm caches?
Yes for hot endpoints; reduce cold-start spikes.
TLS certs?
Use managed certificates (ACM, Key Vault); automate renewals.
GraphQL migrations?
Persisted queries, schema federation, deprecations.
Messaging in cutovers?
Queues buffer; idempotent consumers handle replays.
Final tip?
Practice runbooks; measure everything; keep rollbacks fast.

41) Additional Runbooks

41.1 NAT Gateway Cost Spike

- Identify high egress sources (flow logs)
- Introduce VPC endpoints/PrivateLink
- Cache and compress; regionalize traffic
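
The first step, finding the high-egress sources, can be sketched as a small aggregation over VPC flow log lines. This assumes the default (version 2) space-separated flow log format; `top_talkers` is a hypothetical helper, and real logs would usually be queried in a log analytics tool instead.

```python
from collections import Counter


def top_talkers(flow_log_lines: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Sum bytes per source address from default-format (version 2) flow log lines."""
    bytes_by_source = Counter()
    for line in flow_log_lines:
        fields = line.split()
        # Default format: version account-id interface-id srcaddr dstaddr
        # srcport dstport protocol packets bytes start end action log-status
        if len(fields) < 14 or not fields[9].isdigit():
            continue  # skip headers and NODATA/SKIPDATA records ("-" fields)
        bytes_by_source[fields[3]] += int(fields[9])
    return bytes_by_source.most_common(n)
```

Sources that dominate the result and talk mostly to provider services are the natural candidates for VPC endpoints.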

41.2 WAF False Positives

- Lower rule aggressiveness; add allowlist for critical paths
- Monitor logs; test safely; keep emergency disable switch

41.3 IAM Deny Lockout

- Use break-glass role stored securely
- Validate access; fix policy; rotate break-glass creds

42) Templates: RFC, Go/No-Go, Postmortem

RFC
- Context, Options, Decision, Risks, Rollback

Go/No-Go
- Criteria, Owners, Evidence, Verdict

Postmortem
- Timeline, Impact, Root Causes, Actions, Owners, Due Dates

43) IaC Samples (More)

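# Route53 weighted record (blue side; a matching "green" record with its own weight completes the split)
resource "aws_route53_record" "api" {
  zone_id        = var.zone_id
  name           = "api"
  type           = "A"
  set_identifier = "blue"

  weighted_routing_policy {
    weight = 80
  }

  # Alias to the blue load balancer; unhealthy targets drop out of rotation
  alias {
    name                   = aws_lb.blue.dns_name
    zone_id                = aws_lb.blue.zone_id
    evaluate_target_health = true
  }
}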

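# Azure Front Door routing (illustrative pseudo-config, not a real schema; Front Door sets weights on origins within an origin group)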
routes:
  - name: blue
    patterns: ["/*"]
    origin: blue-origin
    weight: 80
  - name: green
    patterns: ["/*"]
    origin: green-origin
    weight: 20

# GCP Cloud Run traffic split
traffic:
  - revisionName: api-blue
    percent: 80
  - revisionName: api-green
    percent: 20
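
The 80/20 splits in these samples are one step of a staged shift. A minimal sketch of generating the full sequence of weights follows; the step size and the `shift_plan` name are assumptions, and each step would be applied through the IaC above only after the previous one has held steady.

```python
def shift_plan(start_green: int, step: int) -> list[tuple[int, int]]:
    """Return (blue, green) percentage pairs from the starting split to 0/100."""
    if not 0 <= start_green <= 100:
        raise ValueError("start_green must be between 0 and 100")
    if step <= 0:
        raise ValueError("step must be positive")
    plan = []
    green = start_green
    while True:
        plan.append((100 - green, green))  # each pair always sums to 100
        if green == 100:
            break
        green = min(100, green + step)     # never overshoot full cutover
    return plan
```

Rolling back is the same plan read in reverse, which is one reason to keep the steps explicit.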

44) Extended FAQ (361–420)

How to avoid over-permissioned IAM?
Use IAM Access Analyzer findings plus periodic least-privilege policy reviews.
Golden signals thresholds?
Start with historical baselines + industry norms; iterate.
Automatic rollbacks?
Hook canary metrics to deployment controller.
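
The decision a deployment controller makes from canary metrics can be sketched as a comparison against the baseline. The metric names and tolerances below are assumptions; real controllers usually evaluate several intervals before acting.

```python
def canary_verdict(baseline: dict[str, float], canary: dict[str, float],
                   max_error_delta: float = 0.005,  # assumed: +0.5 points of errors
                   max_p95_ratio: float = 1.2) -> str:  # assumed: +20% latency
    """Return "rollback" if the canary is clearly worse than baseline, else "promote"."""
    if canary["error_rate"] > baseline["error_rate"] + max_error_delta:
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * max_p95_ratio:
        return "rollback"
    return "promote"
```

Comparing against a live baseline rather than fixed thresholds keeps the check meaningful when overall traffic conditions change during the rollout.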
Handling regional outages?
Failover DNS, warm standby, or active-active for tier-1.
Encrypt everything?
Yes; at rest and in transit; keys managed via KMS.
NACL or SG for blocking?
Security groups are allow-only and apply per instance; use NACLs for explicit denies at the subnet level.
DB connection storms?
Use pools; throttle concurrent migrations.
Compliance in pipelines?
Policy checks and attestations in CI/CD.
Trace propagation?
Use W3C trace-context end-to-end.
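
For reference, the `traceparent` header defined by W3C Trace Context has the shape `version-traceid-parentid-flags`. The sketch below validates version `00` headers only; `parse_traceparent` is a hypothetical helper, and production services would normally rely on their tracing SDK's propagator instead.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, sampled) or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)
```

A quick check like this at the gateway helps spot hops that drop or mangle the header during migration.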
Monitoring cold-starts?
Track init latency; pre-warm; provisioned concurrency if needed.
Standard error payloads?
Yes; helps observability and clients.
Resiliency budgets?
Allocate time/resources each quarter to hardening.
Prioritize which apps first?
Low-risk with high learning value; build muscle.
Keep ALB/NLB idle timeouts in sync?
Yes; set app keep-alive timeouts slightly above the load balancer idle timeout so the app never closes a connection the balancer still treats as open.
Multi-cloud DNS?
Use global DNS with health checks; consistent records.
Expose internal APIs?
Prefer private ingress and VPN/SDP.
Network observability?
VPC flow logs, firewall logs, eBPF.
Rotate app secrets on cutover?
Yes, to invalidate leaked old secrets.
Alert on budget burn?
Use anomaly and burn-rate alerts.
Terminate TLS at edge or app?
Edge for static; mTLS to app for sensitive paths.
S3 consistency?
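Strong read-after-write consistency for new objects, overwrites, deletes, and list operations (since December 2020); the old eventual-consistency caveat for overwrites no longer applies.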
SSD vs HDD?
SSD for DBs and hot storage; HDD/cold tiers for archives.
CDN stale-if-error usage?
Recommended for resiliency.
API idempotency?
Idempotency keys for writes; retries safe.
Validate latency budgets?
Distributed traces + per-hop metrics.
Secure logging?
No secrets; PII minimization; masking.
Access reviews cadence?
Quarterly or after personnel changes.
Canary percentage?
1–5% typical; increase gradually.
SLA impacts during migration?
Negotiate windows; communicate proactively.
Non-prod parity?
Close enough to catch issues; avoid cost blowups.
WAF tuning period?
Monitor for a week; document exceptions.
Shadow traffic sampling?
1–10%; ensure no side effects.
DB PITR requirements?
Set and test; ensure retention meets policy.
Object storage lifecycles?
Automate transitions; monitor retrieval costs.
Threat modeling templates?
STRIDE or PASTA adapted to cloud services.
BCDR documentation?
Concise runbooks with RTO/RPO and steps.
CTO dashboard?
KPIs: SLOs, cost, deployment frequency, incidents.
Success celebration?
Close the loop; recognize teams; document learnings.
Kill switches?
Implement traffic and feature kill switches.
Final reminder?
Practice, instrument, and keep rollback simple.

45) Additional Examples

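# Service mesh optional policy (if adopted); illustrative, not a specific mesh's schema
mtls:
  mode: PERMISSIVE  # accepts plaintext and mTLS during migration; tighten to STRICT once all clients use mTLS
rateLimit:
  requestsPerUnit: 100
  unit: minute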

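// Synthetic checks (Node 18+; swap console.error for your alerting hook)
setInterval(async () => {
  try {
    const res = await fetch('https://api.example.com/health')
    if (!res.ok) console.error(`Health degraded: HTTP ${res.status}`)
  } catch (err) {
    // Network failures reject the fetch promise; treat them as degraded too
    console.error('Health check failed:', err)
  }
}, 60000)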

46) More Templates

Runbook Template
- Trigger
- Severity
- Steps
- Owners
- Rollback
- Validation

Comms Template
- Audience
- Channel
- Cadence
- Message

47) Final Mega FAQ (421–460)

Canary metrics to watch?
P95, error rate, CPU/mem, queue lag.
Rollback SLO?
Within 15 minutes for tier-1.
How to reduce cold-starts?
Provisioned concurrency or pre-warmers.
CDN config drift?
Manage as code; validate via tests.
Long-lived feature flags?
Avoid; retire after rollout.
Secrets scope minimization?
Per-service, per-env; least privilege.
Multi-cloud identity?
Central IdP mapping to provider roles.
Proof of compliance?
Automated evidence: trails, scans, approvals.
Error budget policies?
Burn-rate alerts with rollback rules.
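
Burn rate is the observed error rate divided by the error budget (1 minus the SLO). The sketch below pairs a short and a long window before recommending rollback; the 99.9% SLO and the 14.4 threshold are assumptions (14.4 is a commonly cited fast-burn value for a one-hour window against a 30-day budget).

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_rate / budget


def should_roll_back(short_window_errors: float, long_window_errors: float,
                     slo: float = 0.999, threshold: float = 14.4) -> bool:
    """Require both windows to breach so a brief blip does not trigger rollback."""
    return (burn_rate(short_window_errors, slo) >= threshold
            and burn_rate(long_window_errors, slo) >= threshold)
```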
On-call rotations during migration?
Augment with platform + security.
DR doc location?
Versioned repo with restricted access.
DB maintenance windows?
Schedule and communicate early.
How to audit IaC?
Static analysis, policy checks, drift reports.
Observability SLAs?
Collector uptime and storage retention.
Rate-limit upstreams?
Protect dependencies during cutover.
HSTS and redirects?
Configure at the edge; keep redirects SEO-safe.
Mobile app releases?
Staged rollout in app stores aligned with backend.
Experiment flags?
Separate from kill switches and feature rollouts.
Chaos in prod?
Only with tight blast radius and rollback.
MTBF vs MTTR focus?
Optimize MTTR first; MTBF improves as systems mature.
Backfill analytics?
ETL jobs post-cutover; verify counts.
Vendor risk?
Assess SOC 2/ISO; run pen tests if needed.
Edge auth?
JWT validation at edge; propagate claims.
Multi-tenant data isolation?
Row-level policies or per-tenant DBs.
Program increments?
Plan migration tasks within PI cadence.
App kill switch?
Feature toggle to disable high-risk features.
DB lock monitoring?
Alert on lock waits; tune queries.
Compliance runbooks?
Evidence capture steps for each control.
Unused resources cleanup?
Automate detection; scheduled cleanup.
Mgmt reporting?
Weekly metrics snapshot with trends.
Optimization backlog?
Track known perf/cost items.
Baseline parity tests?
Golden journeys validated against old env.
Infrastructure drift rollbacks?
Re-apply known-good IaC; block manual edits.
Change freeze exceptions?
Emergency security patches only.
DORA metrics target?
Lead time < 1 day; deploy freq daily; MTTR < 1h; change fail < 15%.
Support playbook?
Tiered escalation with clear SLAs.
Security champions?
Embed in app teams for shared ownership.
Phased decommission?
Archive, disable, delete.
Retro cadence?
After each wave; capture learnings.
Last advice?
De-risk with observability, automate guardrails, and keep rollback simple.

Micro FAQ (461–500)

Observability in staging vs prod?
Keep parity; lower retention in staging.
Alert noise control?
Group, route, and tune; add inhibit rules.
Network path tests?
Traceroute/mtr; synthetic pings between hubs/spokes.
Policy exceptions log?
Time-bound; reviewed monthly.
On-call handoffs?
Structured notes; shared dashboards.
CDN invalidations automation?
On deploy; version assets to avoid purges.
Blue/green drift?
Compare configs; enforce via IaC.
Runbook DRY?
Shared templates and snippets.
Access to prod logs?
Least privilege; redact PII; audit access.
Backup encryption?
KMS with rotated keys.
Canaries for DB queries?
Yes; run representative SQL checks.
Cost regression tests?
Estimate diff from IaC plan; enforce budgets.
Health endpoints?
Include dependency checks; avoid heavy logic.
Internal SLAs?
Define per-platform service.
Staging data sync?
Mask PII; subset.
Error tracking?
Sentry-like tools with release mapping.
Audit trails for changes?
Git history + cloud control plane logs.
Throttle purge APIs?
Yes; avoid stampedes.
Release notes?
Summaries per wave; highlight risks.
Security drills?
Phishing tests; secret leak response.
Tool sprawl?
Consolidate; standardize.
Platform backlogs?
Capacity reserved for guardrails.
Compliance sign-offs?
Gates in pipelines.
Hotfix process?
Use an expedited path with explicit approvals; merge the fix back to the main branch afterward.
Data contracts?
Enforce with schema registry.
Synthetic user journeys?
Critical flows tested continuously.
Provider limits?
Track and pre-increase.
Multi-tenant noise?
Quotas and isolation.
Region evacuation test?
Yearly; document timings.
Auto-scaling tests?
Load and observe scale-up/down.
Cache coherency?
Invalidate on write; use stale-while-revalidate (SWR) patterns for reads.
Message ordering?
Use keys/partitions; order per entity.
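
Per-entity ordering depends on every message for an entity landing on the same partition. A minimal sketch of a stable key-to-partition mapping follows; it is an illustration only and not the partitioner any particular broker uses.

```python
import hashlib


def partition_for(entity_key: str, partitions: int) -> int:
    """Map an entity key to a partition with a hash that is stable across processes."""
    if partitions <= 0:
        raise ValueError("partitions must be positive")
    # Python's built-in hash() is salted per process, so use a real digest.
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Changing the partition count remaps keys, so ordering guarantees only hold while the count stays fixed, which is worth planning for before cutover.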
Infra security scans?
IaC + image + runtime.
Vulnerability SLAs?
Critical < 7 days; high < 14 days.
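
Tracking these SLAs is simple date arithmetic. The sketch below uses the 7-day and 14-day limits stated above and treats a finding as overdue once its age exceeds the limit, which is an assumption about how the boundary is read.

```python
from datetime import date

SLA_DAYS = {"critical": 7, "high": 14}  # limits from the answer above


def days_overdue(severity: str, found_on: date, today: date) -> int:
    """Days past the remediation deadline (0 if within SLA or severity untracked)."""
    limit = SLA_DAYS.get(severity.lower())
    if limit is None:
        return 0  # severities without a stated SLA are not tracked here
    return max(0, (today - found_on).days - limit)
```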
Production access rules?
Break-glass + session recording.
Change windows for global?
Follow-the-sun or multiple windows.
Dependency maps?
Generated from traces and SBOMs.
Logs retention balance?
Meet policy; cost-optimize tiers.
Post-cutover survey?
Capture feedback from users and ops.
Final checkpoint?
All SLOs green, costs tracked, old infra decommissioned.

Closing Notes

Migrations thrive on preparation, observability, and disciplined rollback. Treat the landing zone and factory as products.

End of guide.