AWS Architecture Patterns: Well-Architected in Practice (2025)

Oct 26, 2025
Tags: aws, architecture, well-architected, patterns

Use AWS primitives to build resilient, secure, and cost-efficient systems. This guide provides practical patterns mapped to the Well-Architected pillars.

Executive summary

  • Decouple with queues/streams; apply least privilege and guardrails
  • Design for failure: multi-AZ, retries with jitter, idempotency
  • Measure: SLOs, error budgets, cost per request
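The "Measure" bullet can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the 99.9% target, and the request and cost figures are assumptions, not values from a real workload.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return allowed failures, remaining budget, and fraction of budget consumed."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining": allowed - failed_requests,
        "consumed_fraction": failed_requests / allowed,
    }


def cost_per_request(monthly_cost_usd: float, monthly_requests: int) -> float:
    """Unit cost: total spend divided by requests served in the same period."""
    return monthly_cost_usd / monthly_requests


# Hypothetical month: 1M requests, 250 failures, 99.9% availability SLO.
budget = error_budget(0.999, 1_000_000, 250)
print(round(budget["allowed_failures"]), round(budget["consumed_fraction"], 2))
print(cost_per_request(20_800, 50_000_000))
```

Tracking the consumed fraction over time gives an objective signal for when to slow releases and spend effort on reliability instead.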

Reference patterns

  • Serverless APIs (API Gateway + Lambda + DynamoDB)
  • Container platforms (ECS/Fargate/EKS) with ALB/NLB
  • Data lakes (S3 + Glue + Athena + Lake Formation)
  • Event-driven (SNS/SQS/EventBridge/Kinesis)

Security

  • IAM boundaries; SCPs; centralized logging; secret rotation (Secrets Manager)

Reliability

  • Multi-AZ; backups and PITR; health checks; chaos drills

Performance

  • Caching (CloudFront/Redis), async processing, right instance families

Cost

  • SP/RI/Graviton; lifecycle policies; cost allocation tags; budgets/alerts

FAQ

Q: ECS or EKS?
A: ECS for simplicity on AWS; EKS for portability/ecosystem needs.


Executive Summary

This guide distills AWS architecture patterns aligned to the Well-Architected Framework. It includes production-ready reference architectures, IaC, security baselines, observability, DR, cost/sustainability strategies, and runbooks.


Well-Architected Pillars (Actionable Controls)

Operational Excellence

runbooks:
  - name: "ALB 5xx surge"
    steps: ["check recent deploys", "rollback", "scale ASG", "inspect app logs"]
change_management:
  approvals: 1
  deployment: blue_green

Security

security_baseline:
  kms: required
  secrets_manager: required
  guardduty: enabled
  security_hub: enabled
  config:
    ruleset: aws-foundational
  inspector:
    v2: enabled
  sso: enforced

Reliability

reliability:
  multi_az: true
  multi_region: critical_services
  health_checks: route53
  autoscaling: target_tracking

Performance Efficiency

performance:
  graviton: preferred
  elb: alb for http, nlb for tcp
  caching: cloudfront+elasticache

Cost Optimization

cost:
  compute_optimizer: enabled
  savings_plans: 1y partial
  s3_lifecycle: glacier_deep_archive after 180d

Sustainability

sustainability:
  rightsize: continuous
  idle_shutdown: nonprod_nights
  managed_services: preferred

Multi-Account Landing Zone (Control Tower)

ou:
  - security
  - infrastructure
  - workloads
  - sandbox
scps:
  - deny_root
  - deny_unapproved_regions
  - deny_iam_star_actions
identity:
  sso: iam_identity_center
  permission_sets: [admin, poweruser, read_only]
{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "DenyRoot", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": {"StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}}}
  ]
}

VPC Networking Patterns

graph TD
Hub((Hub VPC)) --- TGW[Transit Gateway]
Spoke1((Spoke VPC 1)) --- TGW
Spoke2((Spoke VPC 2)) --- TGW
PrivateLink[Interface Endpoints] --- Spoke1
# Terraform VPC module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  name    = "workloads"
  cidr    = "10.0.0.0/16"
  azs     = ["us-east-1a","us-east-1b"]
  private_subnets = ["10.0.1.0/24","10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24","10.0.102.0/24"]
  enable_nat_gateway = true
}
# Interface Endpoint (CloudFormation)
Type: AWS::EC2::VPCEndpoint
Properties:
  VpcEndpointType: Interface   # defaults to Gateway when omitted
  ServiceName: com.amazonaws.us-east-1.ssm
  VpcId: !Ref VpcId
  SubnetIds: [!Ref PrivateSubnet1, !Ref PrivateSubnet2]
  SecurityGroupIds: [!Ref EndpointSG]
  PrivateDnsEnabled: true
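Before applying a VPC layout like the one above, it can help to check that every subnet sits inside the VPC CIDR and that no two subnets overlap. This is a minimal sketch using Python's standard `ipaddress` module; the function name `validate_subnets` is an assumption made for illustration.

```python
import ipaddress


def validate_subnets(vpc_cidr: str, subnet_cidrs: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside {vpc}")
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


# CIDRs taken from the Terraform module example in this section.
plan = ["10.0.1.0/24", "10.0.2.0/24", "10.0.101.0/24", "10.0.102.0/24"]
print(validate_subnets("10.0.0.0/16", plan))
```

A check like this fits naturally into a CI step that runs before `terraform plan`.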

Security Baseline (KMS, Secrets, GuardDuty, WAF)

resource "aws_kms_key" "default" { enable_key_rotation = true }
resource "aws_secretsmanager_secret" "app" { name = "app/db" }
resource "aws_guardduty_detector" "main" { enable = true }
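resource "aws_wafv2_web_acl" "api" {
  name  = "api-waf"
  scope = "REGIONAL"
  default_action {
    allow {}
  }
  visibility_config { # required by the provider
    cloudwatch_metrics_enabled = true
    metric_name                = "api-waf"
    sampled_requests_enabled   = true
  }
}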
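# VPC CNI network policy (self-managed add-on); the aws-node agent must also run with network policy enabled
apiVersion: v1
kind: ConfigMap
metadata: { name: amazon-vpc-cni, namespace: kube-system }
data:
  enable-network-policy-controller: "true"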

Reference Architectures

Serverless Web + API

graph LR
CF[CloudFront] --> APIGW[API Gateway]
APIGW --> Lambda
Lambda --> Dynamo[DynamoDB]
S3[S3 Static Site] --> CF
# SAM/CloudFormation snippet for Lambda + API

Containers on ECS/Fargate

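resource "aws_ecs_cluster" "main" { name = "apps" }
resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.web.arn # task definition defined elsewhere
  launch_type     = "FARGATE"
  desired_count   = 3
  network_configuration { # Fargate tasks use awsvpc networking
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.web.id] # hypothetical security group
  }
}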

EKS + Ingress + ALB

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb   # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: web, port: { number: 80 } } }

Data Lake (S3 + Glue + Athena + Lake Formation)

resource "aws_s3_bucket" "lake" { bucket = "company-lake" }
resource "aws_glue_catalog_database" "db" { name = "lake_db" }

Event-Driven (EventBridge / SQS / SNS)

Type: AWS::Events::Rule
Properties:
  EventPattern: { source: ["app.orders"] }
  Targets: [{ Arn: !GetAtt Queue.Arn, Id: q1 }]

Streaming (Kinesis / MSK)

resource "aws_kinesis_stream" "events" {
  name        = "events"
  shard_count = 2
}

Web Apps (ALB/NLB + ASG)

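resource "aws_autoscaling_group" "web" {
  desired_capacity    = 4
  max_size            = 12
  min_size            = 2
  vpc_zone_identifier = module.vpc.private_subnets
  launch_template {
    id      = aws_launch_template.web.id # launch template defined elsewhere
    version = "$Latest"
  }
}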

IaC: CloudFormation, CDK, Terraform

// CDK
new s3.Bucket(this, 'Assets', { encryption: s3.BucketEncryption.S3_MANAGED })
# Terraform module invocation
module "alb" {
  source = "terraform-aws-modules/alb/aws"
  name   = "web"
}

CI/CD (CodePipeline, GitHub Actions)

name: infra-ci
on: [push]
jobs:
  tf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform validate && terraform plan
# CodePipeline (YAML-like pseudo)
stages:
  - source: github
  - build: codebuild
  - deploy: cloudformation

Observability (CloudWatch / OTEL)

{
  "widgets": [
    {"type": "metric", "properties": { "metrics": [["AWS/ApplicationELB","HTTPCode_Target_5XX_Count","LoadBalancer","alb" ]], "stat": "Sum", "period": 300 }}
  ]
}
receivers:
  otlp: { protocols: { http: {}, grpc: {} } }
exporters:
  awsemf: { namespace: "EKS/Apps" }
service:
  pipelines:
    metrics: { receivers: [otlp], exporters: [awsemf] }

Backup and DR

# AWS Backup plan
plan:
  rules:
    - name: daily-backup
      target_vault_name: default
      schedule_expression: cron(0 5 * * ? *)
      lifecycle: { delete_after_days: 30 }
DR Strategy
- RPO: 15m; RTO: 1h
- Cross-region replicas (RDS/DynamoDB/S3)
- Route 53 failover health checks

Cost Optimization

- Rightsize with Compute Optimizer; adopt Graviton
- Savings Plans/Reserved Instances; Spot for flexible workloads
- S3 lifecycle: IA/Glacier tiers; compress objects
service,current_usd_month,optimized_usd_month,delta
EC2,12000,9000,-3000
RDS,7000,5900,-1100
S3,1800,1200,-600
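The table above can be totalled with a few lines of Python to report overall savings. This sketch embeds the same rows; the function name `summarize` is an assumption for illustration.

```python
import csv
import io

COST_CSV = """service,current_usd_month,optimized_usd_month,delta
EC2,12000,9000,-3000
RDS,7000,5900,-1100
S3,1800,1200,-600
"""


def summarize(csv_text: str) -> dict:
    """Total the current and optimized columns and derive absolute and relative savings."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    current = sum(int(row["current_usd_month"]) for row in rows)
    optimized = sum(int(row["optimized_usd_month"]) for row in rows)
    savings = current - optimized
    return {
        "current": current,
        "optimized": optimized,
        "savings": savings,
        "savings_pct": round(100 * savings / current, 1),
    }


print(summarize(COST_CSV))
# {'current': 20800, 'optimized': 16100, 'savings': 4700, 'savings_pct': 22.6}
```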

Performance Tuning

Aurora
- Reader endpoints; parallel query; serverless v2 for spiky

DynamoDB
- On-demand for unknown; provisioned + auto scaling for steady; GSIs

ElastiCache
- Lazy loading; TTLs; cluster mode
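Lazy loading with TTLs can be sketched without a Redis client. In the example below a dictionary stands in for ElastiCache, and `LazyCache`, `loader`, and `clock` are hypothetical names chosen for illustration.

```python
import time


class LazyCache:
    """Cache-aside: load on a miss, serve from cache until the TTL expires."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # called on a miss, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]          # hit: entry is still fresh
        value = self._loader(key)    # miss or expired: go to the source of truth
        self._entries[key] = (value, now + self._ttl)
        return value


cache = LazyCache(loader=lambda key: f"row-for-{key}", ttl_seconds=60)
print(cache.get("user:1"))  # loaded from the source
print(cache.get("user:1"))  # served from the cache
```

Short TTLs bound staleness at the cost of more misses; the right value depends on how quickly the underlying data changes.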

Sustainability Practices

- Prefer managed and serverless services
- Decommission idle/dev nightly
- Optimize storage tiers and data retention

Deployments: Blue/Green/Canary

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    blueGreen:
      activeService: web
      previewService: web-preview
      autoPromotionEnabled: false

Runbooks and SOPs

ALB 5xx Surge
- Check recent deploys; roll back if necessary
- Increase ASG desired; inspect target health

RDS Connection Saturation
- Add reader; optimize pool settings; increase instance size temporarily

S3 403s
- Check bucket policy and IAM changes; AWS Config timeline


Call to Action

Need help designing or reviewing your AWS architecture? We build secure, cost-efficient, and resilient platforms aligned to Well-Architected.


Extended FAQ

  1. Single vs multi-account?
    Multi-account with Control Tower for isolation and guardrails.

  2. Cross-region replication?
    Enable for critical data (S3 CRR, DynamoDB global tables, Aurora global DB).

  3. Public vs private subnets?
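    Put only load balancers (ALB/NLB) and NAT gateways in public subnets; keep applications and data stores in private subnets, with outbound access through NAT or VPC endpoints.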

  4. ALB vs NLB?
    ALB for HTTP routing; NLB for TCP/UDP and extreme throughput.

  5. ASG scaling policy?
    Target tracking on CPU/requests; cooldowns.
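Target tracking can be reasoned about as proportional scaling: capacity moves so that the metric returns to its target. The sketch below is a simplified approximation, not the exact algorithm AWS uses, and the function name and numbers are assumptions.

```python
import math


def desired_capacity(current_capacity: int, metric_value: float, target_value: float,
                     min_size: int, max_size: int) -> int:
    """Scale capacity in proportion to metric/target, clamped to the group's bounds."""
    proposed = math.ceil(current_capacity * metric_value / target_value)
    return max(min_size, min(max_size, proposed))


# 4 instances at 75% average CPU with a 50% target suggests 6 instances.
print(desired_capacity(4, 75, 50, min_size=2, max_size=12))
```

Cooldowns and instance warm-up exist to stop this calculation from reacting to metrics that have not yet settled after the previous scaling action.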



Identity and IAM Patterns

{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "DenyAllWithoutMFA", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } } }
  ]
}
{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "RequirePermissionsBoundary", "Effect": "Deny", "Action": ["iam:CreateRole", "iam:CreateUser"], "Resource": "*", "Condition": { "StringNotLike": { "iam:PermissionsBoundary": "arn:aws:iam::*:policy/*-boundary" } } }
  ]
}
permission_sets:
  - name: admin
    policies: [AdministratorAccess]
  - name: read_only
    policies: [ReadOnlyAccess]

S3 Security

{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "DenyInsecureTransport", "Effect": "Deny", "Principal": "*", "Action": "s3:*", "Resource": ["arn:aws:s3:::bucket","arn:aws:s3:::bucket/*"], "Condition": { "Bool": { "aws:SecureTransport": "false" } } }
  ]
}
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
# S3 Access Point policy (restrict VPC access)

RDS/Aurora HA and Failover

resource "aws_rds_cluster" "aurora" {
  engine               = "aurora-postgresql"
  master_username      = "app"
  master_password      = random_password.db.result
  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"
}

resource "aws_rds_cluster_instance" "aurora_instances" {
  count                = 2
  cluster_identifier   = aws_rds_cluster.aurora.id
  instance_class       = "db.r6g.large"
  engine               = aws_rds_cluster.aurora.engine
  publicly_accessible  = false
}
- Use reader endpoints for read scaling
- Enable Performance Insights
- Configure failover priority for multi-AZ

DynamoDB Design

- Choose partition key with high cardinality
- Use sort keys for range queries
- GSIs for alternate access patterns
- TTL for item expiry
- Streams for change data capture (Lambda/Kinesis)
resource "aws_dynamodb_table" "orders" {
  name         = "orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "orderId"
  attribute {
    name = "orderId"
    type = "S"
  }
  ttl {
    attribute_name = "ttl"
    enabled        = true
  }
  stream_enabled = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
}
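When the natural partition key has low cardinality (a date, a tenant with heavy traffic), one common technique is write sharding: append a calculated suffix so writes spread over several partitions. The sketch below is illustrative; `sharded_key`, `all_shard_keys`, and the shard count of 10 are assumptions.

```python
import hashlib


def sharded_key(base_key: str, item_id: str, shard_count: int = 10) -> str:
    """Derive a stable shard suffix from the item ID so the same item always maps to the same key."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key: str, shard_count: int = 10) -> list[str]:
    """Readers query every shard key and merge the results."""
    return [f"{base_key}#{shard}" for shard in range(shard_count)]


print(sharded_key("2025-10-26", "order-1001"))
print(all_shard_keys("2025-10-26", shard_count=3))
```

The trade-off is read fan-out: a query for one logical key becomes one query per shard, so the shard count should stay as small as the write rate allows.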

Messaging (SNS/SQS)

resource "aws_sqs_queue" "events" {
  name                      = "events"
  redrive_policy            = jsonencode({ deadLetterTargetArn = aws_sqs_queue.dlq.arn, maxReceiveCount = 5 })
  fifo_queue                = false
}

resource "aws_sns_topic" "notifications" { name = "notifications" }
resource "aws_sns_topic_subscription" "sub" {
  topic_arn = aws_sns_topic.notifications.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.events.arn
}

Streaming (Kinesis / MSK)

resource "aws_kinesis_stream" "clicks" {
  name             = "clicks"
  shard_count      = 2
  retention_period = 48
}
- For MSK, prefer IAM auth or mTLS; isolate in private subnets

ECS/EKS Security

# ECS task role
resource "aws_iam_role" "task" { assume_role_policy = data.aws_iam_policy_document.ecs_assume.json }
# EKS IRSA (IAM Roles for Service Accounts)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123:role/s3-reader
# NetworkPolicy (Calico/Cilium)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
  podSelector: { matchLabels: { app: web } }
  policyTypes: [Ingress, Egress]
  ingress: [{ from: [{ podSelector: { matchLabels: { app: api } } }] }]
  egress: []

Lambda Patterns

DeadLetterConfig:
  TargetArn: arn:aws:sqs:us-east-1:123:lambda-dlq
RetryPolicy:
  MaximumRetryAttempts: 2
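_seen = set()  # illustrative only; use a durable store (e.g. a DynamoDB conditional put) in production

def handler(event, context):
    idempotency_key = event.get("requestId")
    if idempotency_key in _seen:
        return {"status": "duplicate"}  # already handled; skip side effects
    _seen.add(idempotency_key)
    # process the event here
    return {"status": "processed"}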
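A fuller idempotency sketch claims the key atomically, stores the result, and releases the key if processing fails so a retry can succeed. The in-memory `IdempotencyStore` below stands in for a table with conditional writes; every name in it is hypothetical.

```python
class IdempotencyStore:
    """In-memory stand-in for a durable store that supports put-if-absent."""

    def __init__(self):
        self._records = {}

    def put_if_absent(self, key, record) -> bool:
        if key in self._records:
            return False
        self._records[key] = record
        return True

    def get(self, key):
        return self._records.get(key)

    def update(self, key, record):
        self._records[key] = record

    def delete(self, key):
        self._records.pop(key, None)


def make_handler(store, process):
    def handler(event, context=None):
        key = event["requestId"]
        if not store.put_if_absent(key, {"status": "IN_PROGRESS"}):
            return store.get(key)          # duplicate delivery: return the stored record
        try:
            result = process(event)
        except Exception:
            store.delete(key)              # release the claim so a retry can run
            raise
        record = {"status": "COMPLETED", "result": result}
        store.update(key, record)
        return record
    return handler


handler = make_handler(IdempotencyStore(), lambda event: f"charged {event['amount']}")
print(handler({"requestId": "r-1", "amount": 10}))
print(handler({"requestId": "r-1", "amount": 10}))  # same record, no second charge
```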

CloudFront and WAF

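resource "aws_cloudfront_distribution" "cdn" {
  enabled = true
  # origin block omitted for brevity; at least one origin is required
  default_cache_behavior {
    target_origin_id       = "app" # must match the origin_id of the omitted origin
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]
  }
  restrictions {
    geo_restriction { restriction_type = "none" }
  }
  viewer_certificate { cloudfront_default_certificate = true }
}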

Route 53 Patterns

resource "aws_route53_record" "weighted" {
  zone_id        = var.zone_id # hypothetical variable
  name           = "api"
  type           = "A"
  set_identifier = "blue"
  weighted_routing_policy { weight = 80 }
  # ttl + records (or an alias block) omitted for brevity
}
resource "aws_route53_health_check" "api" {
  type          = "HTTPS"
  resource_path = "/health"
  fqdn          = "api.company.com"
}

Multi-Region Patterns

Active/Passive
- Primary region handles writes; replicate to secondary
- Route 53 failover on health check

Active/Active
- Global tables (DynamoDB), Aurora Global Database
- Conflict resolution strategy
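For active/active writes, the simplest conflict resolution strategy is last-writer-wins, which is also how DynamoDB global tables reconcile concurrent updates. The sketch below is a minimal illustration; the `Version` type and the tie-break on region name are assumptions, not a service API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int
    region: str


def resolve(a: Version, b: Version) -> Version:
    """Pick the later write; break exact timestamp ties deterministically by region name."""
    return max(a, b, key=lambda version: (version.timestamp_ms, version.region))


east = Version("shipped", timestamp_ms=1_700_000_000_500, region="us-east-1")
west = Version("cancelled", timestamp_ms=1_700_000_000_200, region="us-west-2")
print(resolve(east, west).value)  # shipped
```

Last-writer-wins silently discards the losing update, so it suits data where the latest value is what matters; counters and multi-step workflows usually need a merge function or a single-writer region instead.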

CDK Examples

import * as s3 from 'aws-cdk-lib/aws-s3'
const bucket = new s3.Bucket(this, 'Assets', { blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL, encryption: s3.BucketEncryption.S3_MANAGED })

CloudWatch Alarms and Dashboards

resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
  alarm_name          = "alb-5xx"
  namespace           = "AWS/ApplicationELB"
  metric_name         = "HTTPCode_Target_5XX_Count"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 5
  threshold           = 100
  comparison_operator = "GreaterThanThreshold"
}
{ "widgets": [ { "type": "metric", "properties": { "metrics": [["AWS/ECS","CPUUtilization","ClusterName","apps"]] } } ] }

OTEL on EKS/ECS

receivers:
  otlp: { protocols: { grpc: {}, http: {} } }
exporters:
  awsemf: { namespace: EKS/Apps }
  awsxray: {}
service:
  pipelines:
    traces: { receivers: [otlp], exporters: [awsxray] }
    metrics: { receivers: [otlp], exporters: [awsemf] }

Backup/Restore Scripts

aws rds create-db-snapshot --db-instance-identifier app-db --db-snapshot-identifier app-db-$(date +%F)
aws s3 sync s3://bucket s3://bucket-backup --storage-class GLACIER

DR Playbooks

RDS Primary Down
- Promote read replica; re-point endpoints; increase capacity

EKS Cluster Impaired
- Spin secondary cluster via IaC; restore stateful workloads from backups

Cost Dashboards (CUR + Athena)

SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE bill_billing_period_start_date >= date_trunc('month', current_date)
GROUP BY 1 ORDER BY 2 DESC;

Compute Optimizer and Savings Plans

- Review rightsizing recommendations weekly
- Target 60–80% coverage with Savings Plans; avoid over-commit
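Coverage and utilization pull in opposite directions, which is why over-commitment is the main risk. The sketch below is a simplified model under stated assumptions: hourly spend and the commitment are both expressed in on-demand-equivalent dollars, discounts are ignored, and the function name is hypothetical.

```python
def savings_plan_metrics(hourly_spend: list[float], hourly_commitment: float) -> dict:
    """Coverage: share of spend covered by the commitment. Utilization: share of the commitment used."""
    covered = sum(min(spend, hourly_commitment) for spend in hourly_spend)
    return {
        "coverage": covered / sum(hourly_spend),
        "utilization": covered / (hourly_commitment * len(hourly_spend)),
    }


hours = [10.0, 12.0, 8.0, 20.0]          # hypothetical eligible spend per hour
print(savings_plan_metrics(hours, 8.0))   # commit to the baseline: full utilization
print(savings_plan_metrics(hours, 15.0))  # higher coverage, but part of the commitment is wasted
```

Committing near the steady baseline keeps utilization close to 100%; the spiky remainder is better served by on-demand or Spot capacity.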

Sustainability Dashboards

- Track idle resources; emissions estimates (third-party tooling)
- Prefer Graviton and managed services

Detailed Runbooks

NAT Gateway Cost Spike
- Check egress and VPC endpoints; add Interface/Gateway endpoints; reduce cross-AZ traffic

S3 5xx Increase
- Verify bucket throttling, retry policy; consider request rate patterns

Extended FAQ (continued)

  1. When to use Transit Gateway vs peering?
    TGW for hub-spoke many VPCs; peering for few.

  2. PrivateLink vs VPC peering?
    PrivateLink for services; peering for full routing.

  3. IAM least privilege?
    Boundary policies and access analyzer.

  4. KMS key policy tips?
    Use key admins and user separation.

  5. RDS Multi-AZ vs read replica?
    Multi-AZ for HA; replicas for reads.

  6. Aurora Serverless v2?
    Great for spiky workloads.

  7. DynamoDB hot partitions?
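    Use high-cardinality partition keys, add a random or calculated suffix to spread writes (write sharding), and revisit access patterns.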

  8. SQS FIFO?
    Use for ordered processing; throughput limits.

  9. Kinesis vs MSK?
    Kinesis managed; MSK for Kafka compatibility.

  10. ECS vs EKS?
    ECS simpler; EKS for Kubernetes ecosystems.

  11. IRSA?
    IAM roles bound to service accounts.

  12. Lambda cold starts?
    Provisioned concurrency.

  13. CloudFront compression?
    Enable; cache policies.

  14. Route 53 latency routing?
    Direct users to nearest region.

  15. Global Aurora vs DynamoDB global tables?
    DB engine vs NoSQL trade-offs.

  16. CDK vs Terraform?
    Org standards; both fine.

  17. CloudWatch cost?
    Metric filters; log retention.

  18. X-Ray sampling?
    Tailor for cost; error-based full sampling.

  19. Backup vault locks?
    Enable for immutability.

  20. Savings Plans vs RIs?
    SPs more flexible.

  21. Spot interruptions?
    Handle with checkpoints and graceful drain.

  22. Graviton readiness?
    Rebuild images; benchmark.

  23. S3 strong consistency?
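    Yes. S3 provides strong read-after-write consistency for PUT, DELETE, and LIST operations, so read-your-writes workarounds are no longer needed.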

  24. ALB target groups?
    Separate health checks per service.

  25. NLB TLS termination?
    Use a TLS listener to terminate TLS at the NLB, or a TCP listener to pass TLS through to the targets.

  26. WAF managed rules?
    Enable; tune exceptions.

  27. Inspector coverage?
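    Inspector v2 scans EC2 through the SSM Agent and container images through ECR; confirm instances are SSM-managed and image scanning is enabled.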

  28. Config rules drift?
    Remediate with SSM Automation.

  29. EKS upgrades?
    Blue/green nodes; surge upgrades.

  30. Fargate vs EC2 on ECS?
    Fargate for ops simplicity.

  31. CloudFormation stack sets?
    Multi-account deployments.

  32. S3 inventory?
    Enable and scan for public objects.

  33. Lambda DLQ?
    SQS or SNS; alert on growth.

  34. Glue ETL costs?
    DPU tuning; job bookmarks.

  35. Athena performance?
    Partition and compress; CTAS.

  36. ECR scanning?
    Enable enhanced scanning.

  37. VPC Lattice?
    Consider for service-to-service across VPCs.

  38. Security Hub standards?
    Enable CIS/FSBP; fix criticals.

  39. Final readiness?
    SLOs healthy; costs tracked; DR tested.


CloudTrail, AWS Config, and Security Hub (Organization)

resource "aws_cloudtrail" "org" {
  name                          = "org-trail"
  s3_bucket_name                = aws_s3_bucket.trail.id
  include_global_service_events = true
  is_multi_region_trail         = true
  is_organization_trail         = true
}

resource "aws_config_configuration_recorder" "rec" {
  name     = "default"
  role_arn = aws_iam_role.config.arn
  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}
resource "aws_config_delivery_channel" "chan" { s3_bucket_name = aws_s3_bucket.config.id }

resource "aws_securityhub_account" "hub" { enable_default_standards = true }
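resource "aws_securityhub_standards_subscription" "fsbp" {
  # FSBP ARNs are regional; adjust the region to match the deployment
  standards_arn = "arn:aws:securityhub:us-east-1::standards/aws-foundational-security-best-practices/v/1.0.0"
  depends_on    = [aws_securityhub_account.hub]
}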

KMS Key Policies and Grants

{
  "Version": "2012-10-17",
  "Statement": [
    {"Sid": "KeyAdmins","Effect":"Allow","Principal":{"AWS":["arn:aws:iam::123:role/kms-admins"]},"Action":["kms:*"],"Resource":"*"},
    {"Sid": "UseKey","Effect":"Allow","Principal":{"AWS":["arn:aws:iam::123:role/app-ec2","arn:aws:iam::123:role/app-lambda"]},"Action":["kms:Encrypt","kms:Decrypt","kms:GenerateDataKey*"],"Resource":"*"}
  ]
}
resource "aws_kms_grant" "ec2" {
  key_id            = aws_kms_key.app.key_id
  grantee_principal = aws_iam_role.app_ec2.arn
  operations        = ["Encrypt", "Decrypt", "GenerateDataKey"]
}

SSM Parameter Store and Secrets Manager Patterns

import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm'
const ssm = new SSMClient({})
const param = await ssm.send(new GetParameterCommand({ Name: "/app/db/uri", WithDecryption: true }))
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager'
const sm = new SecretsManagerClient({})
const secret = await sm.send(new GetSecretValueCommand({ SecretId: "app/db" }))

Lake Formation Permissions and Row-Level Security

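-- Pseudo-SQL for illustration: Lake Formation grants are issued through the console, CLI, or GrantPermissions API, and row-level rules are defined as data filters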
GRANT SELECT ON TABLE lake_db.sales TO ROLE 'analyst';
GRANT SELECT(filters: region IN ('us','eu')) ON TABLE lake_db.sales TO ROLE 'regional_analyst';
- Use LF-Tags for tag-based access control
- Register S3 locations; grant data location permissions

Glue Jobs and Workflows

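# Glue ETL job (PySpark)
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# read from S3 → transform → write partitioned Parquet
job.commit()  # persists job bookmark state
# Glue Workflow (YAML-like pseudo)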
workflow:
  triggers: [ crawl_raw, etl_sales, update_catalog ]

EMR on EKS

apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: VirtualCluster
metadata: { name: emr-eks, namespace: emr }
spec: { containerProvider: { id: eks-cluster, info: { eksInfo: { namespace: spark } } } }

SageMaker Patterns

# Batch Transform
from sagemaker.transformer import Transformer
tr = Transformer(model_name='nlp', instance_count=2, instance_type='ml.m5.xlarge')
tr.transform('s3://input', content_type='text/csv', split_type='Line')
# Real-time endpoint (multi-model)
from sagemaker.multidatamodel import MultiDataModel
mdm = MultiDataModel(name='nlp-mmm', model_data_prefix='s3://models/')
mdm.deploy(initial_instance_count=2, instance_type='ml.c5.large')

API Gateway Patterns

Type: AWS::ApiGateway::Authorizer
Properties:
  Name: jwt
  Type: COGNITO_USER_POOLS
  ProviderARNs: [ arn:aws:cognito-idp:...:userpool/... ]
Type: AWS::WAFv2::WebACLAssociation
Properties: { ResourceArn: !Ref ApiGatewayStageArn, WebACLArn: !Ref WebACLArn }
# Stage variables for blue/green
Variables: { COLOR: blue }

Step Functions (Sagas and Retries)

{
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Retry": [{"ErrorEquals":["States.Timeout"],"MaxAttempts":3,"BackoffRate":2}], "Next":"ChargePayment" },
    "ChargePayment": { "Type": "Task", "Resource": "...", "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateInventory"}], "Next":"Ship" },
    "CompensateInventory": { "Type": "Task", "Resource": "...", "End": true },
    "Ship": { "Type": "Task", "Resource": "...", "End": true }
  }
}
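The saga above follows a general shape: run steps in order and, on failure, undo the completed ones in reverse. The sketch below shows that control flow in plain Python; `run_saga` and the step functions are hypothetical and do not call Step Functions.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation); compensation may be None."""
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            # Undo already-completed steps, most recent first.
            for _, undo in reversed(completed):
                if undo is not None:
                    undo()
            return {"status": "compensated", "failed_step": name, "error": str(exc)}
        completed.append((name, compensation))
    return {"status": "completed"}


log = []

def charge_payment():
    raise RuntimeError("card declined")

result = run_saga([
    ("ReserveInventory", lambda: log.append("reserve"), lambda: log.append("release")),
    ("ChargePayment", charge_payment, None),
    ("Ship", lambda: log.append("ship"), None),
])
print(result["status"], log)  # compensated ['reserve', 'release']
```

Compensations should themselves be idempotent, because an orchestrator may retry them after a partial failure.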

EventBridge Scheduler

resource "aws_scheduler_schedule" "nightly" {
  name                = "nightly"
  schedule_expression = "cron(0 3 * * ? *)"
  flexible_time_window { # required by the provider
    mode = "OFF"
  }
  target {
    arn      = aws_lambda_function.job.arn
    role_arn = aws_iam_role.scheduler.arn
  }
}

ALB Advanced Routing

Rules:
  - Conditions: [ { HostHeader: { Values: ["blue.app.com"] } } ]
    Actions: [ { Type: forward, TargetGroupArn: arn:tg:blue } ]
  - Conditions: [ { PathPattern: { Values: ["/api/*"] } } ]
    Actions: [ { Type: forward, TargetGroupArn: arn:tg:api } ]

EBS/EFS Performance

- EBS: gp3 with provisioned throughput for IO intensive
- EFS: lifecycle policies; access points; performance mode

Aurora/DynamoDB Advanced

Aurora
- Reader lag alarms; failover testing; query plan cache

DynamoDB
- Adaptive capacity; DAX for read latency; streams → Kinesis

ElastiCache Replication/Sharding

resource "aws_elasticache_replication_group" "cache" {
  replication_group_id       = "cache"
  description                = "Sharded Redis with one replica per shard"
  engine                     = "redis"
  node_type                  = "cache.r6g.large"
  num_node_groups            = 2
  replicas_per_node_group    = 1
  automatic_failover_enabled = true
}

CloudFront Cache Policies

{ "ParametersInCacheKeyAndForwardedToOrigin": { "CookiesConfig": { "CookieBehavior": "none" }, "HeadersConfig": { "HeaderBehavior": "none" }, "QueryStringsConfig": { "QueryStringBehavior": "whitelist", "QueryStrings": { "Items": ["q"] } } } }

GuardDuty Findings and Auto-Remediation

# EventBridge rule on GuardDuty finding → SSM Automation document
# SSM Automation: isolate instance (detach from ELB, add SG)

Incident Manager

resource "aws_ssmincidents_response_plan" "p1" {
  name = "p1-api"
  incident_template {
    title  = "P1 API outage"
    impact = 1
  }
}

CUR Athena Setup and Dashboards

CREATE EXTERNAL TABLE IF NOT EXISTS cur (...)
PARTITIONED BY (bill_billing_period_start_date date)
STORED AS PARQUET LOCATION 's3://cur-bucket/cur/';

SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE bill_billing_period_start_date >= date_trunc('month', current_date)
GROUP BY 1 ORDER BY 2 DESC;

Sustainability Actions

- Tag resources by environment; nightly shutdown for nonprod
- Use Graviton and managed serverless
- Optimize data storage/retention and caching

Additional Runbooks

DynamoDB Throttling
- Increase RCUs/WCUs; enable auto scaling; batch writes; add backoff logic
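The backoff logic mentioned above is usually exponential backoff with jitter, so throttled clients do not retry in lockstep. This is a minimal sketch of the "full jitter" variant; the function name, defaults, and the `is_retryable` callback are assumptions, and AWS SDKs already ship their own retry modes.

```python
import random
import time


def retry_with_backoff(operation, is_retryable, max_attempts=5, base=0.1, cap=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call operation(); on a retryable error wait rng() * min(cap, base * 2**attempt) and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            sleep(rng() * min(cap, base * 2 ** attempt))


attempts = {"count": 0}

def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("throttled")
    return "ok"

print(retry_with_backoff(flaky_write, lambda exc: isinstance(exc, TimeoutError)))
```

Retries are only safe when the operation is idempotent, which is why backoff and idempotency keys are normally introduced together.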

ECS Task Failing Health Checks
- Check app logs; target group health; CPU/memory; container limits

Extended FAQ (continued)

  1. Glue vs EMR?
    Glue for serverless ETL; EMR for custom Hadoop/Spark.

  2. Athena partitioning?
    Partition by date and low-cardinality dimensions that queries filter on; avoid high-cardinality keys, which produce many small partitions.

  3. Kinesis scaling?
    Increase shards or switch to on-demand capacity mode; use enhanced fan-out when several consumers read the same stream.

  4. MSK security?
    Private subnets; IAM auth or mTLS.

  5. PrivateLink limits?
    Per AZ; cost considerations.

  6. NAT costs?
    Use endpoints; consolidate egress.

  7. S3 requester pays?
    Consider for shared public datasets.

  8. SQS long polling?
    Enable to reduce costs.

  9. Lambda concurrency?
    Set reserved; protect downstreams.

  10. ALB WAF?
    Associate regional WAF; tune rules.

  11. Cross-account access?
    Resource policies; IAM roles.

  12. Lake Formation tags?
    Tag datasets by sensitivity; grant by LF-Tag.

  13. EKS security?
    IRSA, Pod Security Admission (PSA), NetworkPolicy, EDR.

  14. ECS secrets?
    SSM/Secrets Manager with task role.

  15. RDS connections?
    Proxy or pooling; tune timeouts.

  16. Aurora storage scaling?
    Storage grows automatically, up to 128 TiB per cluster.

  17. DAX cache?
    Use for read-heavy with strict latency.

  18. CloudFront invalidations?
    Automate on deploy; cache policies.

  19. GuardDuty triage?
    Severity; automation; isolate.

  20. Incident Manager?
    On-call escalation; runbooks.

  21. X-Ray sampling?
    Tailored to error regions.

  22. EKS upgrades?
    Use surge; drain; test.

  23. S3 access points?
    Scoped access per app.

  24. Glue bookmarks?
    Incremental ETL runs.

  25. K8s CNI choice?
    Cilium/Calico; security features.

  26. Terraform drift?
    Detect and reconcile; tags.

  27. Budgets alarms?
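    Notify through email or SNS; use budget actions to apply a restrictive IAM policy or SCP when a threshold is crossed.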

  28. Savings Plans strategy?
    Cover base; avoid over-commit.

  29. Spot best practices?
    Diversify instance types; capacity-optimized.

  30. Final acceptance?
    SLOs, DR tested, costs tracked, security baseline.


Control Tower Guardrails (SCP Examples)

{
  "Version": "2012-10-17",
  "Statement": [
    { "Sid": "DenyUnapprovedRegions", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": { "StringNotEquals": { "aws:RequestedRegion": ["us-east-1","us-west-2","eu-west-1"] } } },
    { "Sid": "DenyWithoutTLS", "Effect": "Deny", "Action": "s3:*", "Resource": "*", "Condition": { "Bool": { "aws:SecureTransport": "false" } } }
  ]
}

IAM Access Analyzer

resource "aws_accessanalyzer_analyzer" "org" {
  analyzer_name = "org-access"
  type          = "ORGANIZATION"
}
- Review external access findings weekly; auto-remediate with SSM Automation where safe

SSM Patch Manager

resource "aws_ssm_patch_baseline" "linux" {
  name            = "linux-baseline"
  approved_patches_compliance_level = "CRITICAL"
  operating_system = "AMAZON_LINUX_2"
}

resource "aws_ssm_maintenance_window" "patch" {
  name     = "patch-tuesday"
  schedule = "cron(0 3 ? * TUE *)"
  duration = 3
  cutoff   = 1
}

CloudFront Functions / Lambda@Edge

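// CloudFront Function (viewer request): add a header before forwarding to the origin
function handler(event) {
  var req = event.request
  req.headers['x-forwarded-proto'] = { value: 'https' }
  return req
}
// Lambda@Edge (origin response): add a security header
exports.handler = async (event) => {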
  const resp = event.Records[0].cf.response
  resp.headers['strict-transport-security'] = [{ key: 'Strict-Transport-Security', value: 'max-age=31536000; includeSubDomains' }]
  return resp
}

S3 Replication Policies

{
  "Role": "arn:aws:iam::123:role/s3-replication",
  "Rules": [
    { "ID": "cross-region", "Status": "Enabled", "Priority": 1, "Filter": { "Prefix": "" }, "DeleteMarkerReplication": {"Status": "Enabled"}, "Destination": { "Bucket": "arn:aws:s3:::bucket-dr", "StorageClass": "STANDARD_IA" } }
  ]
}

EKS Add-ons

# CoreDNS values (increase replicas)
replicaCount: 3
# VPC CNI config (ipam)
WARM_ENI_TARGET: 1
WARM_IP_TARGET: 5
# EBS CSI driver StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: gp3 } 
provisioner: ebs.csi.aws.com
parameters: { type: gp3, encrypted: "true" }

WAF Managed Rules Tuning

rule: AWS-AWSManagedRulesCommonRuleSet
exclusions:
  - name: GenericLFI_BODY

Inspector Automation

# EventBridge rule -> SSM Automation to quarantine EC2 with critical finding

Systems Manager Automation Runbooks

schemaVersion: '0.3'
description: Isolate instance from ALB
documentType: Automation
mainSteps:
  - name: DetachFromTargetGroup
    action: 'aws:executeAwsApi'
    inputs: { Service: elbv2, Api: DeregisterTargets, TargetGroupArn: '{{tg}}', Targets: [ { Id: '{{instanceId}}' } ] }

EventBridge Buses and Rules

resource "aws_cloudwatch_event_bus" "apps" { name = "apps-bus" }
resource "aws_cloudwatch_event_rule" "order_created" {
  event_bus_name = aws_cloudwatch_event_bus.apps.name
  event_pattern = jsonencode({
    source        = ["app.orders"]
    "detail-type" = ["order-created"]
  })
}

Data Transfer Optimization (Gateway Endpoints)

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}

VPC Lattice

resource "aws_vpclattice_service_network" "sn" { name = "apps" }
resource "aws_vpclattice_service" "svc" {
  name      = "orders"
  auth_type = "AWS_IAM"
}

Additional Dashboards and Alarms

{ "widgets": [ { "type":"metric", "properties": { "metrics": [["AWS/RDS","CPUUtilization","DBClusterIdentifier","aurora"]], "stat":"Average","period":60 } } ] }
resource "aws_cloudwatch_metric_alarm" "dynamo_throttle" {
  alarm_name          = "dynamo-throttle"
  namespace           = "AWS/DynamoDB"
  metric_name         = "ThrottledRequests"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 10
  evaluation_periods  = 5
  period              = 60
  statistic           = "Sum"
}

Extended FAQ (continued)

  1. Control Tower guardrails?
    SCPs to deny risky actions; region controls.

  2. Access Analyzer cadence?
    Weekly review and auto-remediation.

  3. Patch Manager baselines?
    Per OS; maintenance windows.

  4. CloudFront Functions vs Lambda@Edge?
    Functions for light header logic; Edge for heavy.

  5. S3 CRR costs?
    Per GB + requests; replicate only needed.

  6. EKS CNI tuning?
    Warm IP targets to reduce latency.

  7. WAF tuning?
    Managed rules with exclusions.

  8. Inspector alerts?
    Hook to Incident Manager.

  9. SSM Automation?
    Standardize fixes.

  10. Event buses?
    Separate per domain.

  11. Gateway endpoints?
    Reduce NAT costs.

  12. Lattice use?
    Service-to-service across VPCs.

  13. RDS alarms?
    CPU, connections, free storage.

  14. DynamoDB throttles?
    Auto scaling and adaptive capacity.

  15. Final readiness?
    Security, cost, DR, and monitoring green.
