AWS Architecture Patterns: Well-Architected in Practice (2025)
Use AWS primitives to build resilient, secure, and cost-efficient systems. This guide provides practical patterns mapped to the Well-Architected pillars.
Executive summary
- Decouple with queues/streams; apply least privilege and guardrails
- Design for failure: multi-AZ, retries with jitter, idempotency
- Measure: SLOs, error budgets, cost per request
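The "retries with jitter" item above can be sketched as capped exponential backoff with full jitter. This is a minimal illustration, not a prescribed implementation: the names `backoff_delays` and `call_with_retries`, the base delay, and the cap are all assumptions made for the example.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one sleep duration per retry: capped exponential backoff, full jitter."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # full jitter: uniform in [0, ceiling)


def call_with_retries(operation, max_attempts=4, sleep=time.sleep, **backoff):
    """Run operation(); on an exception, wait a jittered delay and try again."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error
            sleep(delay)
```

Retrying is only safe when the operation is idempotent, which is why the two items sit together in the summary. In practice you would also narrow the `except` clause to errors that are known to be transient.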
Reference patterns
- Serverless APIs (API Gateway + Lambda + DynamoDB)
- Container platforms (ECS/Fargate/EKS) with ALB/NLB
- Data lakes (S3 + Glue + Athena + Lake Formation)
- Event-driven (SNS/SQS/EventBridge/Kinesis)
Security
- IAM boundaries; SCPs; centralized logging; secret rotation (Secrets Manager)
Reliability
- Multi-AZ; backups and PITR; health checks; chaos drills
Performance
- Caching (CloudFront/Redis), async processing, right instance families
Cost
- SP/RI/Graviton; lifecycle policies; cost allocation tags; budgets/alerts
FAQ
Q: ECS or EKS?
A: ECS for simplicity on AWS; EKS for portability/ecosystem needs.
Executive Summary
This guide distills AWS architecture patterns aligned to the Well-Architected Framework. It includes production-ready reference architectures, IaC, security baselines, observability, DR, cost/sustainability strategies, and runbooks.
Well-Architected Pillars (Actionable Controls)
Operational Excellence
runbooks:
  - name: "ALB 5xx surge"
    steps: ["check recent deploys", "rollback", "scale ASG", "inspect app logs"]
change_management:
  approvals: 1
  deployment: blue_green
Security
security_baseline:
  kms: required
  secrets_manager: required
  guardduty: enabled
  security_hub: enabled
  config: { ruleset: aws-foundational }
  inspector: { v2: enabled }
  sso: enforced
Reliability
reliability:
  multi_az: true
  multi_region: critical_services
  health_checks: route53
  autoscaling: target_tracking
Performance Efficiency
performance:
  graviton: preferred
  elb: alb for http, nlb for tcp
  caching: cloudfront+elasticache
Cost Optimization
cost:
  compute_optimizer: enabled
  savings_plans: 1y partial
  s3_lifecycle: glacier_deep_archive after 180d
Sustainability
sustainability:
  rightsize: continuous
  idle_shutdown: nonprod_nights
  managed_services: preferred
Multi-Account Landing Zone (Control Tower)
ou:
  - security
  - infrastructure
  - workloads
  - sandbox
scps:
  - deny_root
  - deny_unapproved_regions
  - deny_iam_star_actions
identity:
  sso: iam_identity_center
  permission_sets: [admin, poweruser, read_only]
{
"Version": "2012-10-17",
"Statement": [
{"Sid": "DenyRoot", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": {"StringLike": {"aws:PrincipalArn": "arn:aws:iam::*:root"}}}
]
}
VPC Networking Patterns
graph TD
Hub((Hub VPC)) --- TGW[Transit Gateway]
Spoke1((Spoke VPC 1)) --- TGW
Spoke2((Spoke VPC 2)) --- TGW
PrivateLink[Interface Endpoints] --- Spoke1
# Terraform VPC module
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
name = "workloads"
cidr = "10.0.0.0/16"
azs = ["us-east-1a","us-east-1b"]
private_subnets = ["10.0.1.0/24","10.0.2.0/24"]
public_subnets = ["10.0.101.0/24","10.0.102.0/24"]
enable_nat_gateway = true
}
# Interface Endpoint (CloudFormation)
Type: AWS::EC2::VPCEndpoint
Properties:
ServiceName: com.amazonaws.us-east-1.ssm
VpcId: !Ref VpcId
SubnetIds: [!Ref PrivateSubnet1, !Ref PrivateSubnet2]
SecurityGroupIds: [!Ref EndpointSG]
Security Baseline (KMS, Secrets, GuardDuty, WAF)
resource "aws_kms_key" "default" { enable_key_rotation = true }
resource "aws_secretsmanager_secret" "app" { name = "app/db" }
resource "aws_guardduty_detector" "main" { enable = true }
resource "aws_wafv2_web_acl" "api" {
  name  = "api-waf"
  scope = "REGIONAL"
  default_action {
    allow {}
  }
  # visibility_config (required by the provider) is omitted here for brevity
}
apiVersion: v1
kind: ConfigMap
metadata: { name: cni-config, namespace: kube-system }
data:
aws-node: |
enableNetworkPolicy: true
Reference Architectures
Serverless Web + API
graph LR
CF[CloudFront] --> APIGW[API Gateway]
APIGW --> Lambda
Lambda --> Dynamo[DynamoDB]
S3[S3 Static Site] --> CF
# SAM/CloudFormation snippet for Lambda + API
Containers on ECS/Fargate
resource "aws_ecs_cluster" "main" { name = "apps" }
resource "aws_ecs_service" "web" {
cluster = aws_ecs_cluster.main.id
launch_type = "FARGATE"
desired_count = 3
}
EKS + Ingress + ALB
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: web, port: { number: 80 } } }
Data Lake (S3 + Glue + Athena + Lake Formation)
resource "aws_s3_bucket" "lake" { bucket = "company-lake" }
resource "aws_glue_catalog_database" "db" { name = "lake_db" }
Event-Driven (EventBridge / SQS / SNS)
Type: AWS::Events::Rule
Properties:
EventPattern: { source: ["app.orders"] }
Targets: [{ Arn: !GetAtt Queue.Arn, Id: q1 }]
Streaming (Kinesis / MSK)
resource "aws_kinesis_stream" "events" {
  name        = "events"
  shard_count = 2
}
Web Apps (ALB/NLB + ASG)
resource "aws_autoscaling_group" "web" {
desired_capacity = 4
max_size = 12
min_size = 2
}
IaC: CloudFormation, CDK, Terraform
// CDK
new s3.Bucket(this, 'Assets', { encryption: s3.BucketEncryption.S3_MANAGED })
# Terraform module invocation
module "alb" {
  source = "terraform-aws-modules/alb/aws"
  name   = "web"
}
CI/CD (CodePipeline, GitHub Actions)
name: infra-ci
on: [push]
jobs:
  tf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform validate && terraform plan
# CodePipeline (YAML-like pseudo)
stages:
- source: github
- build: codebuild
- deploy: cloudformation
Observability (CloudWatch / OTEL)
{
"widgets": [
{"type": "metric", "properties": { "metrics": [["AWS/ApplicationELB","HTTPCode_Target_5XX_Count","LoadBalancer","alb" ]], "stat": "Sum", "period": 300 }}
]
}
receivers:
otlp: { protocols: { http: {}, grpc: {} } }
exporters:
awsemf: { namespace: "EKS/Apps" }
service:
pipelines:
metrics: { receivers: [otlp], exporters: [awsemf] }
Backup and DR
# AWS Backup plan
plan:
rules:
- name: daily-backup
target_vault_name: default
schedule_expression: cron(0 5 * * ? *)
lifecycle: { delete_after_days: 30 }
DR Strategy
- RPO: 15m; RTO: 1h
- Cross-region replicas (RDS/DynamoDB/S3)
- Route 53 failover health checks
Cost Optimization
- Rightsize with Compute Optimizer; adopt Graviton
- Savings Plans/Reserved Instances; Spot for flexible workloads
- S3 lifecycle: IA/Glacier tiers; compress objects
service,current_usd_month,optimized_usd_month,delta
EC2,12000,9000,-3000
RDS,7000,5900,-1100
S3,1800,1200,-600
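To make the table above actionable, it helps to total it and express the saving as a percentage. The sketch below parses the same rows and checks that each delta equals optimized minus current; the function name `summarize` is an assumption for illustration.

```python
import csv
import io

COST_CSV = """service,current_usd_month,optimized_usd_month,delta
EC2,12000,9000,-3000
RDS,7000,5900,-1100
S3,1800,1200,-600
"""


def summarize(csv_text):
    """Return (current_total, optimized_total, savings_pct) for the cost table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    current = sum(int(r["current_usd_month"]) for r in rows)
    optimized = sum(int(r["optimized_usd_month"]) for r in rows)
    # Sanity check: the delta column must agree with the two cost columns.
    for r in rows:
        assert int(r["delta"]) == int(r["optimized_usd_month"]) - int(r["current_usd_month"])
    return current, optimized, round(100 * (current - optimized) / current, 1)


print(summarize(COST_CSV))  # (20800, 16100, 22.6)
```

For these three services the monthly total drops from 20,800 to 16,100 USD, a saving of about 22.6%.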
Performance Tuning
Aurora
- Reader endpoints; parallel query; serverless v2 for spiky
DynamoDB
- On-demand for unknown; provisioned + auto scaling for steady; GSIs
ElastiCache
- Lazy loading; TTLs; cluster mode
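Lazy loading with TTLs, as listed above, is the cache-aside pattern: read the cache first, fall back to the source on a miss, and let entries expire. The sketch below uses an in-memory dict as a stand-in for ElastiCache; the class name `LazyCache` and the injected `loader` and `clock` are assumptions for the example.

```python
import time


class LazyCache:
    """Cache-aside (lazy loading): read the cache first, load from the source on a miss."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]  # hit: still within its TTL
        value = self._loader(key)  # miss or expired: go to the source of truth
        self._entries[key] = (value, self._clock() + self._ttl)
        return value
```

The TTL bounds how stale a cached value can get; shorter TTLs trade hit rate for freshness.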
Sustainability Practices
- Prefer managed and serverless services
- Decommission idle/dev nightly
- Optimize storage tiers and data retention
Deployments: Blue/Green/Canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
strategy:
blueGreen:
activeService: web
previewService: web-preview
autoPromotionEnabled: false
Runbooks and SOPs
ALB 5xx Surge
- Check recent deploys; roll back if necessary
- Increase ASG desired; inspect target health
RDS Connection Saturation
- Add reader; optimize pool settings; increase instance size temporarily
S3 403s
- Check bucket policy and IAM changes; AWS Config timeline
Extended FAQ (1–160)
Q: Single vs multi-account?
A: Multi-account with Control Tower for isolation and guardrails.
Q: Cross-region replication?
A: Enable for critical data (S3 CRR, DynamoDB global tables, Aurora global DB).
Q: Public vs private subnets?
A: Expose only ALB/NLB/EIPs; keep apps private with NAT.
Q: ALB vs NLB?
A: ALB for HTTP routing; NLB for TCP/UDP and extreme throughput.
Q: ASG scaling policy?
A: Target tracking on CPU/requests; cooldowns.
Identity and IAM Patterns
{
"Version": "2012-10-17",
"Statement": [
{ "Sid": "DenyConsoleWithoutMFA", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } } }
]
}
{
"Version": "2012-10-17",
"Statement": [
{ "Sid": "RequireBoundary", "Effect": "Deny", "Action": ["iam:CreateRole", "iam:CreateUser"], "Resource": "*", "Condition": { "StringNotLike": { "iam:PermissionsBoundary": "arn:aws:iam::*:policy/*-boundary" } } }
]
}
permission_sets:
- name: admin
policies: [AdministratorAccess]
- name: read_only
policies: [ReadOnlyAccess]
S3 Security
{
"Version": "2012-10-17",
"Statement": [
{ "Sid": "DenyInsecureTransport", "Effect": "Deny", "Principal": "*", "Action": "s3:*", "Resource": ["arn:aws:s3:::bucket","arn:aws:s3:::bucket/*"], "Condition": { "Bool": { "aws:SecureTransport": "false" } } }
]
}
resource "aws_s3_bucket_public_access_block" "this" {
bucket = aws_s3_bucket.bucket.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
# S3 Access Point policy (restrict VPC access)
RDS/Aurora HA and Failover
resource "aws_rds_cluster" "aurora" {
engine = "aurora-postgresql"
master_username = "app"
master_password = random_password.db.result
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
}
resource "aws_rds_cluster_instance" "aurora_instances" {
count = 2
cluster_identifier = aws_rds_cluster.aurora.id
instance_class = "db.r6g.large"
engine = aws_rds_cluster.aurora.engine
publicly_accessible = false
}
- Use reader endpoints for read scaling
- Enable Performance Insights
- Configure failover priority for multi-AZ
DynamoDB Design
- Choose partition key with high cardinality
- Use sort keys for range queries
- GSIs for alternate access patterns
- TTL for item expiry
- Streams for change data capture (Lambda/Kinesis)
resource "aws_dynamodb_table" "orders" {
name = "orders"
billing_mode = "PAY_PER_REQUEST"
hash_key = "orderId"
attribute {
  name = "orderId"
  type = "S"
}
ttl {
  attribute_name = "ttl"
  enabled        = true
}
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
}
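When a naturally hot key cannot be avoided (for example, everything written under today's date), one common way to follow the high-cardinality guidance above is write sharding: append a deterministic suffix so writes spread over several partition key values. The sketch below is an illustration under assumptions; the shard count of 8 and the function names are hypothetical.

```python
import hashlib

SHARDS = 8  # assumed shard count; size it to the write rate you need to spread


def sharded_key(hot_key, item_id, shards=SHARDS):
    """Append a deterministic shard suffix so one hot key spreads over several partitions."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    return f"{hot_key}#{digest[0] % shards}"


def all_shard_keys(hot_key, shards=SHARDS):
    """Keys a reader must query (and merge) to see every item under hot_key."""
    return [f"{hot_key}#{n}" for n in range(shards)]
```

The trade-off is on the read side: a query for the logical key becomes one query per shard, merged by the caller.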
Messaging (SNS/SQS)
resource "aws_sqs_queue" "events" {
name = "events"
redrive_policy = jsonencode({ deadLetterTargetArn = aws_sqs_queue.dlq.arn, maxReceiveCount = 5 })
fifo_queue = false
}
resource "aws_sns_topic" "notifications" { name = "notifications" }
resource "aws_sns_topic_subscription" "sub" {
  topic_arn = aws_sns_topic.notifications.arn
  protocol  = "sqs"
  endpoint  = aws_sqs_queue.events.arn
}
Streaming (Kinesis / MSK)
resource "aws_kinesis_stream" "clicks" {
  name             = "clicks"
  shard_count      = 2
  retention_period = 48
}
- For MSK, prefer IAM auth or mTLS; isolate in private subnets
ECS/EKS Security
# ECS task role
resource "aws_iam_role" "task" { assume_role_policy = data.aws_iam_policy_document.ecs_assume.json }
# EKS IRSA (IAM Roles for Service Accounts)
apiVersion: v1
kind: ServiceAccount
metadata:
name: s3-reader
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123:role/s3-reader
# NetworkPolicy (Calico/Cilium)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
spec:
podSelector: { matchLabels: { app: web } }
policyTypes: [Ingress, Egress]
ingress: [{ from: [{ podSelector: { matchLabels: { app: api } } }] }]
egress: []
Lambda Patterns
DeadLetterConfig:
TargetArn: arn:aws:sqs:us-east-1:123:lambda-dlq
RetryPolicy:
MaximumRetryAttempts: 2
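def handler(event, context):
    # seen(), save(), and ok() are placeholder helpers backed by a durable store
    idempotency_key = event.get('requestId')
    if seen(idempotency_key):
        return ok()  # duplicate delivery: skip reprocessing
    save(idempotency_key)
    # process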
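A runnable version of the idempotency idea above might look like the following. It is a sketch under assumptions: `IdempotencyStore` is an in-memory stand-in for a durable store with conditional writes, and `process` is hypothetical business logic. The key difference from a check-then-save sequence is that `claim` checks and records in a single step.

```python
class IdempotencyStore:
    """In-memory stand-in for a durable store with conditional writes."""

    def __init__(self):
        self._results = {}

    def claim(self, key):
        """Return True only for the first caller to claim the key."""
        if key in self._results:
            return False
        self._results[key] = None
        return True

    def complete(self, key, result):
        self._results[key] = result

    def result(self, key):
        return self._results.get(key)


store = IdempotencyStore()


def process(event):
    # Hypothetical business logic.
    return {"status": "processed", "requestId": event["requestId"]}


def handler(event, context=None):
    key = event["requestId"]
    if not store.claim(key):
        return store.result(key)  # duplicate delivery: return the earlier result
    result = process(event)
    store.complete(key, result)
    return result
```

A duplicate that arrives while the first invocation is still running gets `None` back here; a production version would need to decide whether to wait, reject, or retry in that case, and would expire old keys.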
CloudFront and WAF
resource "aws_cloudfront_distribution" "cdn" {
  enabled = true
  default_cache_behavior {
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
  }
  restrictions {
    geo_restriction { restriction_type = "none" }
  }
  viewer_certificate { cloudfront_default_certificate = true }
}
Route 53 Patterns
resource "aws_route53_record" "weighted" {
  name           = "api"
  type           = "A"
  set_identifier = "blue"
  weighted_routing_policy { weight = 80 }
}
resource "aws_route53_health_check" "api" {
  type          = "HTTPS"
  resource_path = "/health"
  fqdn          = "api.company.com"
}
Multi-Region Patterns
Active/Passive
- Primary region handles writes; replicate to secondary
- Route 53 failover on health check
Active/Active
- Global tables (DynamoDB), Aurora Global Database
- Conflict resolution strategy
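The simplest conflict resolution strategy for active/active writes is last-writer-wins, which is also how DynamoDB global tables reconcile concurrent updates. The sketch below shows the idea with a deterministic tie-breaker; the `Version` type and its field names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at_ms: int  # writer's timestamp
    region: str         # tie-breaker so every replica picks the same winner


def resolve(a, b):
    """Last-writer-wins: keep the newer write; break timestamp ties by region name."""
    return max(a, b, key=lambda v: (v.updated_at_ms, v.region))
```

Last-writer-wins silently discards the losing write, so it suits data where the latest value is all that matters. Counters, carts, and similar merge-sensitive data usually need an application-level merge instead.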
CDK Examples
import * as s3 from 'aws-cdk-lib/aws-s3'
const bucket = new s3.Bucket(this, 'Assets', { blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL, encryption: s3.BucketEncryption.S3_MANAGED })
CloudWatch Alarms and Dashboards
resource "aws_cloudwatch_metric_alarm" "alb_5xx" {
alarm_name = "alb-5xx"
namespace = "AWS/ApplicationELB"
metric_name = "HTTPCode_Target_5XX_Count"
statistic = "Sum"
period = 60
evaluation_periods = 5
threshold = 100
comparison_operator = "GreaterThanThreshold"
}
{ "widgets": [ { "type": "metric", "properties": { "metrics": [["AWS/ECS","CPUUtilization","ClusterName","apps"]] } } ] }
OTEL on EKS/ECS
receivers:
otlp: { protocols: { grpc: {}, http: {} } }
exporters:
awsemf: { namespace: EKS/Apps }
awsxray: {}
service:
pipelines:
traces: { receivers: [otlp], exporters: [awsxray] }
metrics: { receivers: [otlp], exporters: [awsemf] }
Backup/Restore Scripts
aws rds create-db-snapshot --db-instance-identifier app-db --db-snapshot-identifier app-db-$(date +%F)
aws s3 sync s3://bucket s3://bucket-backup --storage-class GLACIER
DR Playbooks
RDS Primary Down
- Promote read replica; re-point endpoints; increase capacity
EKS Cluster Impaired
- Spin secondary cluster via IaC; restore stateful workloads from backups
Cost Dashboards (CUR + Athena)
SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost
FROM cur
WHERE bill_billing_period_start_date >= date_trunc('month', current_date)
GROUP BY 1 ORDER BY 2 DESC;
Compute Optimizer and Savings Plans
- Review rightsizing recommendations weekly
- Target 60–80% coverage with Savings Plans; avoid over-commit
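As a rough check on the coverage target above, coverage can be computed as covered spend divided by total eligible spend. This is a simplified sketch: the function names are hypothetical, and the inputs are assumed to be on-demand-equivalent USD figures taken from your own cost reports.

```python
def coverage_pct(covered_usd, on_demand_usd):
    """Share of eligible compute spend covered by commitments, as a percentage."""
    total = covered_usd + on_demand_usd
    if total == 0:
        return 0.0
    return round(100 * covered_usd / total, 1)


def in_target_band(pct, low=60.0, high=80.0):
    """True when coverage sits inside the 60-80% band suggested above."""
    return low <= pct <= high


print(coverage_pct(7000, 3000))  # 70.0
```

Coverage above the band is not automatically wrong, but it leaves less room for usage to shrink before the commitment goes unused.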
Sustainability Dashboards
- Track idle resources; emissions estimates (third-party tooling)
- Prefer Graviton and managed services
Detailed Runbooks
NAT Gateway Cost Spike
- Check egress and VPC endpoints; add Interface/Gateway endpoints; reduce cross-AZ traffic
S3 5xx Increase
- Verify bucket throttling, retry policy; consider request rate patterns
Extended FAQ (161–280)
Q: When to use Transit Gateway vs peering?
A: TGW for hub-and-spoke across many VPCs; peering for a few.
Q: PrivateLink vs VPC peering?
A: PrivateLink to expose individual services; peering for full routing between VPCs.
Q: IAM least privilege?
A: Permissions boundaries and IAM Access Analyzer.
Q: KMS key policy tips?
A: Separate key administrators from key users.
Q: RDS Multi-AZ vs read replica?
A: Multi-AZ for HA; replicas for read scaling.
Q: Aurora Serverless v2?
A: Great for spiky workloads.
Q: DynamoDB hot partitions?
A: Randomize keys; adapt access patterns.
Q: SQS FIFO?
A: Use for ordered processing; mind the throughput limits.
Q: Kinesis vs MSK?
A: Kinesis is the AWS-native managed option; MSK for Kafka compatibility.
Q: ECS vs EKS?
A: ECS is simpler; EKS for Kubernetes ecosystems.
Q: IRSA?
A: IAM roles bound to Kubernetes service accounts.
Q: Lambda cold starts?
A: Provisioned concurrency.
Q: CloudFront compression?
A: Enable it; configure it through cache policies.
Q: Route 53 latency routing?
A: Directs users to the region with the lowest latency.
Q: Global Aurora vs DynamoDB global tables?
A: Relational engine vs NoSQL trade-offs.
Q: CDK vs Terraform?
A: Follow org standards; both are fine.
Q: CloudWatch cost?
A: Metric filters; log retention.
Q: X-Ray sampling?
A: Tailor for cost; error-based full sampling.
Q: Backup vault locks?
A: Enable for immutability.
Q: Savings Plans vs RIs?
A: SPs are more flexible.
Q: Spot interruptions?
A: Handle with checkpoints and graceful drain.
Q: Graviton readiness?
A: Rebuild images for arm64; benchmark.
Q: S3 strong consistency?
A: Yes; design accordingly.
Q: ALB target groups?
A: Separate health checks per service.
Q: NLB TLS termination?
A: Use a TLS listener to terminate on the NLB; use a TCP listener for TLS pass-through.
Q: WAF managed rules?
A: Enable; tune exceptions.
Q: Inspector coverage?
A: Ensure the SSM Agent runs on EC2; enable ECR scanning for container images.
Q: Config rules drift?
A: Remediate with SSM Automation.
Q: EKS upgrades?
A: Blue/green nodes; surge upgrades.
Q: Fargate vs EC2 on ECS?
A: Fargate for ops simplicity.
Q: CloudFormation stack sets?
A: Multi-account deployments.
Q: S3 inventory?
A: Enable and scan for public objects.
Q: Lambda DLQ?
A: SQS or SNS; alert on growth.
Q: Glue ETL costs?
A: DPU tuning; job bookmarks.
Q: Athena performance?
A: Partition and compress; CTAS.
Q: ECR scanning?
A: Enable enhanced scanning.
Q: VPC Lattice?
A: Consider for service-to-service across VPCs.
Q: Security Hub standards?
A: Enable CIS/FSBP; fix criticals.
Q: Final readiness?
A: SLOs healthy; costs tracked; DR tested.
CloudTrail, AWS Config, and Security Hub (Organization)
resource "aws_cloudtrail" "org" {
name = "org-trail"
s3_bucket_name = aws_s3_bucket.trail.id
include_global_service_events = true
is_multi_region_trail = true
is_organization_trail = true
}
resource "aws_config_configuration_recorder" "rec" {
  name     = "default"
  role_arn = aws_iam_role.config.arn
  recording_group {
    all_supported                 = true
    include_global_resource_types = true
  }
}
resource "aws_config_delivery_channel" "chan" { s3_bucket_name = aws_s3_bucket.config.id }
resource "aws_securityhub_account" "hub" { enable_default_standards = true }
resource "aws_securityhub_standards_subscription" "fsbp" { standards_arn = "arn:aws:securityhub:us-east-1::standards/aws-foundational-security-best-practices/v/1.0.0" }
KMS Key Policies and Grants
{
"Version": "2012-10-17",
"Statement": [
{"Sid": "KeyAdmins","Effect":"Allow","Principal":{"AWS":["arn:aws:iam::123:role/kms-admins"]},"Action":["kms:*"],"Resource":"*"},
{"Sid": "UseKey","Effect":"Allow","Principal":{"AWS":["arn:aws:iam::123:role/app-ec2","arn:aws:iam::123:role/app-lambda"]},"Action":["kms:Encrypt","kms:Decrypt","kms:GenerateDataKey*"],"Resource":"*"}
]
}
resource "aws_kms_grant" "ec2" {
  key_id            = aws_kms_key.app.key_id
  grantee_principal = aws_iam_role.app_ec2.arn
  operations        = ["Encrypt", "Decrypt", "GenerateDataKey"]
}
SSM Parameter Store and Secrets Manager Patterns
import { SSMClient, GetParameterCommand } from '@aws-sdk/client-ssm'
const ssm = new SSMClient({})
const param = await ssm.send(new GetParameterCommand({ Name: "/app/db/uri", WithDecryption: true }))
import { SecretsManagerClient, GetSecretValueCommand } from '@aws-sdk/client-secrets-manager'
const sm = new SecretsManagerClient({})
const secret = await sm.send(new GetSecretValueCommand({ SecretId: "app/db" }))
Lake Formation Permissions and Row-Level Security
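-- Lake Formation grants (illustrative pseudo-SQL; actual grants go through the Lake Formation console, API, or CLI)
GRANT SELECT ON TABLE lake_db.sales TO ROLE 'analyst';
-- Row-level access is defined with a data filter, shown here as a filter expression
GRANT SELECT(filters: region IN ('us','eu')) ON TABLE lake_db.sales TO ROLE 'regional_analyst';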
- Use LF-Tags for tag-based access control
- Register S3 locations; grant data location permissions
Glue Jobs and Workflows
# Glue ETL job (PySpark)
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
# read from S3 → transform → write partitioned Parquet
# Glue Workflow JSON (pseudo)
workflow:
triggers: [ crawl_raw, etl_sales, update_catalog ]
EMR on EKS
apiVersion: emrcontainers.services.k8s.aws/v1alpha1
kind: VirtualCluster
metadata: { name: emr-eks, namespace: emr }
spec: { containerProvider: { id: eks-cluster, info: { eksInfo: { namespace: spark } } } }
SageMaker Patterns
# Batch Transform
from sagemaker.transformer import Transformer
tr = Transformer(model_name='nlp', instance_count=2, instance_type='ml.m5.xlarge')
tr.transform('s3://input', content_type='text/csv', split_type='Line')
# Real-time endpoint (multi-model)
from sagemaker.multidatamodel import MultiDataModel
mdm = MultiDataModel(name='nlp-mmm', model_data_prefix='s3://models/')
mdm.deploy(initial_instance_count=2, instance_type='ml.c5.large')
API Gateway Patterns
Type: AWS::ApiGateway::Authorizer
Properties:
Name: jwt
Type: COGNITO_USER_POOLS
ProviderARNs: [ arn:aws:cognito-idp:...:userpool/... ]
Type: AWS::WAFv2::WebACLAssociation
Properties: { ResourceArn: !Ref ApiGatewayStageArn, WebACLArn: !Ref WebACLArn }
# Stage variables for blue/green
Variables: { COLOR: blue }
Step Functions (Sagas and Retries)
{
"StartAt": "ReserveInventory",
"States": {
"ReserveInventory": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke", "Retry": [{"ErrorEquals":["States.Timeout"],"MaxAttempts":3,"BackoffRate":2}], "Next":"ChargePayment" },
"ChargePayment": { "Type": "Task", "Resource": "...", "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateInventory"}], "Next":"Ship" },
"CompensateInventory": { "Type": "Task", "Resource": "...", "End": true },
"Ship": { "Type": "Task", "Resource": "...", "End": true }
}
}
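The state machine above is a saga: each step has a compensating action, and a failure triggers the compensations for the steps that already completed. The same control flow can be sketched in plain Python. The `run_saga` helper and the step names in the example are assumptions for illustration, not part of Step Functions.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                undo()
            return False
        done.append(compensate)
    return True


def _declined():
    raise RuntimeError("card declined")


events = []
succeeded = run_saga([
    (lambda: events.append("reserve_inventory"), lambda: events.append("release_inventory")),
    (_declined, lambda: events.append("refund_payment")),
    (lambda: events.append("ship"), lambda: None),
])
print(succeeded, events)  # False ['reserve_inventory', 'release_inventory']
```

Only the reservation is compensated, because the payment step never completed. Compensations should themselves be idempotent, since they may be retried.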
EventBridge Scheduler
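resource "aws_scheduler_schedule" "nightly" {
  name                = "nightly"
  schedule_expression = "cron(0 3 * * ? *)"
  # flexible_time_window is a required block; OFF runs at the exact scheduled time
  flexible_time_window { mode = "OFF" }
  target {
    arn      = aws_lambda_function.job.arn
    role_arn = aws_iam_role.scheduler.arn
  }
}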
ALB Advanced Routing
Rules:
- Conditions: [ { HostHeader: { Values: ["blue.app.com"] } } ]
Actions: [ { Type: forward, TargetGroupArn: arn:tg:blue } ]
- Conditions: [ { PathPattern: { Values: ["/api/*"] } } ]
Actions: [ { Type: forward, TargetGroupArn: arn:tg:api } ]
EBS/EFS Performance
- EBS: gp3 with provisioned throughput for IO intensive
- EFS: lifecycle policies; access points; performance mode
Aurora/DynamoDB Advanced
Aurora
- Reader lag alarms; failover testing; query plan cache
DynamoDB
- Adaptive capacity; DAX for read latency; streams → Kinesis
ElastiCache Replication/Sharding
resource "aws_elasticache_replication_group" "cache" {
  automatic_failover_enabled = true
  engine                     = "redis"
  node_type                  = "cache.r6g.large"
  num_node_groups            = 2
  replicas_per_node_group    = 1
}
CloudFront Cache Policies
{ "ParametersInCacheKeyAndForwardedToOrigin": { "CookiesConfig": { "CookieBehavior": "none" }, "HeadersConfig": { "HeaderBehavior": "none" }, "QueryStringsConfig": { "QueryStringBehavior": "whitelist", "QueryStrings": { "Items": ["q"] } } } }
GuardDuty Findings and Auto-Remediation
# EventBridge rule on GuardDuty finding → SSM Automation document
# SSM Automation: isolate instance (detach from ELB, add SG)
Incident Manager
resource "aws_ssmincidents_response_plan" "p1" {
  name = "p1-api"
  incident_template {
    title  = "P1 API outage"
    impact = 1
  }
}
CUR Athena Setup and Dashboards
CREATE EXTERNAL TABLE IF NOT EXISTS cur (...)
PARTITIONED BY (bill_billing_period_start_date date)
STORED AS PARQUET LOCATION 's3://cur-bucket/cur/'
SELECT line_item_product_code, SUM(line_item_unblended_cost) AS cost FROM cur WHERE bill_billing_period_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 2 DESC;
Sustainability Actions
- Tag resources by environment; nightly shutdown for nonprod
- Use Graviton and managed serverless
- Optimize data storage/retention and caching
Additional Runbooks
DynamoDB Throttling
- Increase RCUs/WCUs; enable auto scaling; batch writes; add backoff logic
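The "batch writes" and "backoff logic" items above work together: send items in batches of at most 25 (the BatchWriteItem limit) and resend only what was throttled, with a jittered delay between rounds. The sketch below keeps the AWS call abstract. `send_batch` is a hypothetical callable that returns the items it could not write, and the retry count and delays are assumptions.

```python
import random
import time

MAX_BATCH = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunks(items, size=MAX_BATCH):
    """Split items into consecutive lists of at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_all(items, send_batch, max_retries=4, sleep=time.sleep):
    """Send items in batches; resend whatever send_batch reports as unprocessed.

    send_batch is a hypothetical callable: it takes a list of items and returns
    the sublist that was throttled (an empty list means everything was written).
    """
    failed = []
    for batch in chunks(items):
        pending = send_batch(batch)
        for retry in range(max_retries):
            if not pending:
                break
            # Capped exponential backoff with full jitter before resending.
            sleep(random.uniform(0, min(5.0, 0.05 * 2 ** retry)))
            pending = send_batch(pending)
        failed.extend(pending)
    return failed
```

Anything still unprocessed after the retries is returned to the caller, which can queue it for later rather than dropping it.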
ECS Task Failing Health Checks
- Check app logs; target group health; CPU/memory; container limits
Extended FAQ (281–420)
Q: Glue vs EMR?
A: Glue for serverless ETL; EMR for custom Hadoop/Spark.
Q: Athena partitioning?
A: Partition by date and other low-cardinality columns you filter on; avoid high-cardinality partition keys.
Q: Kinesis scaling?
A: Increase shards; use enhanced fan-out.
Q: MSK security?
A: Private subnets; IAM auth or mTLS.
Q: PrivateLink limits?
A: Endpoints are provisioned and billed per AZ; factor in the cost.
Q: NAT costs?
A: Use endpoints; consolidate egress.
Q: S3 requester pays?
A: Consider for shared public datasets.
Q: SQS long polling?
A: Enable to reduce costs.
Q: Lambda concurrency?
A: Set reserved concurrency; protect downstreams.
Q: ALB WAF?
A: Associate regional WAF; tune rules.
Q: Cross-account access?
A: Resource policies; IAM roles.
Q: Lake Formation tags?
A: Tag datasets by sensitivity; grant by LF-Tag.
Q: EKS security?
A: IRSA, PSA, NetworkPolicy, EDR.
Q: ECS secrets?
A: SSM/Secrets Manager with task role.
Q: RDS connections?
A: RDS Proxy or pooling; tune timeouts.
Q: Aurora storage scaling?
A: Automatic, up to 128 TiB.
Q: DAX cache?
A: Use for read-heavy workloads with strict latency needs.
Q: CloudFront invalidations?
A: Automate on deploy; cache policies.
Q: GuardDuty triage?
A: Severity; automation; isolate.
Q: Incident Manager?
A: On-call escalation; runbooks.
Q: X-Ray sampling?
A: Tailor rules to error-prone paths.
Q: EKS upgrades?
A: Use surge; drain; test.
Q: S3 access points?
A: Scoped access per app.
Q: Glue bookmarks?
A: Incremental ETL runs.
Q: K8s CNI choice?
A: Cilium/Calico; security features.
Q: Terraform drift?
A: Detect and reconcile; tags.
Q: Budgets alarms?
A: Notify and block.
Q: Savings Plans strategy?
A: Cover base; avoid over-commit.
Q: Spot best practices?
A: Diversify instance types; capacity-optimized.
Q: Final acceptance?
A: SLOs, DR tested, costs tracked, security baseline.
Control Tower Guardrails (SCP Examples)
{
"Version": "2012-10-17",
"Statement": [
{ "Sid": "DenyUnapprovedRegions", "Effect": "Deny", "Action": "*", "Resource": "*", "Condition": { "StringNotEquals": { "aws:RequestedRegion": ["us-east-1","us-west-2","eu-west-1"] } } },
{ "Sid": "DenyWithoutTLS", "Effect": "Deny", "Action": "s3:*", "Resource": "*", "Condition": { "Bool": { "aws:SecureTransport": "false" } } }
]
}
IAM Access Analyzer
resource "aws_accessanalyzer_analyzer" "org" {
  analyzer_name = "org-access"
  type          = "ORGANIZATION"
}
- Review external access findings weekly; auto-remediate with SSM Automation where safe
SSM Patch Manager
resource "aws_ssm_patch_baseline" "linux" {
name = "linux-baseline"
approved_patches_compliance_level = "CRITICAL"
operating_system = "AMAZON_LINUX_2"
}
resource "aws_ssm_maintenance_window" "patch" {
  name     = "patch-tuesday"
  schedule = "cron(0 3 ? * TUE *)"
  duration = 3
  cutoff   = 1
}
CloudFront Functions / Lambda@Edge
function handler(event) {
var req = event.request
// Viewer-request function: mark the forwarded protocol for the origin
req.headers['x-forwarded-proto'] = { value: 'https' }
return req
}
exports.handler = async (event) => {
const resp = event.Records[0].cf.response
resp.headers['strict-transport-security'] = [{ key: 'Strict-Transport-Security', value: 'max-age=31536000; includeSubDomains' }]
return resp
}
S3 Replication Policies
{
"Role": "arn:aws:iam::123:role/s3-replication",
"Rules": [
{ "ID": "cross-region", "Status": "Enabled", "Priority": 1, "Filter": { "Prefix": "" }, "DeleteMarkerReplication": {"Status": "Enabled"}, "Destination": { "Bucket": "arn:aws:s3:::bucket-dr", "StorageClass": "STANDARD_IA" } }
]
}
EKS Add-ons
# CoreDNS values (increase replicas)
replicaCount: 3
# VPC CNI config (ipam)
WARM_ENI_TARGET: 1
WARM_IP_TARGET: 5
# EBS CSI driver StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: gp3 }
provisioner: ebs.csi.aws.com
parameters: { type: gp3, encrypted: "true" }
WAF Managed Rules Tuning
rule: AWS-AWSManagedRulesCommonRuleSet
exclusions:
- name: GenericLFI_BODY
Inspector Automation
# EventBridge rule -> SSM Automation to quarantine EC2 with critical finding
Systems Manager Automation Runbooks
schemaVersion: '0.3'
description: Isolate instance from ALB
documentType: Automation
mainSteps:
- name: DetachFromTargetGroup
action: 'aws:executeAwsApi'
inputs: { Service: elbv2, Api: DeregisterTargets, TargetGroupArn: '{{tg}}', Targets: [ { Id: '{{instanceId}}' } ] }
EventBridge Buses and Rules
resource "aws_cloudwatch_event_bus" "apps" { name = "apps-bus" }
resource "aws_cloudwatch_event_rule" "order_created" {
  event_bus_name = aws_cloudwatch_event_bus.apps.name
  event_pattern = jsonencode({
    source        = ["app.orders"]
    "detail-type" = ["order-created"]
  })
}
Data Transfer Optimization (Gateway Endpoints)
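# Route tables come from the vpc module defined earlier; aws_vpc exposes no private_route_table_ids attribute
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}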
VPC Lattice
resource "aws_vpclattice_service_network" "sn" { name = "apps" }
resource "aws_vpclattice_service" "svc" {
  name      = "orders"
  auth_type = "AWS_IAM"
}
Additional Dashboards and Alarms
{ "widgets": [ { "type":"metric", "properties": { "metrics": [["AWS/RDS","CPUUtilization","DBClusterIdentifier","aurora"]], "stat":"Average","period":60 } } ] }
resource "aws_cloudwatch_metric_alarm" "dynamo_throttle" {
alarm_name = "dynamo-throttle"
namespace = "AWS/DynamoDB"
metric_name = "ThrottledRequests"
comparison_operator = "GreaterThanThreshold"
threshold = 10
evaluation_periods = 5
period = 60
statistic = "Sum"
}
Extended FAQ (421–500)
Q: Control Tower guardrails?
A: SCPs to deny risky actions; region controls.
Q: Access Analyzer cadence?
A: Weekly review and auto-remediation.
Q: Patch Manager baselines?
A: Per OS; maintenance windows.
Q: CloudFront Functions vs Lambda@Edge?
A: Functions for light header logic; Edge for heavy.
Q: S3 CRR costs?
A: Per GB + requests; replicate only needed.
Q: EKS CNI tuning?
A: Warm IP targets to reduce latency.
Q: WAF tuning?
A: Managed rules with exclusions.
Q: Inspector alerts?
A: Hook to Incident Manager.
Q: SSM Automation?
A: Standardize fixes.
Q: Event buses?
A: Separate per domain.
Q: Gateway endpoints?
A: Reduce NAT costs.
Q: Lattice use?
A: Service-to-service across VPCs.
Q: RDS alarms?
A: CPU, connections, free storage.
Q: DynamoDB throttles?
A: Auto scaling and adaptive capacity.
Q: Final readiness?
A: Security, cost, DR, and monitoring green.