
AWS Well-Architected Framework

Building Secure, High-Performing, Resilient, and Efficient Infrastructure

The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS, providing a consistent approach to evaluate architectures.

Well-Architected Framework Pillars
+------------------------------------------------------------------+
|                                                                  |
|                    +------------------------+                    |
|                    |    Well-Architected    |                    |
|                    |       Framework        |                    |
|                    +------------------------+                    |
|                                |                                 |
|                                v                                 |
|   1. Operational Excellence        4. Performance Efficiency     |
|   2. Security                      5. Cost Optimization          |
|   3. Reliability                   6. Sustainability             |
|                                                                  |
+------------------------------------------------------------------+

Security Pillar Principles
+------------------------------------------------------------------+
| |
| 1. Implement a Strong Identity Foundation |
| +----------------------------------------------------------+ |
| | - Centralize identity management | |
| | - Use IAM for access control | |
| | - Implement least privilege | |
| | - Enforce MFA | |
| +----------------------------------------------------------+ |
| |
| 2. Enable Traceability |
| +----------------------------------------------------------+ |
| | - Monitor and log all actions | |
| | - Use CloudTrail for API auditing | |
| | - Implement alerting | |
| +----------------------------------------------------------+ |
| |
| 3. Apply Security at All Layers |
| +----------------------------------------------------------+ |
| | - Defense in depth | |
| | - Network security (VPC, NACLs, SGs) | |
| | - Application security | |
| | - Data encryption | |
| +----------------------------------------------------------+ |
| |
| 4. Automate Security Best Practices |
| +----------------------------------------------------------+ |
| | - Use managed services | |
| | - Automated patching | |
| | - Security as code | |
| +----------------------------------------------------------+ |
| |
| 5. Protect Data in Transit and at Rest |
| +----------------------------------------------------------+ |
| | - Encryption everywhere | |
| | - TLS for transit | |
| | - KMS for key management | |
| +----------------------------------------------------------+ |
| |
| 6. Prepare for Security Events |
| +----------------------------------------------------------+ |
| | - Incident response plan | |
| | - Automated response (GuardDuty) | |
| | - Regular security testing | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Security Defense in Depth
+------------------------------------------------------------------+
| |
| Layer 1: Edge Security |
| +----------------------------------------------------------+ |
| | +----------+ +----------+ +----------+ | |
| | |CloudFront| | WAF | | Shield | | |
| | | (CDN) | |(Firewall)| | (DDoS) | | |
| | +----------+ +----------+ +----------+ | |
| +----------------------------------------------------------+ |
| | |
| v |
| Layer 2: Network Security |
| +----------------------------------------------------------+ |
| | +----------+ +----------+ +----------+ | |
| | | VPC | | NACLs | |Security | | |
| | | | | | | Groups | | |
| | +----------+ +----------+ +----------+ | |
| +----------------------------------------------------------+ |
| | |
| v |
| Layer 3: Compute Security |
| +----------------------------------------------------------+ |
| | +----------+ +----------+ +----------+ | |
| | | EC2 | | Systems | | Guard | | |
| | | (IAM) | | Manager | | Duty | | |
| | +----------+ +----------+ +----------+ | |
| +----------------------------------------------------------+ |
| | |
| v |
| Layer 4: Data Security |
| +----------------------------------------------------------+ |
| | +----------+ +----------+ +----------+ | |
| | | KMS | | S3 | | RDS | | |
| | |(Encrypt) | |(Encrypt) | |(Encrypt) | | |
| | +----------+ +----------+ +----------+ | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
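+-------------------------------------+-----------------------------------+---------------------------------------------+
| Question                            | Best Practice                     | AWS Service                                 |
+-------------------------------------+-----------------------------------+---------------------------------------------+
| How are you managing identities?    | Centralized IAM, SSO              | IAM, IAM Identity Center (formerly AWS SSO) |
| How are you controlling access?     | Least privilege, MFA              | IAM Policies                                |
| How are you protecting the network? | VPC, Security Groups              | VPC, NACLs, SGs                             |
| How are you encrypting data?        | Encryption at rest and in transit | KMS, ACM                                    |
| How are you monitoring?             | Logging, alerting                 | CloudTrail, CloudWatch                      |
| How are you responding?             | Automated response                | GuardDuty, Security Hub                     |
+-------------------------------------+-----------------------------------+---------------------------------------------+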
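
The "Enforce MFA" practice can be spot-checked from the IAM credential report. The following is a minimal sketch, assuming the report is a CSV whose header row contains columns named `user`, `password_enabled`, and `mfa_active`; the function name `users_without_mfa` is hypothetical and not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# users_without_mfa: read an IAM credential report (CSV) on stdin and print
# console users (password_enabled=true) whose mfa_active column is false.
# Columns are located by header name rather than by fixed position.
users_without_mfa() {
  awk -F',' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $col["password_enabled"] == "true" && $col["mfa_active"] == "false" {
      print $col["user"]
    }
  '
}

# Possible use (requires AWS credentials; the report can take a few
# seconds to become available after it is requested):
#   aws iam generate-credential-report >/dev/null
#   aws iam get-credential-report --query Content --output text \
#     | base64 -d | users_without_mfa
```

Looking columns up by name keeps the sketch working if the report gains or reorders columns.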

Reliability Pillar Principles
+------------------------------------------------------------------+
| |
| 1. Automatically Recover from Failure |
| +----------------------------------------------------------+ |
| | - Implement self-healing | |
| | - Use Auto Scaling | |
| | - Multi-AZ deployments | |
| | - Health checks | |
| +----------------------------------------------------------+ |
| |
| 2. Test Recovery Procedures |
| +----------------------------------------------------------+ |
| | - Regular disaster recovery testing | |
| | - Chaos engineering | |
| | - Game days | |
| +----------------------------------------------------------+ |
| |
| 3. Scale Horizontally |
| +----------------------------------------------------------+ |
| | - Distribute load across resources | |
| | - Avoid single points of failure | |
| | - Use load balancers | |
| +----------------------------------------------------------+ |
| |
| 4. Stop Guessing Capacity |
| +----------------------------------------------------------+ |
| | - Use Auto Scaling | |
| | - Serverless where possible | |
| | - Monitor and adjust | |
| +----------------------------------------------------------+ |
| |
| 5. Automate Change Management |
| +----------------------------------------------------------+ |
| | - Infrastructure as Code | |
| | - Automated deployments | |
| | - Blue/green deployments | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Multi-AZ High Availability
+------------------------------------------------------------------+
| |
| Internet |
| | |
| v |
| +---------------+ |
| | Route 53 | |
| | (DNS) | |
| +---------------+ |
| | |
| v |
| +---------------+ |
| | CloudFront | |
| | (CDN) | |
| +---------------+ |
| | |
| v |
| +-----------------------------------+ |
| | Application Load Balancer | |
| | (Multi-AZ) | |
| +-----------------------------------+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | AZ-A | | AZ-B | | AZ-C | |
| | | | | | | |
| | +------+ | | +------+ | | +------+ | |
| | | EC2 | | | | EC2 | | | | EC2 | | |
| | | Fleet| | | | Fleet| | | | Fleet| | |
| | +------+ | | +------+ | | +------+ | |
| | | | | | | |
| | +------+ | | +------+ | | +------+ | |
| | | RDS | | | | RDS | | | | RDS | | |
| | |Primary| | | |Replica| | | |Replica| | |
| | +------+ | | +------+ | | +------+ | |
| +----------+ +----------+ +----------+ |
| |
| Availability: 99.99% (52.6 min downtime/year) |
| |
+------------------------------------------------------------------+
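
The downtime figure in the diagram follows from simple arithmetic. A small sketch, assuming a 365.25-day year and, for the redundancy helper, fully independent failures (real Availability Zones only approximate this); both function names are hypothetical.

```shell
#!/usr/bin/env bash
# downtime_minutes: yearly downtime budget (minutes) for an availability
# percentage, using a 365.25-day year (525,960 minutes).
downtime_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 365.25 * 24 * 60 }'
}

# redundant_availability: availability (%) of N independent copies where
# the service is up if at least one copy is up: 1 - (1 - a)^N.
redundant_availability() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - (1 - a / 100) ^ n) * 100 }'
}

downtime_minutes 99.99          # 52.6
redundant_availability 99 2     # 99.9900
```
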
Disaster Recovery Strategies
+------------------------------------------------------------------+
| |
| Strategy 1: Backup & Restore |
| +----------------------------------------------------------+ |
| | RPO: Hours RTO: Hours | |
| | | |
| | Primary Region Backup Region | |
| | +----------+ +----------+ | |
| | | App | | S3 | | |
| | | | --backup----> | Backups | | |
| | | DB | | | | |
| | +----------+ +----------+ | |
| | | | |
| | v (restore) | |
| | +----------+ | |
| | | App | | |
| | | DB | | |
| | +----------+ | |
| +----------------------------------------------------------+ |
| |
| Strategy 2: Pilot Light |
| +----------------------------------------------------------+ |
| | RPO: Minutes RTO: Minutes | |
| | | |
| | Primary Region DR Region | |
| | +----------+ +----------+ | |
| | | App | | DB | | |
| | | | --repl------> |(Standby) | | |
| | | DB | | | | |
| | +----------+ +----------+ | |
| | | | |
| | v (scale up) | |
| | +----------+ | |
| | | App | | |
| | +----------+ | |
| +----------------------------------------------------------+ |
| |
| Strategy 3: Warm Standby |
| +----------------------------------------------------------+ |
| | RPO: Minutes RTO: Minutes | |
| | | |
| | Primary Region DR Region | |
| | +----------+ +----------+ | |
| | | App | | App | | |
| | | (Full) | --repl------> |(Scaled- | | |
| | | DB | | down) | | |
| | +----------+ | DB | | |
| | +----------+ | |
| | | | |
| | v (scale up) | |
| | +----------+ | |
| | | App | | |
| | | (Full) | | |
| | +----------+ | |
| +----------------------------------------------------------+ |
| |
| Strategy 4: Multi-Region Active-Active |
| +----------------------------------------------------------+ |
| | RPO: Real-time RTO: Real-time | |
| | | |
| | Region A Region B | |
| | +----------+ +----------+ | |
| | | App | | App | | |
| | | (Active) | <---sync---> | (Active) | | |
| | | DB | | DB | | |
| | +----------+ +----------+ | |
| | | |
| | Route 53 routes traffic to both regions | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
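+--------------+------------------------------------------+------------------+
| Metric       | Definition                               | Target           |
+--------------+------------------------------------------+------------------+
| RPO          | Recovery Point Objective - max data loss | Minutes to hours |
| RTO          | Recovery Time Objective - max downtime   | Minutes to hours |
| MTTR         | Mean Time To Recovery                    | Minimize         |
| MTTF         | Mean Time To Failure                     | Maximize         |
| Availability | Uptime percentage                        | 99.9% - 99.999%  |
+--------------+------------------------------------------+------------------+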
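
One way to turn RTO and RPO targets into a starting point for the four strategies is a simple threshold lookup. This is only a sketch: the function name `pick_dr_strategy` and the cut-offs (1, 15, and 60 minutes) are illustrative assumptions, not AWS guidance, and cost or compliance constraints may point elsewhere.

```shell
#!/usr/bin/env bash
# pick_dr_strategy RTO_MINUTES RPO_MINUTES
# Maps recovery objectives (whole minutes) to one of the four strategies.
# The cut-offs below are illustrative assumptions.
pick_dr_strategy() {
  local rto=$1 rpo=$2
  # The stricter (smaller) objective drives the choice.
  local tightest=$(( rto < rpo ? rto : rpo ))
  if   (( tightest <= 1 ));  then echo "multi-region-active-active"
  elif (( tightest <= 15 )); then echo "warm-standby"
  elif (( tightest <= 60 )); then echo "pilot-light"
  else                            echo "backup-and-restore"
  fi
}

pick_dr_strategy 240 1440   # backup-and-restore
pick_dr_strategy 30 5       # warm-standby
```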

Performance Pillar Principles
+------------------------------------------------------------------+
| |
| 1. Democratize Advanced Technologies |
| +----------------------------------------------------------+ |
| | - Use managed services | |
| | - Let AWS handle complexity | |
| | - Focus on business logic | |
| +----------------------------------------------------------+ |
| |
| 2. Go Global in Minutes |
| +----------------------------------------------------------+ |
| | - Deploy to multiple regions | |
| | - Use CloudFront for global reach | |
| | - Edge locations for low latency | |
| +----------------------------------------------------------+ |
| |
| 3. Use Serverless Architectures |
| +----------------------------------------------------------+ |
| | - Lambda for compute | |
| | - DynamoDB for database | |
| | - S3 for storage | |
| | - No server management | |
| +----------------------------------------------------------+ |
| |
| 4. Experiment More Often |
| +----------------------------------------------------------+ |
| | - Quick provisioning | |
| | - Test different configurations | |
| | - A/B testing | |
| +----------------------------------------------------------+ |
| |
| 5. Consider Mechanical Sympathy |
| +----------------------------------------------------------+ |
| | - Choose right instance types | |
| | - Optimize for workload | |
| | - Use appropriate storage types | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Performance Optimization Layers
+------------------------------------------------------------------+
| |
| Layer 1: Caching |
| +----------------------------------------------------------+ |
| | | |
| | Client Cache -> CDN Cache -> App Cache -> DB Cache | |
| | | | | | | |
| | v v v v | |
| | Browser CloudFront ElastiCache RDS/DB | |
| | | |
| | Benefits: | |
| | - Reduced latency | |
| | - Lower database load | |
| | - Better user experience | |
| +----------------------------------------------------------+ |
| |
| Layer 2: Compute Optimization |
| +----------------------------------------------------------+ |
| | | |
| | Workload Type Recommended Service | |
| | +----------------+-------------------+ | |
| | | Web Servers | EC2, ALB, ASG | | |
| | | API Backend | Lambda, API GW | | |
| | | Batch Jobs | Batch, Lambda | | |
| | | Containers | ECS, EKS, Fargate| | |
| | | ML/AI | SageMaker | | |
| | +----------------+-------------------+ | |
| +----------------------------------------------------------+ |
| |
| Layer 3: Database Optimization |
| +----------------------------------------------------------+ |
| | | |
| | Data Pattern Recommended Database | |
| | +----------------+-------------------+ | |
| | | Relational | RDS, Aurora | | |
| | | Key-Value | DynamoDB | | |
| | | Document | DocumentDB | | |
| | | Graph | Neptune | | |
| | | Time Series | Timestream | | |
| | | In-Memory | ElastiCache | | |
| | +----------------+-------------------+ | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Performance Monitoring Stack
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | CloudWatch Dashboard | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | Metrics | | Logs | | Traces | |
| | | | | | | |
| |CloudWatch| |CloudWatch| | X-Ray | |
| | Metrics | | Logs | | | |
| +----------+ +----------+ +----------+ |
| | | | |
| v v v |
| +----------+ +----------+ +----------+ |
| | Alarms | | Insights | | Service | |
| | | | | | Map | |
| +----------+ +----------+ +----------+ |
| |
| Key Metrics to Monitor: |
| +----------------------------------------------------------+ |
| | - CPU Utilization | |
| | - Memory Utilization | |
| | - Disk I/O | |
| | - Network Throughput | |
| | - Request Latency | |
| | - Error Rates | |
| | - Queue Depth | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
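
Request latency is usually tracked as a percentile rather than an average. The sketch below computes a nearest-rank percentile from raw samples, for example latencies extracted from an access log; the function name `percentile` and the one-value-per-line input format are assumptions.

```shell
#!/usr/bin/env bash
# percentile P: read one numeric value per line on stdin and print the
# P-th percentile using the nearest-rank method (rank = ceil(P * N / 100)).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1          # no samples
      rank = p * NR / 100
      idx = int(rank)
      if (idx < rank) idx++        # round up to the next whole rank
      if (idx < 1) idx = 1
      print v[idx]
    }
  '
}

# Example: p95 of latencies 1..100 ms
seq 1 100 | percentile 95     # 95
```
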

Cost Optimization Pillar Principles
+------------------------------------------------------------------+
| |
| 1. Implement Cloud Financial Management |
| +----------------------------------------------------------+ |
| | - Establish cost awareness | |
| | - Set budgets and alerts | |
| | - Regular cost reviews | |
| | - FinOps practices | |
| +----------------------------------------------------------+ |
| |
| 2. Adopt a Consumption Model |
| +----------------------------------------------------------+ |
| | - Pay for what you use | |
| | - Scale up and down | |
| | - No upfront commitments for variable workloads | |
| +----------------------------------------------------------+ |
| |
| 3. Measure Overall Efficiency |
| +----------------------------------------------------------+ |
| | - Track business metrics | |
| | - Cost per transaction | |
| | - Cost per customer | |
| +----------------------------------------------------------+ |
| |
| 4. Stop Spending Money on Undifferentiated Heavy Lifting |
| +----------------------------------------------------------+ |
| | - Use managed services | |
| | - Focus on competitive advantage | |
| | - Let AWS manage infrastructure | |
| +----------------------------------------------------------+ |
| |
| 5. Analyze and Attribute Expenditure |
| +----------------------------------------------------------+ |
| | - Tag resources | |
| | - Cost allocation | |
| | - Chargeback/showback | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Cost Optimization Techniques
+------------------------------------------------------------------+
| |
| Technique 1: Right-Sizing |
| +----------------------------------------------------------+ |
| | | |
| | Over-provisioned Right-sized | |
| | +----------------+ +----------------+ | |
| | | m5.2xlarge | | m5.large | | |
| | | CPU: 15% | --> | CPU: 60% | | |
| | | Memory: 20% | | Memory: 80% | | |
| | | Cost: $280/mo | | Cost: $70/mo | | |
| | +----------------+ +----------------+ | |
| | | |
| | Tools: Compute Optimizer, Cost Explorer | |
| +----------------------------------------------------------+ |
| |
| Technique 2: Reserved Capacity |
| +----------------------------------------------------------+ |
| | | |
| | Pricing Model Discount Commitment | |
| | +----------------+----------------+-------------+ | |
| | | On-Demand | 0% | None | | |
| | | RI (1 year) | 30-40% | 1 year | | |
| | | RI (3 year) | 50-60% | 3 years | | |
| | | Savings Plans | Up to 72% | 1-3 years | | |
| | +----------------+----------------+-------------+ | |
| +----------------------------------------------------------+ |
| |
| Technique 3: Spot Instances |
| +----------------------------------------------------------+ |
| | | |
| | Use Case Spot Discount | |
| | +----------------+----------------+ | |
| | | Batch Jobs | Up to 90% off | | |
| | | CI/CD | Up to 90% off | | |
| | | Big Data | Up to 90% off | | |
| | | Containerized | Up to 90% off | | |
| | +----------------+----------------+ | |
| +----------------------------------------------------------+ |
| |
| Technique 4: Storage Tiering |
| +----------------------------------------------------------+ |
| | | |
| | Data Age Storage Tier Cost | |
| | +----------------+----------------+-------------+ | |
| | | Hot (0-30 days) | S3 Standard | $0.023/GB | | |
| | | Warm (30-90) | S3 Standard-IA | $0.0125/GB | | |
| | | Cold (90-180) | S3 Glacier | $0.004/GB | | |
| | | Archive (180+) | S3 Glacier Deep | $0.00099/GB | | |
| | +----------------+----------------+-------------+ | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
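
The tiering and right-sizing numbers above can be checked with a few lines of arithmetic. A sketch, assuming the per-GB monthly prices from the storage tiering table (list prices vary by region and over time); the function and tier names are hypothetical.

```shell
#!/usr/bin/env bash
# storage_cost GB TIER: monthly storage cost in dollars, using the per-GB
# prices from the storage tiering table (illustrative list prices).
storage_cost() {
  local gb=$1 tier=$2 price
  case "$tier" in
    standard)     price=0.023   ;;
    standard-ia)  price=0.0125  ;;
    glacier)      price=0.004   ;;
    deep-archive) price=0.00099 ;;
    *) echo "unknown tier: $tier" >&2; return 1 ;;
  esac
  awk -v gb="$gb" -v p="$price" 'BEGIN { printf "%.2f\n", gb * p }'
}

# savings_percent OLD NEW: percentage saved when cost drops from OLD to NEW.
savings_percent() {
  awk -v o="$1" -v n="$2" 'BEGIN { printf "%.0f\n", (o - n) / o * 100 }'
}

storage_cost 1000 standard       # 23.00
storage_cost 1000 deep-archive   # 0.99
savings_percent 280 70           # 75 (the right-sizing example)
```
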

Sustainability Pillar Principles
+------------------------------------------------------------------+
| |
| 1. Understand Your Impact |
| +----------------------------------------------------------+ |
| | - Measure sustainability metrics | |
| | - Track carbon footprint | |
| | - Set improvement goals | |
| +----------------------------------------------------------+ |
| |
| 2. Establish Sustainability Goals |
| +----------------------------------------------------------+ |
| | - Define targets | |
| | - Align with business objectives | |
| | - Regular reviews | |
| +----------------------------------------------------------+ |
| |
| 3. Maximize Utilization |
| +----------------------------------------------------------+ |
| | - Right-size resources | |
| | - Use serverless | |
| | - Optimize workload scheduling | |
| +----------------------------------------------------------+ |
| |
| 4. Anticipate and Adopt New Hardware |
| +----------------------------------------------------------+ |
| | - Use latest instance generations | |
| | - Leverage AWS efficiency improvements | |
| | - Migrate to more efficient services | |
| +----------------------------------------------------------+ |
| |
| 5. Use Managed Services |
| +----------------------------------------------------------+ |
| | - AWS manages at scale | |
| | - Higher efficiency | |
| | - Shared infrastructure | |
| +----------------------------------------------------------+ |
| |
| 6. Reduce Downstream Impact |
| +----------------------------------------------------------+ |
| | - Optimize data transfer | |
| | - Reduce storage requirements | |
| | - Efficient algorithms | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Sustainability Optimization
+------------------------------------------------------------------+
| |
| Compute Optimization |
| +----------------------------------------------------------+ |
| | - Use Graviton (ARM) instances: up to 60% less energy | |
| | - Opt for serverless (Lambda, Fargate) | |
| | - Use Spot instances for batch workloads | |
| | - Implement auto-scaling | |
| +----------------------------------------------------------+ |
| |
| Storage Optimization |
| +----------------------------------------------------------+ |
| | - Use S3 Intelligent-Tiering | |
| | - Implement lifecycle policies | |
| | - Compress data before storage | |
| | - Delete unused snapshots | |
| +----------------------------------------------------------+ |
| |
| Network Optimization |
| +----------------------------------------------------------+ |
| | - Use CloudFront to reduce origin requests | |
| | - Implement caching | |
| | - Use VPC endpoints | |
| | - Optimize data transfer patterns | |
| +----------------------------------------------------------+ |
| |
| Region Selection |
| +----------------------------------------------------------+ |
| | - Choose regions with lower carbon intensity | |
| | - Consider regions powered by renewable energy | |
| | - Balance latency with sustainability | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Well-Architected Tool Workflow
+------------------------------------------------------------------+
| |
| Step 1: Define Workload |
| +----------------------------------------------------------+ |
| | - Name your workload | |
| | - Select region | |
| | - Define scope | |
| +----------------------------------------------------------+ |
| | |
| v |
| Step 2: Answer Questions |
| +----------------------------------------------------------+ |
| | - Answer questions for each pillar | |
| | - Provide evidence | |
| | - Note risks and improvements | |
| +----------------------------------------------------------+ |
| | |
| v |
| Step 3: Review Results |
| +----------------------------------------------------------+ |
| | | |
| | Pillar Scores (illustrative; the tool reports risk counts): | |
| | +----------------+--------+ | |
| | | Security | 85/100 | | |
| | | Reliability | 72/100 | <-- Needs improvement | |
| | | Performance | 90/100 | | |
| | | Cost | 65/100 | <-- Needs improvement | |
| | | Sustainability | 78/100 | | |
| | +----------------+--------+ | |
| +----------------------------------------------------------+ |
| | |
| v |
| Step 4: Create Improvement Plan |
| +----------------------------------------------------------+ |
| | - Prioritize high-risk items | |
| | - Create milestones | |
| | - Track progress | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
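+----------------+-------------------------------------------------+
| Pillar         | Sample Question                                 |
+----------------+-------------------------------------------------+
| Security       | How are you protecting access to your workload? |
| Reliability    | How does your workload handle failure?          |
| Performance    | How do you select your compute solution?        |
| Cost           | Do you have cost controls in place?             |
| Sustainability | How do you track and measure sustainability?    |
+----------------+-------------------------------------------------+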

Architecture Decision Record Template
+------------------------------------------------------------------+
| |
| ADR-001: Use Multi-AZ RDS for Database High Availability |
| |
| Status: Accepted |
| |
| Context: |
| +----------------------------------------------------------+ |
| | - Application requires 99.99% availability | |
| | - Database is critical component | |
| | - Single AZ deployment has 99.95% availability | |
| +----------------------------------------------------------+ |
| |
| Decision: |
| +----------------------------------------------------------+ |
| | - Deploy RDS in Multi-AZ configuration | |
| | - Use synchronous replication | |
| | - Automatic failover enabled | |
| +----------------------------------------------------------+ |
| |
| Consequences: |
| +----------------------------------------------------------+ |
| | Positive: | |
| | - Higher availability (99.99%) | |
| | - Automatic failover | |
| | - No manual intervention | |
| | | |
| | Negative: | |
| | - Higher cost (~2x single AZ) | |
| | - Slight write latency increase | |
| +----------------------------------------------------------+ |
| |
| Alternatives Considered: |
| +----------------------------------------------------------+ |
| | 1. Single AZ with read replicas - Lower availability | |
| | 2. Self-managed database - Higher operational overhead | |
| | 3. Multi-region - Higher cost, complexity | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
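
To make ADRs cheap to start, the template can be generated rather than copied by hand. A minimal sketch that prints a skeleton with the same sections as the template above; the function name `new_adr` and the example title are hypothetical.

```shell
#!/usr/bin/env bash
# new_adr NUMBER "TITLE": print an ADR skeleton with the same sections as
# the template above. Redirect the output to a file of your choice.
new_adr() {
  local number=$1 title=$2
  printf 'ADR-%03d: %s\n\n' "$number" "$title"
  printf 'Status: Proposed\n\n'
  local section
  for section in 'Context' 'Decision' 'Consequences' 'Alternatives Considered'; do
    printf '%s:\n- TODO\n\n' "$section"
  done
}

new_adr 2 "Use Auto Scaling for the web tier"
```
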

Well-Architected Best Practices
+------------------------------------------------------------------+
| |
| 1. Regular Reviews |
| +----------------------------------------------------------+ |
| | - Conduct Well-Architected reviews quarterly | |
| | - Use AWS Well-Architected Tool | |
| | - Document and track improvements | |
| +----------------------------------------------------------+ |
| |
| 2. Balance Pillars |
| +----------------------------------------------------------+ |
| | - Trade-offs between pillars are normal | |
| | - Document decisions | |
| | - Align with business requirements | |
| +----------------------------------------------------------+ |
| |
| 3. Iterate |
| +----------------------------------------------------------+ |
| | - Architecture evolves over time | |
| | - Continuous improvement | |
| | - Learn from incidents | |
| +----------------------------------------------------------+ |
| |
| 4. Automate |
| +----------------------------------------------------------+ |
| | - Infrastructure as Code | |
| | - Automated testing | |
| | - Automated deployments | |
| +----------------------------------------------------------+ |
| |
| 5. Measure |
| +----------------------------------------------------------+ |
| | - Define metrics for each pillar | |
| | - Set up monitoring and alerting | |
| | - Regular reporting | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

The Well-Architected Framework is your architecture decision compass. As a DevOps engineer or SRE, you’ll reference it during architecture reviews, incident postmortems, and capacity planning.

WAF in DevOps Daily Work
+------------------------------------------------------------------+
| |
| How Each Pillar Maps to DevOps Responsibilities: |
| |
| Security → IAM policies, secrets management, auditing |
| Reliability → HA design, DR planning, chaos engineering |
| Performance → Right-sizing, caching, load testing |
| Cost Optimiz. → FinOps, reserved capacity, right-sizing |
| Sustainability → Graviton adoption, efficient architectures |
| |
| When to Apply WAF: |
| +----------------------------------------------------------+ |
| | - Before launching new services (design review) | |
| | - After incidents (postmortem analysis) | |
| | - Quarterly architecture reviews | |
| | - During major migrations or refactoring | |
| | - When optimizing recurring costs | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Architecture Review Automation from Arch Linux

# Install the AWS CLI v2 plus jq and bc (used by the script) on Arch Linux
sudo pacman -S aws-cli-v2 jq bc

# List existing workload reviews
aws wellarchitected list-workloads \
  --query 'WorkloadSummaries[*].[WorkloadName,RiskCounts]' \
  --output table

# Save the script below as /usr/local/bin/wa-quick-check.sh
#!/bin/bash
# Quick Well-Architected check for a deployment
# Usage: wa-quick-check.sh [region]   (default: us-east-1)
set -euo pipefail

REGION=${1:-us-east-1}
echo "=== Quick Well-Architected Review (Region: $REGION) ==="
echo ""

# RELIABILITY: Show how running instances are spread across AZs.
# The query is flattened and tabs are split so each AZ lands on its own line.
echo "--- Reliability: Multi-AZ Check ---"
AZ_DISTRIBUTION=$(aws ec2 describe-instances \
  --region "$REGION" \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
  --output text | tr '\t' '\n' | sort | uniq -c | sort -rn)
echo "Instance distribution by AZ:"
echo "$AZ_DISTRIBUTION"

# RELIABILITY: List RDS instances that are not Multi-AZ
echo ""
echo "--- Reliability: RDS Multi-AZ Check ---"
aws rds describe-db-instances \
  --region "$REGION" \
  --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier,Engine,DBInstanceClass]' \
  --output table 2>/dev/null || echo "Could not list RDS instances (check permissions)"

# SECURITY: Flag buckets whose public access block is missing or partial
echo ""
echo "--- Security: Public S3 Buckets Check ---"
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
  # "true" only when all four public access block settings are enabled
  public=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null | \
    jq '.PublicAccessBlockConfiguration | all' 2>/dev/null || echo "UNKNOWN")
  if [ "$public" != "true" ]; then
    echo "⚠️ Bucket $bucket may be public (public access block not fully enabled)"
  fi
done

# COST: Check for unattached EBS volumes
echo ""
echo "--- Cost: Unattached EBS Volumes ---"
aws ec2 describe-volumes \
  --region "$REGION" \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,VolumeType]' \
  --output table

# PERFORMANCE: Check instance utilization over the last 7 days (GNU date)
echo ""
echo "--- Performance: Low CPU Instances (< 10% avg) ---"
START_TIME=$(date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%S')
END_TIME=$(date -u '+%Y-%m-%dT%H:%M:%S')
for instance in $(aws ec2 describe-instances \
  --region "$REGION" \
  --filters Name=instance-state-name,Values=running \
  --query 'Reservations[].Instances[].InstanceId' \
  --output text); do
  # Fall back to "None" so a failed call does not abort the script under set -e
  cpu=$(aws cloudwatch get-metric-statistics \
    --region "$REGION" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$instance" \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --period 604800 \
    --statistics Average \
    --query 'Datapoints[0].Average' \
    --output text 2>/dev/null || echo "None")
  # Only compare when a numeric datapoint came back; bc handles the float
  if [[ "$cpu" =~ ^[0-9]+(\.[0-9]+)?$ ]] && (( $(echo "$cpu < 10" | bc -l) )); then
    echo "⚠️ Instance $instance: Avg CPU: ${cpu}%"
  fi
done

Scenario: Post-Incident Architecture Review

WAF-Based Incident Postmortem
+------------------------------------------------------------------+
| |
| Incident: Database failover took 15 minutes instead of <1 min |
| |
| WAF Pillar Analysis: |
| |
| Reliability: |
| +----------------------------------------------------------+ |
| | ❌ RDS was NOT Multi-AZ (single AZ deployment) | |
| | ❌ No automated failover target | |
| | ❌ Manual intervention required for recovery | |
| | ✅ Fix: Enable Multi-AZ, test failover quarterly | |
| +----------------------------------------------------------+ |
| |
| Performance: |
| +----------------------------------------------------------+ |
| | ❌ No read replicas for read-heavy queries | |
| | ❌ Application reconnection logic had 60s timeout | |
| | ✅ Fix: Add read replica, reduce connection timeout | |
| +----------------------------------------------------------+ |
| |
| Security: |
| +----------------------------------------------------------+ |
| | ✅ Encryption at rest was enabled | |
| | ✅ No public accessibility | |
| +----------------------------------------------------------+ |
| |
| Cost: |
| +----------------------------------------------------------+ |
| | Multi-AZ adds ~$150/month | |
| | Read replica adds ~$100/month | |
| | vs. 15 min outage cost: ~$50,000 in lost revenue | |
| | ROI: Pays for itself by preventing a single outage | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
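
The cost comparison in the postmortem reduces to one division. A small sketch using the figures above; the function name `months_covered` is hypothetical.

```shell
#!/usr/bin/env bash
# months_covered OUTAGE_COST MONTHLY_COST: how many months of added spend
# one avoided outage of OUTAGE_COST would pay for (integer division).
months_covered() {
  local outage_cost=$1 monthly_cost=$2
  echo $(( outage_cost / monthly_cost ))
}

# Figures from the postmortem: $150 (Multi-AZ) + $100 (read replica)
# per month, against a ~$50,000 outage.
months_covered 50000 $(( 150 + 100 ))   # 200
```
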

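+------------------------------------+-------------------------------------+-------------------------------------------------------------+
| Issue                              | Cause                               | Solution                                                    |
+------------------------------------+-------------------------------------+-------------------------------------------------------------+
| WAF Tool shows high-risk items     | Missing best practices              | Create improvement plan, prioritize by impact               |
| Cannot create workload in WAF Tool | Missing IAM permissions             | Grant wellarchitected:* permissions                         |
| Pillar scores not improving        | Fixes not verified                  | Run WAF review after implementing changes                   |
| Trade-off between pillars          | Optimization in one affects another | Document trade-offs in ADRs (Architecture Decision Records) |
| Team not following best practices  | No enforcement mechanism            | Automate checks in CI/CD pipeline                           |
+------------------------------------+-------------------------------------+-------------------------------------------------------------+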
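
Automating checks in a CI/CD pipeline can be as simple as failing the build while any pillar still has high-risk items. This is a sketch under stated assumptions: the function name `fail_on_high_risk` is hypothetical, and the input is assumed to be tab-separated `pillar`, `high`, `medium` columns such as the text output of `aws wellarchitected get-lens-review` with the query shown in the comment.

```shell
#!/usr/bin/env bash
# fail_on_high_risk: read "pillar<TAB>high<TAB>medium" lines on stdin, print
# pillars that have high-risk items, and return non-zero if any were found.
fail_on_high_risk() {
  awk -F'\t' '
    $2 + 0 > 0 { printf "HIGH RISK: %s (%d items)\n", $1, $2; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}

# Possible input source (requires AWS credentials; WORKLOAD_ID is a placeholder):
#   aws wellarchitected get-lens-review \
#     --workload-id "$WORKLOAD_ID" --lens-alias wellarchitected \
#     --query 'LensReview.PillarReviewSummaries[].[PillarName,RiskCounts.HIGH,RiskCounts.MEDIUM]' \
#     --output text | fail_on_high_risk
```

Because the function only reads stdin, it can be exercised in a pipeline test without touching an AWS account.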

Architecture Anti-Patterns
+------------------------------------------------------------------+
| |
| ❌ Mistake 1: Reviewing Architecture Only Once |
| +----------------------------------------------------------+ |
| | Problem: WAF review done at launch, never revisited | |
| | Impact: Architecture drift, accumulating tech debt | |
| | Fix: Quarterly reviews, after major incidents | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 2: Optimizing for One Pillar Only |
| +----------------------------------------------------------+ |
| | Problem: Over-optimizing cost at expense of reliability | |
| | Impact: Single-AZ deployments, no backups | |
| | Fix: Balance all six pillars, document trade-offs | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 3: No Disaster Recovery Testing |
| +----------------------------------------------------------+ |
| | Problem: DR plan exists but never tested | |
| | Impact: Failover fails when actually needed | |
| | Fix: Quarterly Game Days, automated DR testing | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 4: Skipping Architecture Decision Records |
| +----------------------------------------------------------+ |
| | Problem: No documentation of why decisions were made | |
| | Impact: Repeated debates, knowledge loss when people leave| |
| | Fix: ADR for every significant architecture decision | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

  1. Q: Name the pillars of the Well-Architected Framework.

    • A: The framework defines six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. Each pillar provides best practices and design principles for building well-architected systems.
  2. Q: What’s the difference between RPO and RTO?

    • A: RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. RTO (Recovery Time Objective) is the maximum acceptable downtime. Example: RPO of 1 hour means you can lose at most 1 hour of data; RTO of 30 minutes means you must be back online within 30 minutes.
  3. Q: Explain the four DR strategies in order of cost and recovery time.

    • A: (1) Backup & Restore: cheapest, hours to recover. (2) Pilot Light: core infrastructure running, minutes to scale up. (3) Warm Standby: scaled-down copy running, minutes to scale to full. (4) Active-Active: full copies in multiple regions, near-zero downtime; most expensive.
  1. Q: Your application needs 99.99% availability. How do you architect it?

    • A: Multi-AZ deployment across at least 3 AZs. ALB for traffic distribution. Auto Scaling for self-healing. Multi-AZ RDS with automated failover. Route 53 health checks. Consider multi-region warm standby for DR. 99.99% allows only 52.6 minutes downtime per year.
  2. Q: How do you choose between serverless and containers for a new service?

    • A: Consider: (1) Request pattern — sporadic/bursty favors Lambda, sustained favors containers, (2) Execution time — Lambda has 15-min limit, (3) Cold start tolerance, (4) Cost at scale — Lambda is cheaper at low volumes, containers at high volumes, (5) Team expertise, (6) Vendor lock-in tolerance.

Exam Tip

  1. Six Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability
  2. Trade-offs: Understand how decisions affect multiple pillars
  3. Design Principles: Know the principles for each pillar
  4. Well-Architected Tool: Use for architecture reviews
  5. RPO/RTO: Know the difference and how they affect DR strategy
  6. Right-Sizing: Key for both cost and performance optimization
  7. Defense in Depth: Security approach with multiple layers
  8. Serverless: Often the best choice for performance and cost


Last Updated: March 2026
