AWS Well-Architected Framework

Chapter 5: AWS Well-Architected Framework

Building Secure, High-Performing, Resilient, and Efficient Infrastructure

5.1 Overview

The AWS Well-Architected Framework helps you understand the trade-offs of the decisions you make while building systems on AWS, and it provides a consistent approach for evaluating architectures against established best practices.

                    Well-Architected Framework Pillars
+------------------------------------------------------------------+
|                                                                   |
|                    +------------------------+                     |
|                    |  Well-Architected      |                     |
|                    |      Framework         |                     |
|                    +------------------------+                     |
|                              |                                    |
|     +-----------+-----------+-----------+-----------+            |
|     |           |           |           |           |            |
|     v           v           v           v           v            |
| +--------+  +--------+  +--------+  +--------+  +--------+       |
| |Security|  |Reliabi-|  |Perform-|  |Cost    |  |Sustain-|       |
| |        |  |lity    |  |ance    |  |Optimiz-|  |ability |       |
| |        |  |        |  |        |  |ation   |  |        |       |
| +--------+  +--------+  +--------+  +--------+  +--------+       |
|                                                                   |
|    1. Security    2. Reliability    3. Performance Efficiency    |
|    4. Cost Optimization    5. Sustainability                     |
|                                                                   |
+------------------------------------------------------------------+

5.2 Pillar 1: Security

Security Design Principles

                    Security Pillar Principles
+------------------------------------------------------------------+
|                                                                   |
|    1. Implement a Strong Identity Foundation                      |
|    +----------------------------------------------------------+   |
|    |  - Centralize identity management                         |   |
|    |  - Use IAM for access control                             |   |
|    |  - Implement least privilege                              |   |
|    |  - Enforce MFA                                            |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Enable Traceability                                        |
|    +----------------------------------------------------------+   |
|    |  - Monitor and log all actions                            |   |
|    |  - Use CloudTrail for API auditing                        |   |
|    |  - Implement alerting                                     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Apply Security at All Layers                               |
|    +----------------------------------------------------------+   |
|    |  - Defense in depth                                       |   |
|    |  - Network security (VPC, NACLs, SGs)                     |   |
|    |  - Application security                                   |   |
|    |  - Data encryption                                        |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Automate Security Best Practices                           |
|    +----------------------------------------------------------+   |
|    |  - Use managed services                                   |   |
|    |  - Automated patching                                      |   |
|    |  - Security as code                                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Protect Data in Transit and at Rest                        |
|    +----------------------------------------------------------+   |
|    |  - Encryption everywhere                                  |   |
|    |  - TLS for transit                                        |   |
|    |  - KMS for key management                                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    6. Prepare for Security Events                                |
|    +----------------------------------------------------------+   |
|    |  - Incident response plan                                 |   |
|    |  - Threat detection (GuardDuty)                           |   |
|    |  - Regular security testing                                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

Security Architecture

                    Security Defense in Depth
+------------------------------------------------------------------+
|                                                                   |
|    Layer 1: Edge Security                                         |
|    +----------------------------------------------------------+   |
|    |  +----------+  +----------+  +----------+                |   |
|    |  |CloudFront|  |   WAF    |  |  Shield  |                |   |
|    |  |  (CDN)   |  |(Firewall)|  |  (DDoS)  |                |   |
|    |  +----------+  +----------+  +----------+                |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Layer 2: Network Security                                      |
|    +----------------------------------------------------------+   |
|    |  +----------+  +----------+  +----------+                |   |
|    |  |   VPC    |  |  NACLs   |  |Security  |                |   |
|    |  |          |  |          |  | Groups   |                |   |
|    |  +----------+  +----------+  +----------+                |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Layer 3: Compute Security                                      |
|    +----------------------------------------------------------+   |
|    |  +----------+  +----------+  +----------+                |   |
|    |  |   EC2    |  | Systems  |  |  Guard   |                |   |
|    |  |  (IAM)   |  | Manager  |  |  Duty    |                |   |
|    |  +----------+  +----------+  +----------+                |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Layer 4: Data Security                                         |
|    +----------------------------------------------------------+   |
|    |  +----------+  +----------+  +----------+                |   |
|    |  |   KMS    |  |   S3     |  |   RDS    |                |   |
|    |  |(Encrypt) |  |(Encrypt) |  |(Encrypt) |                |   |
|    |  +----------+  +----------+  +----------+                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
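
The network layer above can be checked mechanically. The following is a minimal sketch, assuming security group data shaped like the output of the EC2 DescribeSecurityGroups API (GroupId, IpPermissions, FromPort, ToPort, IpRanges). The function name, the sample groups, and the choice of ports 80 and 443 as the only acceptable public ports are illustrative assumptions, not AWS defaults.

```python
from typing import Dict, List

# Ports this sketch is willing to expose to the whole internet (assumption).
PUBLIC_PORTS = {80, 443}
ANYWHERE = "0.0.0.0/0"


def find_open_rules(security_groups: List[Dict]) -> List[str]:
    """Return a finding for every inbound rule that exposes a non-public port to 0.0.0.0/0."""
    findings = []
    for group in security_groups:
        for rule in group.get("IpPermissions", []):
            open_to_world = any(
                ip_range.get("CidrIp") == ANYWHERE
                for ip_range in rule.get("IpRanges", [])
            )
            if not open_to_world:
                continue
            # Rules without FromPort/ToPort cover all ports (IpProtocol "-1").
            low = rule.get("FromPort", 0)
            high = rule.get("ToPort", 65535)
            if not all(port in PUBLIC_PORTS for port in range(low, high + 1)):
                findings.append(
                    f"{group['GroupId']}: ports {low}-{high} open to {ANYWHERE}"
                )
    return findings


# Hypothetical sample data: a web tier group and a misconfigured database group.
sample_groups = [
    {"GroupId": "sg-web", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    ]},
    {"GroupId": "sg-db", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    ]},
]

print(find_open_rules(sample_groups))
```

Only the database group's port 5432 rule is reported here; the SSH rule is limited to a private CIDR range and the HTTPS rule is on the allowed list.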

Security Checklist

Question	Best Practice	AWS Service
How are you managing identities?	Centralized IAM, SSO	IAM, IAM Identity Center (formerly AWS SSO)
How are you controlling access?	Least privilege, MFA	IAM Policies
How are you protecting the network?	VPC, Security Groups	VPC, NACLs, SGs
How are you encrypting data?	Encryption at rest and in transit	KMS, ACM
How are you monitoring?	Logging, alerting	CloudTrail, CloudWatch
How are you responding to incidents?	Threat detection, automated response	GuardDuty, Security Hub
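
To make the "least privilege, MFA" row concrete, here is a minimal sketch that builds an IAM policy document granting read-only access to a single S3 bucket, and only when the caller authenticated with MFA. The helper name and the bucket name are hypothetical; the actions and the aws:MultiFactorAuthPresent condition key are standard IAM elements.

```python
import json


def read_only_bucket_policy(bucket_name: str, require_mfa: bool = True) -> dict:
    """Build an IAM policy document allowing read-only access to one S3 bucket."""
    statement = {
        "Sid": "ReadOnlyBucketAccess",
        "Effect": "Allow",
        # Only the two actions needed to list and read objects.
        "Action": ["s3:GetObject", "s3:ListBucket"],
        # ListBucket applies to the bucket ARN, GetObject to the object ARNs.
        "Resource": [
            f"arn:aws:s3:::{bucket_name}",
            f"arn:aws:s3:::{bucket_name}/*",
        ],
    }
    if require_mfa:
        # Allow the request only if the session was authenticated with MFA.
        statement["Condition"] = {"Bool": {"aws:MultiFactorAuthPresent": "true"}}
    return {"Version": "2012-10-17", "Statement": [statement]}


# "example-reports-bucket" is a placeholder name for illustration.
policy = read_only_bucket_policy("example-reports-bucket")
print(json.dumps(policy, indent=2))
```

The policy names specific actions and a specific bucket rather than using wildcards, which is the essence of least privilege.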

5.3 Pillar 2: Reliability

Reliability Design Principles

                    Reliability Pillar Principles
+------------------------------------------------------------------+
|                                                                   |
|    1. Automatically Recover from Failure                          |
|    +----------------------------------------------------------+   |
|    |  - Implement self-healing                                |   |
|    |  - Use Auto Scaling                                       |   |
|    |  - Multi-AZ deployments                                   |   |
|    |  - Health checks                                          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Test Recovery Procedures                                    |
|    +----------------------------------------------------------+   |
|    |  - Regular disaster recovery testing                      |   |
|    |  - Chaos engineering                                      |   |
|    |  - Game days                                              |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Scale Horizontally                                          |
|    +----------------------------------------------------------+   |
|    |  - Distribute load across resources                       |   |
|    |  - Avoid single points of failure                         |   |
|    |  - Use load balancers                                     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Stop Guessing Capacity                                      |
|    +----------------------------------------------------------+   |
|    |  - Use Auto Scaling                                        |   |
|    |  - Serverless where possible                              |   |
|    |  - Monitor and adjust                                     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Automate Change Management                                  |
|    +----------------------------------------------------------+   |
|    |  - Infrastructure as Code                               |   |
|    |  - Automated deployments                                   |   |
|    |  - Blue/green deployments                                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
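
The first principle, automatic recovery, amounts to continuously comparing desired capacity with healthy capacity and replacing the difference. The sketch below illustrates that reconciliation step in plain Python. It is not how Auto Scaling is implemented; the instance records, field names, and function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def plan_replacements(instances: List[Dict], desired_per_az: int) -> Dict[str, int]:
    """Return how many instances to launch in each AZ to restore desired capacity."""
    # Count only instances that currently pass their health check.
    healthy = Counter(i["az"] for i in instances if i["healthy"])
    zones = sorted({i["az"] for i in instances})
    return {
        az: desired_per_az - healthy[az]
        for az in zones
        if healthy[az] < desired_per_az
    }


# Hypothetical fleet: one failed instance in AZ "a", two in AZ "c".
fleet = [
    {"id": "i-1", "az": "a", "healthy": True},
    {"id": "i-2", "az": "a", "healthy": False},
    {"id": "i-3", "az": "b", "healthy": True},
    {"id": "i-4", "az": "b", "healthy": True},
    {"id": "i-5", "az": "c", "healthy": False},
    {"id": "i-6", "az": "c", "healthy": False},
]

print(plan_replacements(fleet, desired_per_az=2))
```

Running the plan per Availability Zone, rather than for the fleet as a whole, keeps capacity spread evenly so that one zone failure does not remove most of the fleet.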

High Availability Architecture

                    Multi-AZ High Availability
+------------------------------------------------------------------+
|                                                                   |
|                         Internet                                  |
|                            |                                      |
|                            v                                      |
|                    +---------------+                              |
|                    |  Route 53     |                              |
|                    |  (DNS)        |                              |
|                    +---------------+                              |
|                            |                                      |
|                            v                                      |
|                    +---------------+                              |
|                    |  CloudFront   |                              |
|                    |  (CDN)        |                              |
|                    +---------------+                              |
|                            |                                      |
|                            v                                      |
|            +-----------------------------------+                  |
|            |      Application Load Balancer    |                  |
|            |             (Multi-AZ)            |                  |
|            +-----------------------------------+                  |
|                |               |               |                  |
|                v               v               v                  |
|         +----------+    +----------+    +----------+              |
|         |   AZ-A   |    |   AZ-B   |    |   AZ-C   |              |
|         |          |    |          |    |          |              |
|         | +------+ |    | +------+ |    | +------+ |              |
|         | | EC2  | |    | | EC2  | |    | | EC2  | |              |
|         | | Fleet| |    | | Fleet| |    | | Fleet| |              |
|         | +------+ |    | +------+ |    | +------+ |              |
|         |          |    |          |    |          |              |
|         | +------+ |    | +------+ |    | +------+ |              |
|         | | RDS  | |    | | RDS  | |    | | RDS  | |              |
|         | |Primary| |   | |Replica| |   | |Replica| |              |
|         | +------+ |    | +------+ |    | +------+ |              |
|         +----------+    +----------+    +----------+              |
|                                                                   |
|        Availability: 99.99% (52.6 min downtime/year)             |
|                                                                   |
+------------------------------------------------------------------+
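
The downtime figure in the diagram follows directly from the availability percentage. A short sketch of the arithmetic is shown below, along with the standard formula for redundant components. The second function assumes failures are independent, which real Availability Zones only approximate; both function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    return (100.0 - availability_pct) / 100.0 * MINUTES_PER_YEAR


def parallel_availability(component_pct: float, copies: int) -> float:
    """Availability (%) of N redundant copies, assuming independent failures."""
    unavailable = (100.0 - component_pct) / 100.0
    # The system is down only if every copy is down at the same time.
    return (1.0 - unavailable ** copies) * 100.0


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year")

# Two independent copies that are each 99% available.
print(f"{parallel_availability(99.0, 2):.2f}%")
```

For 99.99% the first function gives about 52.6 minutes per year, matching the diagram. The second shows why spreading instances across zones raises availability: two copies at 99% each yield roughly 99.99% under the independence assumption.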

Disaster Recovery Patterns

                    Disaster Recovery Strategies
+------------------------------------------------------------------+
|                                                                   |
|    Strategy 1: Backup & Restore                                   |
|    +----------------------------------------------------------+   |
|    |  RPO: Hours              RTO: Hours                       |   |
|    |                                                          |   |
|    |  Primary Region              Backup Region                |   |
|    |  +----------+               +----------+                 |   |
|    |  |   App    |               |   S3     |                 |   |
|    |  |          | --backup----> | Backups  |                 |   |
|    |  |   DB     |               |          |                 |   |
|    |  +----------+               +----------+                 |   |
|    |                                   |                       |   |
|    |                                   v (restore)             |   |
|    |                            +----------+                   |   |
|    |                            |   App    |                   |   |
|    |                            |   DB     |                   |   |
|    |                            +----------+                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Strategy 2: Pilot Light                                        |
|    +----------------------------------------------------------+   |
|    |  RPO: Tens of minutes    RTO: Tens of minutes             |   |
|    |                                                          |   |
|    |  Primary Region              DR Region                    |   |
|    |  +----------+               +----------+                 |   |
|    |  |   App    |               |   DB     |                 |   |
|    |  |          | --repl------> |(Standby) |                 |   |
|    |  |   DB     |               |          |                 |   |
|    |  +----------+               +----------+                 |   |
|    |                                   |                       |   |
|    |                                   v (scale up)            |   |
|    |                            +----------+                   |   |
|    |                            |   App    |                   |   |
|    |                            +----------+                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Strategy 3: Warm Standby                                       |
|    +----------------------------------------------------------+   |
|    |  RPO: Minutes            RTO: Minutes                     |   |
|    |                                                          |   |
|    |  Primary Region              DR Region                    |   |
|    |  +----------+               +----------+                 |   |
|    |  |   App    |               |   App    |                 |   |
|    |  | (Full)   | --repl------> |(Scaled-  |                 |   |
|    |  |   DB     |               | down)    |                 |   |
|    |  +----------+               |   DB     |                 |   |
|    |                             +----------+                 |   |
|    |                                   |                       |   |
|    |                                   v (scale up)            |   |
|    |                            +----------+                   |   |
|    |                            |   App    |                   |   |
|    |                            | (Full)   |                   |   |
|    |                            +----------+                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Strategy 4: Multi-Region Active-Active                        |
|    +----------------------------------------------------------+   |
|    |  RPO: Real-time         RTO: Real-time                    |   |
|    |                                                          |   |
|    |  Region A                    Region B                     |   |
|    |  +----------+               +----------+                 |   |
|    |  |   App    |               |   App    |                 |   |
|    |  | (Active) | <----sync---> | (Active) |                 |   |
|    |  |   DB     |               |   DB     |                 |   |
|    |  +----------+               +----------+                 |   |
|    |                                                          |   |
|    |  Route 53 routes traffic to both regions                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
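
Choosing among the four strategies is a matter of finding the cheapest one whose recovery characteristics satisfy the workload's RPO and RTO targets. The sketch below encodes that selection. The numeric bounds attached to each strategy are illustrative assumptions chosen to match the rough "hours / minutes / real-time" ordering above; they are not AWS guarantees, and real values depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    max_rpo_minutes: float  # worst-case data loss assumed for this sketch
    max_rto_minutes: float  # worst-case recovery time assumed for this sketch


# Ordered from cheapest to most expensive. All bounds are assumptions.
STRATEGIES = [
    Strategy("Backup & Restore", 24 * 60, 24 * 60),
    Strategy("Pilot Light", 60, 60),
    Strategy("Warm Standby", 10, 10),
    Strategy("Multi-Region Active-Active", 0, 0),
]


def cheapest_strategy(rpo_minutes: float, rto_minutes: float) -> str:
    """Return the first (cheapest) strategy that meets both targets."""
    for strategy in STRATEGIES:
        if (strategy.max_rpo_minutes <= rpo_minutes
                and strategy.max_rto_minutes <= rto_minutes):
            return strategy.name
    # Nothing cheaper qualifies, so fall back to the strongest option.
    return STRATEGIES[-1].name


print(cheapest_strategy(rpo_minutes=24 * 60, rto_minutes=48 * 60))
print(cheapest_strategy(rpo_minutes=240, rto_minutes=240))
print(cheapest_strategy(rpo_minutes=15, rto_minutes=30))
print(cheapest_strategy(rpo_minutes=1, rto_minutes=1))
```

Ordering the list by cost means tighter targets push the choice toward more expensive strategies, which is the trade-off the diagram describes.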

Reliability Metrics

Metric	Definition	Target
RPO	Recovery Point Objective - max data loss	Minutes to hours
RTO	Recovery Time Objective - max downtime	Minutes to hours
MTTR	Mean Time To Recovery	Minimize
MTTF	Mean Time To Failure	Maximize
Availability	Uptime percentage	99.9% - 99.999%
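
MTTF, MTTR, and availability in the table are linked by the standard steady-state relationship: availability = MTTF / (MTTF + MTTR). A minimal sketch follows; the function name and the sample figures are illustrative only.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time to failure and to recovery."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# A system that fails on average every 999 hours and takes 1 hour to recover.
baseline = steady_state_availability(mttf_hours=999, mttr_hours=1)

# Halving recovery time improves availability without changing failure rate.
faster_recovery = steady_state_availability(mttf_hours=999, mttr_hours=0.5)

print(f"baseline: {baseline:.4%}")
print(f"faster recovery: {faster_recovery:.4%}")
```

This is why the table says to minimize MTTR and maximize MTTF: either change raises the ratio, and reducing recovery time is often the cheaper of the two.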

5.4 Pillar 3: Performance Efficiency

Performance Design Principles

                    Performance Pillar Principles
+------------------------------------------------------------------+
|                                                                   |
|    1. Democratize Advanced Technologies                           |
|    +----------------------------------------------------------+   |
|    |  - Use managed services                                   |   |
|    |  - Let AWS handle complexity                              |   |
|    |  - Focus on business logic                                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Go Global in Minutes                                        |
|    +----------------------------------------------------------+   |
|    |  - Deploy to multiple regions                             |   |
|    |  - Use CloudFront for global reach                        |   |
|    |  - Edge locations for low latency                         |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Use Serverless Architectures                               |
|    +----------------------------------------------------------+   |
|    |  - Lambda for compute                                     |   |
|    |  - DynamoDB for database                                  |   |
|    |  - S3 for storage                                         |   |
|    |  - No server management                                    |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Experiment More Often                                        |
|    +----------------------------------------------------------+   |
|    |  - Quick provisioning                                     |   |
|    |  - Test different configurations                          |   |
|    |  - A/B testing                                            |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Consider Mechanical Sympathy                                |
|    +----------------------------------------------------------+   |
|    |  - Choose right instance types                            |   |
|    |  - Optimize for workload                                  |   |
|    |  - Use appropriate storage types                          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

Performance Architecture Patterns

                    Performance Optimization Layers
+------------------------------------------------------------------+
|                                                                   |
|    Layer 1: Caching                                               |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Client Cache -> CDN Cache -> App Cache -> DB Cache       |   |
|    |      |           |            |            |              |   |
|    |      v           v            v            v              |   |
|    |  Browser    CloudFront    ElastiCache   RDS/DB            |   |
|    |                                                          |   |
|    |  Benefits:                                                |   |
|    |    - Reduced latency                                      |   |
|    |    - Lower database load                                   |   |
|    |    - Better user experience                               |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Layer 2: Compute Optimization                                  |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Workload Type          Recommended Service              |   |
|    |  +----------------+-------------------+                  |   |
|    |  | Web Servers    | EC2, ALB, ASG     |                  |   |
|    |  | API Backend    | Lambda, API GW    |                  |   |
|    |  | Batch Jobs     | Batch, Lambda     |                  |   |
|    |  | Containers     | ECS, EKS, Fargate |                  |   |
|    |  | ML/AI          | SageMaker         |                  |   |
|    |  +----------------+-------------------+                  |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Layer 3: Database Optimization                                 |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Data Pattern            Recommended Database            |   |
|    |  +----------------+-------------------+                  |   |
|    |  | Relational     | RDS, Aurora       |                  |   |
|    |  | Key-Value      | DynamoDB          |                  |   |
|    |  | Document       | DocumentDB        |                  |   |
|    |  | Graph          | Neptune           |                  |   |
|    |  | Time Series    | Timestream        |                  |   |
|    |  | In-Memory      | ElastiCache       |                  |   |
|    |  +----------------+-------------------+                  |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
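
The application cache in Layer 1 is commonly used with the cache-aside pattern: read from the cache first, fall back to the database on a miss, then populate the cache. The sketch below shows the pattern with a small in-process TTL cache standing in for a service such as ElastiCache. The class, the function names, and the fake database loader are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-process stand-in for an external cache (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def get_with_cache(key: str, cache: TTLCache,
                   load_from_db: Callable[[str], str]) -> str:
    """Cache-aside read: try the cache, fall back to the database, then populate."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)
        cache.set(key, value)
    return value


db_calls = []


def fake_db(key: str) -> str:
    """Hypothetical database lookup that records each call."""
    db_calls.append(key)
    return f"row-for-{key}"


cache = TTLCache(ttl_seconds=60)
get_with_cache("user:1", cache, fake_db)
get_with_cache("user:1", cache, fake_db)  # served from the cache
print(len(db_calls))
```

The second read never reaches the database, which is where the "lower database load" benefit comes from. The TTL bounds how stale a cached value can become.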

Performance Monitoring

                    Performance Monitoring Stack
+------------------------------------------------------------------+
|                                                                   |
|                    +------------------------+                     |
|                    |  CloudWatch Dashboard  |                     |
|                    +------------------------+                     |
|                              |                                    |
|        +---------------------+---------------------+              |
|        |                     |                     |              |
|        v                     v                     v              |
|  +----------+          +----------+          +----------+         |
|  | Metrics  |          |  Logs    |          |  Traces  |         |
|  |          |          |          |          |          |         |
|  |CloudWatch|          |CloudWatch|          |  X-Ray   |         |
|  | Metrics  |          |  Logs    |          |          |         |
|  +----------+          +----------+          +----------+         |
|        |                     |                     |              |
|        v                     v                     v              |
|  +----------+          +----------+          +----------+         |
|  | Alarms   |          | Insights |          | Service  |         |
|  |          |          |          |          |   Map    |         |
|  +----------+          +----------+          +----------+         |
|                                                                   |
|    Key Metrics to Monitor:                                        |
|    +----------------------------------------------------------+   |
|    |  - CPU Utilization                                        |   |
|    |  - Memory Utilization                                     |   |
|    |  - Disk I/O                                               |   |
|    |  - Network Throughput                                     |   |
|    |  - Request Latency                                        |   |
|    |  - Error Rates                                            |   |
|    |  - Queue Depth                                            |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
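
Two of the key metrics above, request latency and error rate, are usually tracked as percentiles and ratios rather than averages, because averages hide slow outliers. The sketch below computes p50, p99, and the error rate from raw request samples and checks them against alarm thresholds. The sample data, threshold values, and function names are illustrative assumptions; in practice CloudWatch computes these statistics for you.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests: List[Tuple[float, int]]) -> Dict[str, float]:
    """Summarize (latency_ms, http_status) samples into latency and error metrics."""
    latencies = [latency for latency, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(requests),
    }


# Hypothetical samples: mostly fast requests, one slow request, one server error.
samples = [(20, 200), (25, 200), (30, 200), (35, 200), (40, 200),
           (45, 200), (50, 200), (60, 200), (80, 500), (900, 200)]
stats = summarize(samples)

# Example alarm thresholds (assumptions, tune per workload).
alarms = {
    "latency": stats["p99_ms"] > 500,
    "errors": stats["error_rate"] > 0.05,
}
print(stats, alarms)
```

The median here looks healthy while the p99 does not, which is the case an average-based alarm would miss.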

5.5 Pillar 4: Cost Optimization

Cost Design Principles

                    Cost Optimization Pillar Principles
+------------------------------------------------------------------+
|                                                                   |
|    1. Implement Cloud Financial Management                        |
|    +----------------------------------------------------------+   |
|    |  - Establish cost awareness                               |   |
|    |  - Set budgets and alerts                                 |   |
|    |  - Regular cost reviews                                   |   |
|    |  - FinOps practices                                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Adopt a Consumption Model                                   |
|    +----------------------------------------------------------+   |
|    |  - Pay for what you use                                   |   |
|    |  - Scale up and down                                      |   |
|    |  - No upfront commitments for variable workloads          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Measure Overall Efficiency                                  |
|    +----------------------------------------------------------+   |
|    |  - Track business metrics                                 |   |
|    |  - Cost per transaction                                   |   |
|    |  - Cost per customer                                      |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Stop Spending Money on Undifferentiated Heavy Lifting       |
|    +----------------------------------------------------------+   |
|    |  - Use managed services                                   |   |
|    |  - Focus on competitive advantage                         |   |
|    |  - Let AWS manage infrastructure                          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Analyze and Attribute Expenditure                          |
|    +----------------------------------------------------------+   |
|    |  - Tag resources                                          |   |
|    |  - Cost allocation                                        |   |
|    |  - Chargeback/showback                                    |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
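
Principles 3 and 5 come down to simple aggregation: attribute each cost line to an owner via tags, then relate total cost to a business metric. The sketch below does both over hypothetical billing line items. The record layout, tag key, and figures are assumptions for illustration and do not reflect the actual Cost and Usage Report format.

```python
from collections import defaultdict
from typing import Dict, List


def cost_by_tag(line_items: List[Dict], tag_key: str) -> Dict[str, float]:
    """Sum cost per tag value; items missing the tag land in 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)


def cost_per_transaction(total_cost_usd: float, transactions: int) -> float:
    """Unit cost: a business-level efficiency metric."""
    if transactions <= 0:
        raise ValueError("transactions must be positive")
    return total_cost_usd / transactions


# Hypothetical monthly line items.
items = [
    {"service": "EC2", "cost_usd": 120.0, "tags": {"team": "checkout"}},
    {"service": "RDS", "cost_usd": 30.0, "tags": {"team": "checkout"}},
    {"service": "S3", "cost_usd": 45.5, "tags": {"team": "search"}},
    {"service": "EC2", "cost_usd": 12.25, "tags": {}},
]

by_team = cost_by_tag(items, "team")
print(by_team)
print(cost_per_transaction(by_team["checkout"], transactions=300_000))
```

The "untagged" bucket is worth tracking on its own: a growing untagged total means costs are escaping chargeback or showback.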

Cost Optimization Strategies

                    Cost Optimization Techniques
+------------------------------------------------------------------+
|                                                                   |
|    Technique 1: Right-Sizing                                      |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Over-provisioned         Right-sized                    |   |
|    |  +----------------+       +----------------+             |   |
|    |  | m5.2xlarge     |       | m5.large       |             |   |
|    |  | CPU: 15%       |  -->  | CPU: 60%       |             |   |
|    |  | Memory: 20%    |       | Memory: 80%    |             |   |
|    |  | Cost: $280/mo  |       | Cost: $70/mo   |             |   |
|    |  +----------------+       +----------------+             |   |
|    |                                                          |   |
|    |  Tools: Compute Optimizer, Cost Explorer                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Technique 2: Reserved Capacity                                 |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |    Pricing Model     Discount         Commitment         |   |
|    |  +-----------------+----------------+--------------+     |   |
|    |  | On-Demand       | 0%             | None         |     |   |
|    |  | RI (1 year)     | 30-40%         | 1 year       |     |   |
|    |  | RI (3 year)     | 50-60%         | 3 years      |     |   |
|    |  | Savings Plans   | Up to 72%      | 1 or 3 years |     |   |
|    |  +-----------------+----------------+--------------+     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Technique 3: Spot Instances                                   |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Use Case               Spot Discount                    |   |
|    |  +----------------+----------------+                     |   |
|    |  | Batch Jobs     | Up to 90% off  |                     |   |
|    |  | CI/CD          | Up to 90% off  |                     |   |
|    |  | Big Data       | Up to 90% off  |                     |   |
|    |  | Containerized  | Up to 90% off  |                     |   |
|    |  +----------------+----------------+                     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Technique 4: Storage Tiering                                   |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |    Data Age          Storage Tier      Cost (per month)  |   |
|    |  +-----------------+-----------------+-------------+     |   |
|    |  | Hot (0-30 days) | S3 Standard     | $0.023/GB   |     |   |
|    |  | Warm (30-90)    | S3 Standard-IA  | $0.0125/GB  |     |   |
|    |  | Cold (90-180)   | S3 Glacier      | $0.004/GB   |     |   |
|    |  | Archive (180+)  | S3 Glacier Deep | $0.00099/GB |     |   |
|    |  +-----------------+-----------------+-------------+     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
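
The storage tiering table in Technique 4 maps fairly directly onto an S3 lifecycle configuration. The sketch below only prints such a configuration; it assumes the age boundaries from the table (30, 90, and 180 days) and that "S3 Glacier" corresponds to the `GLACIER` storage class. The rule ID `tier-by-age` and the bucket name in the comment are hypothetical placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# lifecycle_json: print an S3 lifecycle configuration that mirrors the tiering
# table (30 days -> Standard-IA, 90 -> Glacier, 180 -> Glacier Deep Archive).
# The rule ID is a hypothetical name chosen for this sketch.
lifecycle_json() {
    cat <<'EOF'
{
  "Rules": [
    {
      "ID": "tier-by-age",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF
}

# Print the policy for review.
lifecycle_json

# To apply it (not run here; "my-example-bucket" is a placeholder):
#   lifecycle_json > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#       --bucket my-example-bucket \
#       --lifecycle-configuration file://lifecycle.json
```

Reviewing the generated JSON before applying it is deliberate: lifecycle transitions are hard to undo cheaply once objects have moved to archive tiers.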

5.6 Pillar 5: Sustainability

Sustainability Design Principles

                    Sustainability Pillar Principles
+------------------------------------------------------------------+
|                                                                   |
|    1. Understand Your Impact                                       |
|    +----------------------------------------------------------+   |
|    |  - Measure sustainability metrics                         |   |
|    |  - Track carbon footprint                                 |   |
|    |  - Set improvement goals                                  |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Establish Sustainability Goals                              |
|    +----------------------------------------------------------+   |
|    |  - Define targets                                         |   |
|    |  - Align with business objectives                         |   |
|    |  - Regular reviews                                        |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Maximize Utilization                                         |
|    +----------------------------------------------------------+   |
|    |  - Right-size resources                                   |   |
|    |  - Use serverless                                         |   |
|    |  - Optimize workload scheduling                           |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Anticipate and Adopt New Hardware                           |
|    +----------------------------------------------------------+   |
|    |  - Use latest instance generations                        |   |
|    |  - Leverage AWS efficiency improvements                   |   |
|    |  - Migrate to more efficient services                     |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Use Managed Services                                         |
|    +----------------------------------------------------------+   |
|    |  - AWS manages at scale                                   |   |
|    |  - Higher efficiency                                      |   |
|    |  - Shared infrastructure                                  |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    6. Reduce Downstream Impact                                     |
|    +----------------------------------------------------------+   |
|    |  - Optimize data transfer                                |   |
|    |  - Reduce storage requirements                            |   |
|    |  - Efficient algorithms                                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

Sustainability Best Practices

                    Sustainability Optimization
+------------------------------------------------------------------+
|                                                                   |
|    Compute Optimization                                           |
|    +----------------------------------------------------------+   |
|    |  - Use Graviton (ARM) instances: up to 60% less energy   |   |
|    |  - Opt for serverless (Lambda, Fargate)                   |   |
|    |  - Use Spot instances for batch workloads                 |   |
|    |  - Implement auto-scaling                                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Storage Optimization                                           |
|    +----------------------------------------------------------+   |
|    |  - Use S3 Intelligent-Tiering                            |   |
|    |  - Implement lifecycle policies                           |   |
|    |  - Compress data before storage                           |   |
|    |  - Delete unused snapshots                                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Network Optimization                                           |
|    +----------------------------------------------------------+   |
|    |  - Use CloudFront to reduce origin requests               |   |
|    |  - Implement caching                                      |   |
|    |  - Use VPC endpoints                                      |   |
|    |  - Optimize data transfer patterns                        |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Region Selection                                               |
|    +----------------------------------------------------------+   |
|    |  - Choose regions with lower carbon intensity             |   |
|    |  - Consider regions powered by renewable energy           |   |
|    |  - Balance latency with sustainability                    |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
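
The "compress data before storage" item is easy to check locally before committing to it. The following is a minimal sketch that measures how much gzip would save on a file without modifying it; the generated log content is a hypothetical, deliberately repetitive sample, so real data will usually compress less well.

```shell
#!/usr/bin/env bash
set -euo pipefail

# compressed_size FILE: print the size in bytes FILE would have after gzip,
# without modifying FILE.
compressed_size() {
    gzip -c "$1" | wc -c | tr -d ' '
}

# Build a hypothetical, highly repetitive log file to demonstrate the effect.
tmp=$(mktemp)
trap 'rm -f "$tmp"' EXIT
for _ in $(seq 1 1000); do
    echo "2026-01-01T00:00:00Z INFO request handled status=200"
done > "$tmp"

original=$(wc -c < "$tmp" | tr -d ' ')
compressed=$(compressed_size "$tmp")
echo "original=${original}B compressed=${compressed}B"
```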

5.7 Well-Architected Tool

Using the AWS Well-Architected Tool

                    Well-Architected Tool Workflow
+------------------------------------------------------------------+
|                                                                   |
|    Step 1: Define Workload                                        |
|    +----------------------------------------------------------+   |
|    |  - Name your workload                                     |   |
|    |  - Select region                                          |   |
|    |  - Define scope                                           |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Step 2: Answer Questions                                       |
|    +----------------------------------------------------------+   |
|    |  - Answer questions for each pillar                       |   |
|    |  - Provide evidence                                       |   |
|    |  - Note risks and improvements                            |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Step 3: Review Results                                         |
|    +----------------------------------------------------------+   |
|    |                                                          |   |
|    |  Illustrative pillar scores (the tool itself reports     |   |
|    |  high- and medium-risk issue counts, not scores):        |   |
|    |  +----------------+--------+                             |   |
|    |  | Security       | 85/100 |                             |   |
|    |  | Reliability    | 72/100 |  <-- Needs improvement     |   |
|    |  | Performance    | 90/100 |                             |   |
|    |  | Cost           | 65/100 |  <-- Needs improvement     |   |
|    |  | Sustainability | 78/100 |                             |   |
|    |  +----------------+--------+                             |   |
|    +----------------------------------------------------------+   |
|                              |                                    |
|                              v                                    |
|    Step 4: Create Improvement Plan                               |
|    +----------------------------------------------------------+   |
|    |  - Prioritize high-risk items                            |   |
|    |  - Create milestones                                      |   |
|    |  - Track progress                                         |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
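
Step 4's prioritization can be sketched as a filter over the Step 3 results. This sketch reuses the sample scores from the diagram; the threshold of 75 is an assumption chosen so that the two pillars marked "Needs improvement" are the ones flagged, and the "pillar score" input format is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# needs_improvement THRESHOLD: read "pillar score" lines on stdin and print the
# pillars scoring below THRESHOLD, lowest score first.
needs_improvement() {
    local threshold=$1
    awk -v t="$threshold" '$2 < t { print $1, $2 }' | sort -k2,2n
}

# Scores copied from the Step 3 diagram.
pillar_scores() {
    cat <<'EOF'
Security 85
Reliability 72
Performance 90
Cost 65
Sustainability 78
EOF
}

pillar_scores | needs_improvement 75
```

Sorting lowest first puts the weakest pillar at the top of the improvement plan.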

Sample Questions by Pillar

Pillar	Sample Question
Security	How are you protecting access to your workload?
Reliability	How does your workload handle failure?
Performance	How do you select your compute solution?
Cost	Do you have cost controls in place?
Sustainability	How do you track and measure sustainability?

5.8 Architecture Decision Records

Documenting Architecture Decisions

                    Architecture Decision Record Template
+------------------------------------------------------------------+
|                                                                   |
|    ADR-001: Use Multi-AZ RDS for Database High Availability       |
|                                                                   |
|    Status: Accepted                                               |
|                                                                   |
|    Context:                                                       |
|    +----------------------------------------------------------+   |
|    |  - Database tier must meet a 99.95% availability target  |   |
|    |  - Database is a critical component                      |   |
|    |  - Single-AZ RDS SLA is only 99.5%                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Decision:                                                      |
|    +----------------------------------------------------------+   |
|    |  - Deploy RDS in Multi-AZ configuration                  |   |
|    |  - Use synchronous replication                           |   |
|    |  - Automatic failover enabled                            |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Consequences:                                                  |
|    +----------------------------------------------------------+   |
|    |  Positive:                                               |   |
|    |    - Higher availability (99.95% SLA)                    |   |
|    |    - Automatic failover                                   |   |
|    |    - No manual intervention                               |   |
|    |                                                          |   |
|    |  Negative:                                               |   |
|    |    - Higher cost (~2x single AZ)                          |   |
|    |    - Slight write latency increase                        |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Alternatives Considered:                                       |
|    +----------------------------------------------------------+   |
|    |  1. Single AZ with read replicas - Lower availability     |   |
|    |  2. Self-managed database - Higher operational overhead   |   |
|    |  3. Multi-region - Higher cost, complexity               |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
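
The template above lends itself to a small generator so that every ADR starts from the same skeleton. This is a sketch only: the function name `new_adr`, the `.txt` extension, and the count-based numbering are assumptions made for illustration (numbering by file count would reuse an ID if an ADR file were deleted).

```shell
#!/usr/bin/env bash
set -euo pipefail

# new_adr DIR "Title": create the next numbered ADR skeleton in DIR and
# print the path of the new file.
new_adr() {
    local dir=$1 title=$2 count id file
    mkdir -p "$dir"
    # Count existing ADR files to pick the next sequence number.
    count=$(find "$dir" -maxdepth 1 -name 'ADR-*.txt' | wc -l)
    id=$(printf 'ADR-%03d' $((count + 1)))
    file="$dir/$id.txt"
    cat > "$file" <<EOF
$id: $title

Status: Proposed

Context:
  -

Decision:
  -

Consequences:
  Positive:
    -
  Negative:
    -

Alternatives Considered:
  1.
EOF
    echo "$file"
}

# Usage example in a throwaway directory.
adr_dir=$(mktemp -d)
adr_file=$(new_adr "$adr_dir" "Use Multi-AZ RDS for Database High Availability")
cat "$adr_file"
```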

5.9 Best Practices Summary

                    Well-Architected Best Practices
+------------------------------------------------------------------+
|                                                                   |
|    1. Regular Reviews                                             |
|    +----------------------------------------------------------+   |
|    |  - Conduct Well-Architected reviews quarterly            |   |
|    |  - Use AWS Well-Architected Tool                         |   |
|    |  - Document and track improvements                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Balance Pillars                                             |
|    +----------------------------------------------------------+   |
|    |  - Trade-offs between pillars are normal                 |   |
|    |  - Document decisions                                     |   |
|    |  - Align with business requirements                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Iterate                                                     |
|    +----------------------------------------------------------+   |
|    |  - Architecture evolves over time                         |   |
|    |  - Continuous improvement                                 |   |
|    |  - Learn from incidents                                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    4. Automate                                                    |
|    +----------------------------------------------------------+   |
|    |  - Infrastructure as Code                                 |   |
|    |  - Automated testing                                       |   |
|    |  - Automated deployments                                   |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    5. Measure                                                     |
|    +----------------------------------------------------------+   |
|    |  - Define metrics for each pillar                         |   |
|    |  - Set up monitoring and alerting                         |   |
|    |  - Regular reporting                                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

5.10 Why This Matters in DevOps/SRE

The Well-Architected Framework is your architecture decision compass. As a DevOps engineer or SRE, you will reference it during architecture reviews, incident postmortems, and capacity planning.

                    WAF in DevOps Daily Work
+------------------------------------------------------------------+
|                                                                   |
|    How Each Pillar Maps to DevOps Responsibilities:               |
|                                                                   |
|    Security        → IAM policies, secrets management, auditing  |
|    Reliability     → HA design, DR planning, chaos engineering   |
|    Performance     → Right-sizing, caching, load testing         |
|    Cost Optimiz.   → FinOps, reserved capacity, right-sizing    |
|    Sustainability  → Graviton adoption, efficient architectures  |
|                                                                   |
|    When to Apply WAF:                                             |
|    +----------------------------------------------------------+   |
|    |  - Before launching new services (design review)         |   |
|    |  - After incidents (postmortem analysis)                 |   |
|    |  - Quarterly architecture reviews                        |   |
|    |  - During major migrations or refactoring                |   |
|    |  - When optimizing recurring costs                       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

5.11 Linux Systems Perspective

Architecture Review Automation from Arch Linux

# Install the AWS CLI (it includes the `wellarchitected` commands), plus jq and bc, on Arch Linux
sudo pacman -S aws-cli-v2 jq bc

# List existing workload reviews with their high- and medium-risk counts
aws wellarchitected list-workloads \
    --query 'WorkloadSummaries[*].[WorkloadName,RiskCounts.HIGH,RiskCounts.MEDIUM]' \
    --output table

# Save the script below as /usr/local/bin/wa-quick-check.sh (chmod +x).
# The #!/bin/bash line must be the first line of that file.
#!/bin/bash
# Quick Well-Architected check for a deployment
set -euo pipefail

REGION=${1:-us-east-1}
echo "=== Quick Well-Architected Review — Region: $REGION ==="
echo ""

# RELIABILITY: Check Multi-AZ deployments
echo "--- Reliability: Multi-AZ Check ---"
AZ_DISTRIBUTION=$(aws ec2 describe-instances \
    --region "$REGION" \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].Placement.AvailabilityZone' \
    --output text | tr '\t' '\n' | sort | uniq -c | sort -rn)
echo "Instance distribution by AZ:"
echo "$AZ_DISTRIBUTION"

# RELIABILITY: Check for single-AZ RDS
echo ""
echo "--- Reliability: RDS Multi-AZ Check ---"
aws rds describe-db-instances \
    --region "$REGION" \
    --query 'DBInstances[?MultiAZ==`false`].[DBInstanceIdentifier,Engine,DBInstanceClass]' \
    --output table 2>/dev/null || echo "RDS query failed (check credentials and permissions)"

# SECURITY: Check for public S3 buckets
echo ""
echo "--- Security: Public S3 Buckets Check ---"
for bucket in $(aws s3api list-buckets --query 'Buckets[*].Name' --output text); do
    public=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null | \
        jq '.PublicAccessBlockConfiguration | all' 2>/dev/null || echo "UNKNOWN")
    if [ "$public" != "true" ]; then
        echo "⚠️  Bucket $bucket does not have all Block Public Access settings enabled"
    fi
done

# COST: Check for unattached EBS volumes
echo ""
echo "--- Cost: Unattached EBS Volumes ---"
aws ec2 describe-volumes \
    --region "$REGION" \
    --filters Name=status,Values=available \
    --query 'Volumes[*].[VolumeId,Size,VolumeType]' \
    --output table

# PERFORMANCE: Check instance utilization
echo ""
echo "--- Performance: Low CPU Instances (< 10% avg) ---"
for instance in $(aws ec2 describe-instances \
    --region "$REGION" \
    --filters Name=instance-state-name,Values=running \
    --query 'Reservations[*].Instances[*].InstanceId' \
    --output text); do
    cpu=$(aws cloudwatch get-metric-statistics \
        --region "$REGION" \
        --namespace AWS/EC2 \
        --metric-name CPUUtilization \
        --dimensions Name=InstanceId,Value="$instance" \
        --start-time "$(date -u -d '7 days ago' '+%Y-%m-%dT%H:%M:%SZ')" \
        --end-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
        --period 604800 \
        --statistics Average \
        --query 'Datapoints[0].Average' \
        --output text 2>/dev/null || echo "None")
    # "None" means no datapoints (or the query failed); skip those instances.
    if [ "$cpu" != "None" ] && (( $(echo "$cpu < 10" | bc -l 2>/dev/null || echo 0) )); then
        echo "⚠️  Instance $instance — Avg CPU: ${cpu}%"
    fi
done

5.12 Real-World Production Scenarios

Scenario: Post-Incident Architecture Review

                    WAF-Based Incident Postmortem
+------------------------------------------------------------------+
|                                                                   |
|    Incident: Database recovery took 15 minutes instead of ~2 min |
|                                                                   |
|    WAF Pillar Analysis:                                           |
|                                                                   |
|    Reliability:                                                   |
|    +----------------------------------------------------------+   |
|    |  ❌ RDS was NOT Multi-AZ (single AZ deployment)           |   |
|    |  ❌ No automated failover target                          |   |
|    |  ❌ Manual intervention required for recovery              |   |
|    |  ✅ Fix: Enable Multi-AZ, test failover quarterly         |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Performance:                                                   |
|    +----------------------------------------------------------+   |
|    |  ❌ No read replicas for read-heavy queries                |   |
|    |  ❌ Application reconnection logic had 60s timeout        |   |
|    |  ✅ Fix: Add read replica, reduce connection timeout      |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Security:                                                      |
|    +----------------------------------------------------------+   |
|    |  ✅ Encryption at rest was enabled                         |   |
|    |  ✅ No public accessibility                                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    Cost:                                                          |
|    +----------------------------------------------------------+   |
|    |  Multi-AZ adds ~$150/month                               |   |
|    |  Read replica adds ~$100/month                            |   |
|    |  vs. 15 min outage cost: ~$50,000 in lost revenue        |   |
|    |  ROI: Investment pays off in <1 incident                 |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
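
The cost comparison in the postmortem can be reproduced with simple arithmetic. This sketch uses the approximate figures from the Cost box; the function name `breakeven_months` is hypothetical, and the result only holds under the stated estimates.

```shell
#!/usr/bin/env bash
set -euo pipefail

# breakeven_months MONTHLY_COST OUTAGE_COST: print how many months of the
# added spend equal the cost of one outage.
breakeven_months() {
    awk -v m="$1" -v o="$2" 'BEGIN { printf "%.0f\n", o / m }'
}

multi_az=150        # ~$/month, from the Cost box
read_replica=100    # ~$/month, from the Cost box
outage=50000        # ~$ lost in the 15-minute incident

monthly=$((multi_az + read_replica))
echo "Added spend: \$${monthly}/month (\$$((monthly * 12))/year)"
echo "One outage pays for $(breakeven_months "$monthly" "$outage") months"
```

With these numbers the added spend is $250 per month, so a single avoided outage covers 200 months of the extra cost.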

5.13 Troubleshooting Guide

Issue	Cause	Solution
WAF Tool shows high-risk items	Missing best practices	Create improvement plan, prioritize by impact
Cannot create workload in WAF Tool	Missing IAM permissions	Grant `wellarchitected:CreateWorkload`, or attach the WellArchitectedConsoleFullAccess managed policy
Pillar scores not improving	Fixes not verified	Re-run the review: update the affected answers and save a milestone after implementing changes
Trade-off between pillars	Optimization in one affects another	Document trade-offs in ADRs (Architecture Decision Records)
Team not following best practices	No enforcement mechanism	Automate checks in CI/CD pipeline
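
The last row's "automate checks in CI/CD" can be sketched as a gate that fails a pipeline step when the quick-check script from section 5.11 prints warnings. This is a minimal sketch: the function names, the warning-marker convention, and the sample output are assumptions, and a real pipeline would feed in the actual script output instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# count_findings: count lines on stdin that start with the warning marker
# used by the quick-check script.
count_findings() {
    grep -c '^⚠' || true
}

# gate MAX: read check output on stdin; succeed only if findings <= MAX.
gate() {
    local max=$1 found
    found=$(count_findings)
    echo "findings=$found max=$max"
    [ "$found" -le "$max" ]
}

# Hypothetical captured output, standing in for: wa-quick-check.sh "$REGION"
sample_output() {
    cat <<'EOF'
--- Security: Public S3 Buckets Check ---
⚠️  Bucket example-logs may be public
--- Cost: Unattached EBS Volumes ---
EOF
}

if sample_output | gate 0; then
    echo "gate passed"
else
    echo "gate failed: fix findings or record the trade-off in an ADR"
fi
```

Allowing a non-zero maximum gives teams a way to adopt the gate gradually and then ratchet the limit down.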

5.14 Common Mistakes & Anti-Patterns

                    Architecture Anti-Patterns
+------------------------------------------------------------------+
|                                                                   |
|    ❌ Mistake 1: Reviewing Architecture Only Once                  |
|    +----------------------------------------------------------+   |
|    |  Problem: WAF review done at launch, never revisited     |   |
|    |  Impact: Architecture drift, accumulating tech debt      |   |
|    |  Fix: Quarterly reviews, after major incidents           |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 2: Optimizing for One Pillar Only                    |
|    +----------------------------------------------------------+   |
|    |  Problem: Over-optimizing cost at expense of reliability |   |
|    |  Impact: Single-AZ deployments, no backups               |   |
|    |  Fix: Balance all five pillars, document trade-offs      |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 3: No Disaster Recovery Testing                      |
|    +----------------------------------------------------------+   |
|    |  Problem: DR plan exists but never tested                |   |
|    |  Impact: Failover fails when actually needed             |   |
|    |  Fix: Quarterly Game Days, automated DR testing          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 4: Skipping Architecture Decision Records            |
|    +----------------------------------------------------------+   |
|    |  Problem: No documentation of why decisions were made    |   |
|    |  Impact: Repeated debates, knowledge loss on turnover    |   |
|    |  Fix: ADR for every significant architecture decision    |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

5.15 Interview Questions

Conceptual Questions

Q: Name the pillars of the Well-Architected Framework.
- A: The framework defines six pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability. This chapter covers the last five. Each pillar provides best practices and design principles for building well-architected systems.
Q: What’s the difference between RPO and RTO?
- A: RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time. RTO (Recovery Time Objective) is the maximum acceptable downtime. Example: RPO of 1 hour means you can lose at most 1 hour of data; RTO of 30 minutes means you must be back online within 30 minutes.
Q: Explain the four DR strategies in order of cost and recovery time.
- A: (1) Backup & Restore: cheapest, hours to recover. (2) Pilot Light: core infrastructure (such as replicated data) running, tens of minutes to provision and scale up the rest. (3) Warm Standby: scaled-down but functional copy running, minutes to scale to full capacity. (4) Multi-Site Active/Active: full copies serving traffic in multiple regions, near-zero downtime; most expensive.
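
The RPO/RTO definitions above can be turned into a quick check against measured values. This sketch uses the targets from the example answer (RPO of 1 hour, RTO of 30 minutes); the backup interval and restore time are hypothetical measurements, and it assumes that with periodic backups the worst-case data loss equals one full backup interval.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_objective NAME ACTUAL_MIN TARGET_MIN: print whether ACTUAL meets TARGET.
check_objective() {
    local name=$1 actual=$2 target=$3
    if [ "$actual" -le "$target" ]; then
        echo "$name met: ${actual}m <= ${target}m"
    else
        echo "$name violated: ${actual}m > ${target}m"
    fi
}

# Targets from the example in the answer.
rpo_target=60   # lose at most 1 hour of data
rto_target=30   # be back online within 30 minutes

# Hypothetical measurements for a backup-and-restore setup.
backup_interval=240   # minutes between backups = worst-case data loss
restore_time=20       # minutes measured in the last restore drill

check_objective RPO "$backup_interval" "$rpo_target"
check_objective RTO "$restore_time" "$rto_target"
```

Here the restore is fast enough for the RTO, but backups every four hours cannot satisfy a one-hour RPO, which is the kind of gap that pushes a design from Backup & Restore toward Pilot Light or Warm Standby.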

Scenario-Based Questions

Q: Your application needs 99.99% availability. How do you architect it?
- A: Multi-AZ deployment across at least 3 AZs. ALB for traffic distribution. Auto Scaling for self-healing. Multi-AZ RDS with automated failover. Route 53 health checks. Consider multi-region warm standby for DR. 99.99% allows only 52.6 minutes downtime per year.
Q: How do you choose between serverless and containers for a new service?
- A: Consider: (1) Request pattern — sporadic/bursty favors Lambda, sustained favors containers, (2) Execution time — Lambda has 15-min limit, (3) Cold start tolerance, (4) Cost at scale — Lambda is cheaper at low volumes, containers at high volumes, (5) Team expertise, (6) Vendor lock-in tolerance.
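
The downtime figure quoted for 99.99% availability follows from simple arithmetic, which is worth being able to reproduce for other targets. A minimal sketch, assuming a 365-day year (525,600 minutes); the function name is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# downtime_minutes_per_year AVAILABILITY_PERCENT: print the yearly downtime
# budget in minutes, assuming a 365-day year (525,600 minutes).
downtime_minutes_per_year() {
    awk -v a="$1" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * 525600 }'
}

for target in 99.9 99.95 99.99; do
    echo "$target% -> $(downtime_minutes_per_year "$target") min/year"
done
```

Each additional "nine" cuts the budget by a factor of ten, which is why the step from 99.9% to 99.99% usually forces automated failover rather than manual recovery.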

5.16 Exam Tips

Six Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, Sustainability (this chapter covers the last five)
Trade-offs: Understand how decisions affect multiple pillars
Design Principles: Know the principles for each pillar
Well-Architected Tool: Use for architecture reviews
RPO/RTO: Know the difference and how they affect DR strategy
Right-Sizing: Key for both cost and performance optimization
Defense in Depth: Security approach with multiple layers
Serverless: Often cost-effective for sporadic or bursty workloads; compare against containers for sustained high-volume traffic

Next Chapter

Chapter 6: Amazon EC2 - Deep Dive

Last Updated: March 2026
