AWS Global Infrastructure
Chapter 1: AWS Global Infrastructure
Understanding the Foundation of AWS
1.1 Overview
AWS operates a global cloud infrastructure built from Regions, Availability Zones, and edge locations, which lets organizations deploy applications close to end users while maintaining high availability and fault tolerance.
AWS Global Infrastructure Map (selected Regions)
| Geography | Region | Location | Availability Zones |
|---|---|---|---|
| North America | US-East-1 | N. Virginia | 6 |
| North America | US-West-1 | N. California | 3 |
| North America | US-West-2 | Oregon | 4 |
| Europe | EU-West-1 | Ireland | 3 |
| Europe | EU-Central-1 | Frankfurt | 3 |
| Europe | EU-West-2 | London | 3 |
| Asia Pacific | AP-Southeast-1 | Singapore | 3 |
| Asia Pacific | AP-Northeast-1 | Tokyo | 4 |
| Asia Pacific | AP-South-1 | Mumbai | 3 |
1.2 Key Components
Regions
A Region is a physical geographic location where AWS clusters data centers.
Region Architecture:
- An AWS Region contains multiple Availability Zones (for example AZ-a, AZ-b, and AZ-c).
- Each Availability Zone contains one or more data centers (illustrated with two each: DC-1 and DC-2 in AZ-a, DC-3 and DC-4 in AZ-b, DC-5 and DC-6 in AZ-c).
- AZs are:
  - Physically separated (km apart)
  - Connected via low-latency links
  - Isolated from failures in other AZs
Region Selection Criteria
| Factor | Description | Example |
|---|---|---|
| Latency | Choose region closest to users | Asia users -> AP-Southeast-1 |
| Cost | Prices vary by region | US-East-1 often cheapest |
| Compliance | Data residency requirements | EU data -> EU-West-1 |
| Service Availability | Not all services in all regions | New services often US first |
| Regulatory Isolation | Some workloads must run in isolated regions with restricted access (AWS SLAs are set per service, not per region) | GovCloud for US government |
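The factors in the table above can be combined into a simple shortlist filter. The following is a minimal sketch, not an AWS API: `RegionCandidate`, `choose_region`, and all latency and cost figures are hypothetical values you would replace with your own measurements. It treats compliance and service availability as hard constraints, then prefers the cheapest region that meets a latency budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionCandidate:
    """One row of a hypothetical region shortlist."""
    name: str
    latency_ms: float      # measured round-trip time from your users
    relative_cost: float   # 1.0 = cheapest candidate on the list
    compliant: bool        # satisfies your data residency rules
    has_services: bool     # every service you need is offered there


def choose_region(candidates, max_latency_ms):
    """Apply hard constraints first, then rank by cost and latency."""
    eligible = [c for c in candidates if c.compliant and c.has_services]
    if not eligible:
        raise ValueError("no candidate meets compliance and service needs")
    fast_enough = [c for c in eligible if c.latency_ms <= max_latency_ms]
    if fast_enough:
        # Within the latency budget, prefer the cheapest region.
        return min(fast_enough, key=lambda c: (c.relative_cost, c.latency_ms))
    # Nothing meets the budget: fall back to the lowest latency available.
    return min(eligible, key=lambda c: c.latency_ms)


# Illustrative numbers only, not real measurements or prices.
shortlist = [
    RegionCandidate("eu-west-1", 35.0, 1.10, True, True),
    RegionCandidate("eu-central-1", 28.0, 1.15, True, True),
    RegionCandidate("us-east-1", 95.0, 1.00, False, True),
]
best = choose_region(shortlist, max_latency_ms=50.0)
print(best.name)
```

Ordering the checks this way mirrors the table: a cheap region is never chosen if it fails compliance, and cost only breaks ties among regions that are already fast enough.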
Availability Zones (AZs)
An Availability Zone is one or more discrete data centers with redundant power, networking, and connectivity.
Availability Zone Deep Dive:
- Each physical data center in an AZ draws on redundant infrastructure:
  - Power: Grid A and Grid B
  - Cooling: System A and System B
  - Network: Provider A and Provider B
- The redundant infrastructure feeds thousands of server racks (Rack 1, Rack 2, Rack 3, ... Rack N).
AZ Best Practices
Multi-AZ Deployment Pattern
- Traffic path: Internet -> Route 53 / CloudFront -> Application Load Balancer -> EC2 application instances in AZ-A, AZ-B, and AZ-C.
- Data tier: RDS primary in AZ-A, replicated to RDS replicas in AZ-B and AZ-C.
- Benefits:
  - Fault tolerance (survive AZ failure)
  - High availability (99.99% uptime)
  - Disaster recovery built-in
Edge Locations
Edge Locations are endpoints for the AWS content delivery network (CloudFront) and DNS service (Route 53).
Edge Location Network
- Edge locations (for example New York, London, and Tokyo) connect over the AWS global network backbone to Regions such as us-east-1 and eu-west-1.
- Edge Locations:
  - 400+ locations globally
  - Lower latency for end users
  - Cache content closer to users
  - DNS resolution endpoints
1.3 AWS Global Network
AWS Global Network Architecture
- The network backbone links Regions to one another (Region A <=> Region B <=> Region C).
- Each Region hosts VPCs, and the same backbone connects them to edge locations (Edge Loc 1, Edge Loc 2).
- Features:
  - Private fiber network
  - Redundant paths
  - Low-latency inter-region connectivity
  - Automatic failover
1.4 Regional Services vs Global Services
Global Services (No Region Selection Required)
| Service | Purpose |
|---|---|
| IAM | Identity and Access Management |
| Route 53 | DNS Service |
| CloudFront | Content Delivery Network |
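| WAF | Web Application Firewall (global scope when attached to CloudFront; regional when attached to resources such as an ALB or API Gateway) |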
| AWS Organizations | Multi-account management |
| AWS Shield | DDoS protection |
Regional Services (Region Selection Required)
| Service | Purpose |
|---|---|
| EC2 | Virtual Machines |
| S3 | Object Storage (with regional buckets) |
| RDS | Relational Databases |
| Lambda | Serverless Computing |
| VPC | Virtual Private Cloud |
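One practical consequence of the two tables above is that global services are reached through a single endpoint, while regional services need a region in the hostname. The sketch below illustrates that with a hypothetical helper, `endpoint_for`. It is a simplification that assumes the standard `amazonaws.com` partition and covers only a few services; it is not how the SDKs resolve endpoints internally.

```python
# Services treated as global in this sketch (a small, illustrative subset).
GLOBAL_SERVICES = {"iam", "route53", "cloudfront"}


def endpoint_for(service, region=None):
    """Return an HTTPS endpoint URL for a service in the standard partition."""
    if service in GLOBAL_SERVICES:
        # Global services ignore the region argument entirely.
        return f"https://{service}.amazonaws.com"
    if region is None:
        raise ValueError(f"{service} is regional: a region is required")
    return f"https://{service}.{region}.amazonaws.com"


print(endpoint_for("iam"))
print(endpoint_for("ec2", "us-east-1"))
print(endpoint_for("rds", "eu-west-1"))
```

Calling `endpoint_for("ec2")` without a region raises an error, which is the same failure you hit in practice when a regional client has no region configured.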
Service Scope Diagram
- Global services (IAM, Route 53, CloudFront): configured once and replicated globally.
- Regional services: deployed separately in each Region, for example EC2 and RDS in Region A and again in Region B.
1.5 Choosing the Right Region
Decision Flowchart
Region Selection Decision Tree
1. Compliance required? If yes, select a compliant region (e.g., EU for GDPR). If no, continue.
2. Latency critical? If yes, select the region closest to your users. If no, continue.
3. Cost the primary factor? If yes, consider US-East-1 (often lowest). If no, continue.
4. Service available? If yes, any region works. If no, check the service page.
1.6 Infrastructure Security
Physical Security Layers
Data Center Physical Security
- Layer 1: Perimeter Security
  - Fencing and barriers
  - Security patrols
  - Video surveillance
- Layer 2: Building Access
  - Badge readers
  - Biometric scanners
  - Security personnel
- Layer 3: Data Center Floor
  - Mantraps (one person at a time)
  - Additional authentication
  - Motion sensors
- Layer 4: Equipment Access
  - Locked cabinets
  - Cage enclosures
  - Audit logging
1.7 High Availability Architecture Patterns
Pattern 1: Multi-AZ Deployment
Multi-AZ Architecture
- Traffic path: Internet -> Route 53 -> CloudFront -> Application Load Balancer -> EC2 instances in AZ-A, AZ-B, and AZ-C.
- Data tier: RDS main instance in AZ-A with standby instances in AZ-B and AZ-C.
- SLA: 99.99% availability
Pattern 2: Multi-Region Deployment
Multi-Region Architecture
- Route 53 (latency-based routing) directs users to US-EAST-1 (primary) or EU-WEST-1 (secondary).
- Each Region runs its own ALB and EC2 fleet.
- Data tier: RDS primary in US-EAST-1 with an RDS read replica in EU-WEST-1; S3 cross-region replication copies object data between the two.
- Availability target: 99.999% (a design goal for the architecture; AWS does not publish a general multi-region SLA at this level)
1.8 Key Metrics & SLAs
Service Level Agreements by Service
| Service | Monthly Uptime SLA | Annual Downtime Allowed |
|---|---|---|
| EC2 | 99.99% | ~52 minutes |
| S3 | 99.9% | ~8.7 hours |
| RDS Multi-AZ | 99.95% | ~4.4 hours |
| Lambda | 99.95% | ~4.4 hours |
| CloudFront | 99.9% | ~8.7 hours |
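The downtime column in the table above follows directly from the uptime percentage. As a minimal sketch, the hypothetical helper `allowed_downtime_minutes` below converts a percentage into a downtime budget; it assumes a 30-day month and a 365.25-day year, and a different day count shifts the annual figures slightly.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct, period_days):
    """Downtime budget in minutes for a given availability over a period."""
    total_minutes = period_days * MINUTES_PER_DAY
    return total_minutes * (1 - availability_pct / 100)


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly = allowed_downtime_minutes(pct, 30)
    yearly = allowed_downtime_minutes(pct, 365.25)
    print(f"{pct}%: {monthly:.2f} min/month, {yearly / 60:.2f} h/year")
```

Running the loop makes the cost of each extra nine concrete: every additional nine cuts the downtime budget by a factor of ten.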
Calculating Availability
Availability Calculation:
Availability = (Total Time - Downtime) / Total Time
Example: 99.99% availability
- Monthly: 30 days × 24 hours × 60 minutes = 43,200 minutes
- Allowed Downtime: 43,200 × (1 - 0.9999) = 4.32 minutes
Availability Tiers:
| Nines | Uptime | Annual Downtime |
|---|---|---|
| 2 | 99% | 3.65 days |
| 3 | 99.9% | 8.77 hours |
| 4 | 99.99% | 52.60 minutes |
| 5 | 99.999% | 5.26 minutes |
1.9 Practical Commands
AWS CLI - Region Operations

    # List regions enabled for your account
    # (add --all-regions to include opt-in regions that are disabled)
    aws ec2 describe-regions --query 'Regions[*].RegionName' --output table

    # List Availability Zones in a region
    aws ec2 describe-availability-zones \
      --region us-east-1 \
      --query 'AvailabilityZones[*].ZoneName' \
      --output table

    # Get current region
    aws configure get region

    # Set default region
    aws configure set region us-west-2

    # Note: this does not list edge locations. It lists the origin domain
    # names of the CloudFront distributions in your own account.
    aws cloudfront list-distributions \
      --query 'DistributionList.Items[*].Origins.Items[*].DomainName'

SDK Example (Python/boto3)

    import boto3

    # List all regions
    ec2 = boto3.client('ec2', region_name='us-east-1')
    regions = ec2.describe_regions()
    for region in regions['Regions']:
        print(f"Region: {region['RegionName']}, Endpoint: {region['Endpoint']}")

    # List AZs in a specific region
    ec2_us_east_1 = boto3.client('ec2', region_name='us-east-1')
    azs = ec2_us_east_1.describe_availability_zones()
    for az in azs['AvailabilityZones']:
        print(f"AZ: {az['ZoneName']}, State: {az['State']}")

1.10 Best Practices Summary
AWS Infrastructure Best Practices
1. Always deploy across multiple Availability Zones (for example, EC2 instances in AZ-A, AZ-B, and AZ-C of one Region).
2. Choose regions based on:
   - Latency to end users
   - Compliance requirements
   - Cost optimization
   - Service availability
3. Use CloudFront for global content delivery: Users -> CloudFront edge location -> Origin.
4. Implement disaster recovery across regions: Primary Region (active) -> Backup Region (active or passive).
5. Monitor infrastructure health:
   - Use AWS Health Dashboard
   - Set up CloudWatch alarms
   - Subscribe to AWS service alerts
1.11 Why This Matters in DevOps/SRE
Understanding AWS global infrastructure is not just theoretical knowledge: it directly affects the decisions you make as a DevOps engineer or SRE. Here’s why:
Impact on DevOps/SRE Roles
1. Deployment Strategy
   - Your CI/CD pipeline must know which region to deploy to
   - Multi-region deploys require region-aware automation
   - AZ-aware deployments are critical for HA
2. Incident Response
   - When an AZ goes down, you need to understand failover
   - Region outages require DR activation procedures
   - Edge location issues affect CDN and DNS resolution
3. Cost Management
   - Data transfer between AZs costs money
   - Cross-region replication has bandwidth costs
   - Region pricing varies; us-east-1 is often cheapest
4. Compliance & Data Residency
   - GDPR restricts transfers of EU personal data, so it is commonly kept in EU regions
   - Healthcare (HIPAA) needs specific region configurations
   - Government workloads may need GovCloud
1.12 Linux Systems Perspective
As a DevOps engineer working from an Arch Linux workstation, here’s how you interact with AWS infrastructure from your terminal:
Setting Up AWS CLI on Arch Linux

    # Install AWS CLI v2 on Arch Linux
    sudo pacman -S aws-cli-v2

    # Verify installation
    aws --version

    # Install additional useful tools
    sudo pacman -S jq            # JSON processor for AWS CLI output
    sudo pacman -S python-boto3  # Python AWS SDK
    sudo pacman -S curl          # HTTP client for API testing

    # Install aws-vault for secure credential management (from AUR)
    yay -S aws-vault

    # Configure AWS credentials
    aws configure
    # AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
    # AWS Secret Access Key: wJalrXUtnFEMI/...
    # Default region name: us-east-1
    # Default output format: json

Managing Multiple AWS Profiles

    # ~/.aws/config - Managing multiple environments
    cat ~/.aws/config

    [default]
    region = us-east-1
    output = json

    [profile staging]
    region = us-west-2
    output = json

    [profile production]
    region = us-east-1
    output = json
    role_arn = arn:aws:iam::PROD_ACCOUNT:role/DevOpsRole
    source_profile = default
    mfa_serial = arn:aws:iam::DEV_ACCOUNT:mfa/your-username

    # Use aws-vault for secure credential management
    aws-vault exec production -- aws ec2 describe-instances

    #!/bin/bash
    # Quick region check script (save as ~/bin/aws-region-check.sh)
    echo "=== AWS Region Latency Check ==="
    for region in us-east-1 us-west-2 eu-west-1 ap-south-1; do
      latency=$(curl -s -o /dev/null -w "%{time_total}" \
        "https://ec2.${region}.amazonaws.com/ping" 2>/dev/null)
      echo "Region: ${region} - Latency: ${latency}s"
    done

Monitoring AWS Infrastructure from Linux

    # Use systemd timer for periodic AWS health checks
    # /etc/systemd/system/aws-health-check.service
    [Unit]
    Description=AWS Infrastructure Health Check

    [Service]
    Type=oneshot
    ExecStart=/usr/local/bin/aws-health-check.sh
    User=devops

    # /etc/systemd/system/aws-health-check.timer
    [Unit]
    Description=Run AWS health check every 5 minutes

    [Timer]
    OnCalendar=*:0/5
    Persistent=true

    [Install]
    WantedBy=timers.target

    # Enable the timer
    sudo systemctl enable --now aws-health-check.timer

    # Check timer status
    systemctl list-timers --all | grep aws-health

Useful Linux CLI Patterns for AWS Region Operations

    # List all regions with their AZ count
    aws ec2 describe-regions --query 'Regions[*].RegionName' --output text | \
      tr '\t' '\n' | while read -r region; do
        az_count=$(aws ec2 describe-availability-zones \
          --region "$region" \
          --query 'length(AvailabilityZones)' --output text 2>/dev/null)
        echo "$region: $az_count AZs"
      done

    # Check which services are available in a region
    aws ssm get-parameters-by-path \
      --path /aws/service/global-infrastructure/regions/us-east-1/services \
      --query 'Parameters[*].Value' --output text | tr '\t' '\n' | sort

    # Approximate outbound traffic volume (using CloudWatch). NetworkOut counts
    # all bytes sent by EC2 instances, not only cross-region traffic, so use
    # Cost Explorer for the actual data transfer charges.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 \
      --metric-name NetworkOut \
      --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
      --period 3600 \
      --statistics Sum \
      --region us-east-1

1.13 Real-World Production Scenarios
Scenario 1: E-Commerce Platform - Region Selection
E-Commerce Multi-Region Setup
- Requirements:
  - Primary customers in India and Southeast Asia
  - Compliance: Indian data must stay in India (RBI guidelines)
  - RPO: 15 minutes, RTO: 30 minutes
- Solution:
  - Primary Region: ap-south-1 (Mumbai)
    - All customer data, order processing
    - 3 AZs for high availability
  - DR Region: ap-southeast-1 (Singapore)
    - Read replicas for databases (only for data that the residency rule allows to leave India)
    - S3 cross-region replication
    - Warm standby for critical services
  - Edge: CloudFront with 20+ edge locations in Asia
    - Static assets cached at edge
    - API acceleration via Global Accelerator
- Cost Impact:
  - Cross-region data transfer: ~$0.09/GB
  - Multi-AZ RDS: roughly double the single-AZ instance price
  - CloudFront: ~$0.085/GB for first 10TB in India
Scenario 2: AZ Failure - Incident Response
AZ Failure Response Playbook
- Detection:
  1. CloudWatch alarm fires: EC2 health checks failing
  2. ALB reports unhealthy targets in AZ-a
  3. AWS Health Dashboard shows AZ degradation
- Immediate Response (Automated):
  1. ALB automatically routes traffic to healthy AZs
  2. Auto Scaling launches replacements in other AZs
  3. RDS fails over to standby (if Multi-AZ)
- Manual Verification:
  1. Verify all services are running in remaining AZs
  2. Check database failover completed successfully
  3. Monitor error rates and latency
  4. Communicate status to stakeholders
- Post-Incident:
  1. Wait for AWS to resolve AZ issue
  2. Verify instances in affected AZ recover
  3. Rebalance capacity across all AZs
  4. Write incident postmortem
Scenario 3: Global SaaS - Multi-Region Active-Active
Active-Active Multi-Region Architecture
- Users worldwide -> Route 53 (latency-based) -> US-EAST-1 (Virginia), EU-WEST-1 (Ireland), or AP-NORTHEAST-1 (Tokyo).
- Each Region holds a replica of a DynamoDB Global Table, and the replicas replicate to one another.
- Key Design Decisions:
  - DynamoDB Global Tables for multi-master writes
  - Each region handles its local traffic independently
  - Conflict resolution via last-writer-wins
  - Regional S3 buckets with cross-region replication
1.14 Operational Considerations
Capacity Planning
Regional Capacity Planning
Questions to Answer Before Deployment:
1. How many AZs do we need?
   - Minimum 2 for HA, recommended 3 for production
   - Cost: ~10-15% more per additional AZ
2. Do we need multi-region?
   - Latency requirements (>200ms = consider multi-region)
   - Compliance requirements
   - DR requirements (RPO/RTO)
3. What's our data transfer budget?
   - Same AZ: Free
   - Cross-AZ: $0.01/GB (each direction)
   - Cross-region: $0.02-0.09/GB
   - Internet egress: $0.09/GB (first 10TB)
4. Service limits per region:
   - EC2: 5 Elastic IPs (default); On-Demand instance limits are now vCPU-based quotas rather than the older 20-instance default
   - VPC: 5 VPCs per region (default)
   - Request limit increases BEFORE you need them
On-Call Runbook: Infrastructure Issues

    #!/bin/bash
    # Quick diagnostic script for on-call engineers
    set -euo pipefail

    REGION=${1:-us-east-1}
    echo "=== AWS Infrastructure Health Check - Region: $REGION ==="
    echo "Timestamp: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
    echo ""

    # Check AWS service health (the Health API requires a Business or higher
    # support plan and is called through us-east-1)
    echo "--- Service Health ---"
    aws health describe-events \
      --filter '{"regions":["'"$REGION"'"],"eventStatusCodes":["open","upcoming"]}' \
      --query 'events[*].[service,eventTypeCode,statusCode]' \
      --output table --region us-east-1 2>/dev/null || echo "No active events"

    # Check AZ status
    echo ""
    echo "--- Availability Zone Status ---"
    aws ec2 describe-availability-zones \
      --region "$REGION" \
      --query 'AvailabilityZones[*].[ZoneName,State,ZoneType]' \
      --output table

    # Check running instances by AZ
    echo ""
    echo "--- Instance Distribution by AZ ---"
    aws ec2 describe-instances \
      --region "$REGION" \
      --filters "Name=instance-state-name,Values=running" \
      --query 'Reservations[*].Instances[*].[Placement.AvailabilityZone]' \
      --output text | sort | uniq -c | sort -rn

    # Check recent CloudTrail events for infrastructure changes
    echo ""
    echo "--- Recent Infrastructure Changes (last 1 hour) ---"
    aws cloudtrail lookup-events \
      --region "$REGION" \
      --start-time "$(date -u -d '1 hour ago' '+%Y-%m-%dT%H:%M:%SZ')" \
      --lookup-attributes AttributeKey=EventName,AttributeValue=RunInstances \
      --query 'Events[*].[EventTime,Username,EventName]' \
      --output table 2>/dev/null || echo "No recent events"

1.15 Troubleshooting Guide
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Cannot launch instance in AZ | AZ capacity constraints | Try another AZ in the same region |
| High latency to region | Geographic distance | Use CloudFront or choose closer region |
| Service not available in region | Not all services are offered in every region | Check service availability page, use a supported region |
| Cross-region replication lag | Network congestion | Monitor replication metrics, consider async patterns |
| Region failover not working | DNS TTL too high | Reduce Route 53 TTL before DR events |
| Hitting service limits | Default quotas reached | Request limit increase via Service Quotas |
Debugging Region/AZ Issues from Linux

    # Test connectivity to a specific region endpoint
    curl -s -o /dev/null -w "HTTP Status: %{http_code}\nTime: %{time_total}s\n" \
      https://ec2.us-east-1.amazonaws.com

    # Check DNS resolution for AWS endpoints
    dig +short ec2.us-east-1.amazonaws.com

    # Test network path to AWS region
    traceroute ec2.us-east-1.amazonaws.com

    # Check the state of each AZ (AZs have no public endpoints of their own,
    # so this asks the regional EC2 API)
    for az in a b c d e f; do
      echo -n "us-east-1${az}: "
      aws ec2 describe-availability-zones \
        --zone-names "us-east-1${az}" \
        --query 'AvailabilityZones[0].State' \
        --output text --region us-east-1 2>/dev/null || echo "N/A"
    done

    # Check your current service quotas for a region
    aws service-quotas list-service-quotas \
      --service-code ec2 \
      --region us-east-1 \
      --query 'Quotas[?QuotaName==`Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances`].[QuotaName,Value]' \
      --output table

1.16 Common Mistakes & Anti-Patterns
Infrastructure Anti-Patterns
- ❌ Mistake 1: Single-AZ Deployments in Production
  - Problem: All resources in one AZ
  - Impact: Complete outage if AZ fails
  - Fix: Always deploy across ≥2 AZs for production
- ❌ Mistake 2: Ignoring Data Transfer Costs
  - Problem: Services chatting across AZs/regions
  - Impact: Unexpected bills, which can reach $1000s/month
  - Fix: Co-locate dependent services, use VPC endpoints
- ❌ Mistake 3: Not Requesting Limit Increases Early
  - Problem: Hit EC2 instance limits during traffic spike
  - Impact: Cannot scale when you need it most
  - Fix: Request increases 2-4 weeks before expected growth
- ❌ Mistake 4: Hardcoding Region/AZ in Application Code
  - Problem: Region references hardcoded in configs
  - Impact: Cannot fail over or migrate to another region
  - Fix: Use instance metadata service or SSM parameters
- ❌ Mistake 5: Not Testing DR Failover
  - Problem: DR plan exists on paper but never tested
  - Impact: When disaster strikes, failover doesn't work
  - Fix: Schedule quarterly DR drills (Game Days)
1.17 Interview Questions
Conceptual Questions
- Q: What’s the difference between a Region, AZ, and Edge Location?
  - A: A Region is a geographic area with multiple AZs. An AZ is one or more data centers with independent power/cooling/networking within a region. Edge Locations are CDN endpoints used by CloudFront and Route 53 for low-latency content delivery; there are 400+ of them versus 30+ regions.
- Q: How do you decide which AWS region to deploy in?
  - A: Consider: (1) Latency to end users, (2) Compliance/data residency requirements, (3) Service availability in that region, (4) Cost, since prices vary by region, (5) DR strategy, since you may need a secondary region.
- Q: What happens when an AZ goes down?
  - A: If properly architected (Multi-AZ), the ALB stops routing to unhealthy targets, Auto Scaling launches instances in healthy AZs, and RDS fails over to standby. For poorly architected systems (single-AZ), it’s a full outage.
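The AZ-failure answer above relies on Auto Scaling launching replacements, which takes time. A common way to ride out that window is to size each AZ so the survivors can carry peak load on their own. The sketch below works through that arithmetic; `instances_per_az` and `overprovision_ratio` are hypothetical helper names, and the model assumes traffic is spread evenly across AZs.

```python
import math


def instances_per_az(peak_instances, az_count, az_failures_tolerated=1):
    """Instances to run in each AZ so the surviving AZs still cover peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_instances / surviving)


def overprovision_ratio(peak_instances, az_count, az_failures_tolerated=1):
    """Total fleet size relative to what peak load alone would require."""
    per_az = instances_per_az(peak_instances, az_count, az_failures_tolerated)
    return per_az * az_count / peak_instances


for azs in (2, 3, 4):
    per_az = instances_per_az(12, azs)
    ratio = overprovision_ratio(12, azs)
    print(f"{azs} AZs: {per_az} instances per AZ, {ratio:.2f}x peak capacity")
```

Under these assumptions, spreading the same peak load over more AZs lowers the standby overhead, which is one reason three AZs is often preferred to two for production.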
Scenario-Based Questions
Section titled “Scenario-Based Questions”-
Q: Your application has users in India and the US. How would you architect it?
- A: Use Route 53 latency-based routing to direct users to the nearest region (ap-south-1 for India, us-east-1 for US). Deploy identical application stacks in both regions. Use DynamoDB Global Tables or Aurora Global Database for data replication. Cache static content with CloudFront.
-
Q: You’re getting intermittent 503 errors during peak hours. Your instances are all in one AZ. What do you do?
- A: Immediate: Scale out within the current AZ. Short-term: Distribute instances across multiple AZs behind an ALB. Long-term: Implement Auto Scaling with multi-AZ deployment, set up CloudWatch alarms for early warning, and request service limit increases.
-
Q: How would you estimate data transfer costs for a multi-region deployment?
- A: Map all data flows: (1) Cross-AZ traffic within each region (
$0.01/GB), (2) Cross-region replication ($0.02-0.09/GB), (3) Internet egress (~$0.09/GB for first 10TB), (4) CloudFront egress (varies by edge location). Use AWS Cost Explorer or the Pricing Calculator.
- A: Map all data flows: (1) Cross-AZ traffic within each region (
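The data-flow mapping in the last answer can be turned into a rough estimate. The sketch below is illustrative only: `monthly_transfer_cost` is a hypothetical helper, and the rates are the approximate figures quoted in this chapter (using the low end of the cross-region range), not current AWS prices. Actual rates vary by region pair and volume tier.

```python
# Approximate USD per GB, taken from the figures quoted in this chapter.
RATES_USD_PER_GB = {
    "cross_az": 0.01,         # billed in each direction
    "cross_region": 0.02,     # low end of the 0.02-0.09 range
    "internet_egress": 0.09,  # first 10 TB tier
}


def monthly_transfer_cost(cross_az_gb, cross_region_gb, egress_gb,
                          rates=RATES_USD_PER_GB):
    """Return a per-flow cost breakdown plus the total, in USD."""
    breakdown = {
        # Cross-AZ traffic is charged on both the sending and receiving side.
        "cross_az": cross_az_gb * rates["cross_az"] * 2,
        "cross_region": cross_region_gb * rates["cross_region"],
        "internet_egress": egress_gb * rates["internet_egress"],
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = monthly_transfer_cost(cross_az_gb=5000, cross_region_gb=1000,
                                 egress_gb=2000)
for flow, usd in estimate.items():
    print(f"{flow}: ${usd:,.2f}")
```

Keeping the rates in a dictionary makes it easy to swap in the figures for your own region pair before comparing the result against Cost Explorer.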
1.18 Exam Tips
- Regions vs AZs: Regions are geographic areas; AZs are isolated locations within regions
- Global Services: IAM, Route 53, and CloudFront are global, with no region selection needed; WAF is global only when attached to CloudFront
- Multi-AZ: Always use multiple AZs for production workloads
- SLA Math: Know how to calculate allowed downtime from availability percentage
- Edge Locations: Used by CloudFront and Route 53, not for compute
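For the SLA math and Multi-AZ tips above, it can help to see how availability combines across components. The sketch below uses the standard series and parallel formulas; `serial_availability` and `parallel_availability` are hypothetical helper names, and the parallel formula assumes the copies fail independently, which real AZs only approximate.

```python
def serial_availability(*components):
    """All components must be up: multiply their availabilities."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(single, copies):
    """At least one of several independent copies must be up."""
    return 1 - (1 - single) ** copies


# Two independent copies of a 99% component.
print(f"{parallel_availability(0.99, 2):.6f}")

# A load balancer (99.99%) in front of that redundant pair.
print(f"{serial_availability(0.9999, parallel_availability(0.99, 2)):.6f}")
```

The takeaway matches the tips: redundancy across AZs raises availability, while every extra dependency placed in series lowers it.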
Next Chapter
Chapter 2: AWS Account Management & Billing
Last Updated: March 2026