Chapter 46: High Availability & Disaster Recovery Architecture
Building Resilient AWS Architectures

46.1 Overview

High Availability (HA) and Disaster Recovery (DR) are critical components of enterprise architecture, ensuring business continuity and minimal downtime during failures.
Resilient design rests on four related pillars:

- High Availability: uptime, redundancy, load distribution
- Disaster Recovery: DR plans, backup, restore
- Fault Tolerance: redundancy, failover, self-healing
- Recovery Objectives: RTO, RPO, SLA
Key Concepts

| Concept | Description |
|---|---|
| RTO | Recovery Time Objective - Maximum acceptable downtime |
| RPO | Recovery Point Objective - Maximum acceptable data loss |
| SLA | Service Level Agreement - Contractual uptime guarantee |
| MTTR | Mean Time To Recovery - Average recovery time |
46.2 Recovery Objectives
RTO and RPO
RTO (Recovery Time Objective) is the maximum acceptable time between a disaster event and service restoration. Typical targets:

- Mission Critical: < 15 minutes
- Business Critical: < 1 hour
- Business Operational: < 4 hours
- Non-Critical: < 24 hours

RPO (Recovery Point Objective) is the maximum acceptable window of data loss, measured backward from the failure to the last recoverable point. Typical targets:

- Zero data loss: RPO = 0 (synchronous replication)
- Near-zero: RPO < 1 minute (asynchronous replication)
- Low: RPO < 1 hour (frequent backups)
- Standard: RPO < 24 hours (daily backups)
SLA Calculations

| Availability Percentage | Max Downtime per Year |
|---|---|
| 99% (Two 9s) | 3.65 days (87.6 hours) |
| 99.9% (Three 9s) | 8.76 hours |
| 99.95% | 4.38 hours |
| 99.99% (Four 9s) | 52.6 minutes |
| 99.999% (Five 9s) | 5.26 minutes |

Calculation: Availability = (Total Time - Downtime) / Total Time × 100

Example downtime budgets for a 99.99% SLA:

- Per year: 365 × 24 × 60 × (1 - 0.9999) ≈ 52.6 minutes
- Per month: 30 × 24 × 60 × (1 - 0.9999) ≈ 4.3 minutes
- Per week: 7 × 24 × 60 × (1 - 0.9999) ≈ 1 minute
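The downtime budgets above follow directly from the availability formula. A small sketch (the helper name is my own) that turns an SLA percentage into an allowed-downtime figure:

```python
# Convert an availability SLA percentage into the maximum allowed
# downtime (in minutes) over a given period. Illustrative helper,
# not an AWS API.

def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Maximum downtime permitted by an SLA over a period, in minutes."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

# allowed_downtime_minutes(99.99, 365) -> ~52.6 minutes per year
# allowed_downtime_minutes(99.99, 30)  -> ~4.3 minutes per month
```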
46.3 High Availability Patterns

Multi-AZ Architecture
In a Multi-AZ deployment, a single region (for example us-east-1) hosts the application in two Availability Zones: an active EC2 instance in Zone A and a standby in Zone B, fronted by an Application Load Balancer. The database tier uses RDS Multi-AZ, which synchronously replicates the primary to a standby in the second zone. If either zone fails, the load balancer and RDS fail over to the surviving zone.
Active-Active Architecture

In an active-active design, every zone serves live traffic: the Application Load Balancer distributes requests across active EC2 instances in all Availability Zones, backed by an Aurora cluster whose primary and read replicas are spread across zones. Losing a zone reduces capacity but requires no failover delay, since the remaining instances are already taking traffic.
Auto Scaling for HA

```yaml
# Auto Scaling group spanning three Availability Zones
Resources:
  WebServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LaunchTemplate:
        LaunchTemplateId: !Ref WebServerLaunchTemplate
        Version: !GetAtt WebServerLaunchTemplate.LatestVersionNumber
      MinSize: 3
      MaxSize: 12
      DesiredCapacity: 6
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref WebServerTargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true

  # Scale out on average CPU utilization
  # (target tracking manages its own cooldowns; explicit
  # scale-out/scale-in cooldown settings are not valid here)
  ScaleOutPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70

  # Scale on ALB request count per target
  ScaleOnRequestCount:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          # ResourceLabel uses the "full name" portions, not full ARNs
          ResourceLabel: !Sub "${WebServerLoadBalancer.LoadBalancerFullName}/${WebServerTargetGroup.TargetGroupFullName}"
        TargetValue: 1000
```
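As a sanity check on a group like the one above, the zone spread of healthy instances can be inspected from the `describe_auto_scaling_groups` response. A sketch over that response shape (the group name in the comment is hypothetical):

```python
# Verify that an Auto Scaling group's healthy, in-service instances
# span at least two Availability Zones. The `group` dict matches the
# shape boto3's autoscaling describe_auto_scaling_groups returns.

from collections import Counter

def healthy_az_spread(group: dict) -> Counter:
    """Count InService, healthy instances per Availability Zone."""
    return Counter(
        i["AvailabilityZone"]
        for i in group.get("Instances", [])
        if i.get("LifecycleState") == "InService"
        and i.get("HealthStatus") == "Healthy"
    )

def is_multi_az(group: dict, min_azs: int = 2) -> bool:
    """True when healthy capacity exists in at least `min_azs` zones."""
    return len(healthy_az_spread(group)) >= min_azs

# Live usage (requires AWS credentials):
#   import boto3
#   resp = boto3.client("autoscaling").describe_auto_scaling_groups(
#       AutoScalingGroupNames=["web-server-asg"])  # hypothetical name
#   print(is_multi_az(resp["AutoScalingGroups"][0]))
```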
46.4 Disaster Recovery Strategies

DR Strategy Comparison
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours to days | Hours to days | $ |
| Pilot Light | Hours | Hours | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Region | Seconds | Seconds | $$$$ |
| Active-Active | Real-time | Zero | $$$$$ |
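Strategy choice follows from the required RTO and RPO: pick the cheapest option that still meets both. A sketch of that selection logic, with illustrative numeric thresholds standing in for the table's "hours", "minutes", and "seconds" bands:

```python
# Pick the cheapest DR strategy whose typical RTO/RPO bands cover the
# requirement. Thresholds (in minutes) are illustrative, not AWS-defined.

# (strategy, max supported RTO minutes, max supported RPO minutes),
# ordered cheapest first
STRATEGIES = [
    ("Backup & Restore", 24 * 60, 24 * 60),        # hours to days
    ("Pilot Light", 4 * 60, 4 * 60),               # hours
    ("Warm Standby", 30, 30),                      # minutes
    ("Multi-Region Active-Active", 1, 1),          # near-real-time
]

def choose_strategy(rto_minutes: float, rpo_minutes: float) -> str:
    """Cheapest strategy whose typical recovery band meets both objectives."""
    for name, typical_rto, typical_rpo in STRATEGIES:
        if rto_minutes >= typical_rto and rpo_minutes >= typical_rpo:
            return name
    # Sub-minute objectives leave only the most expensive option
    return "Multi-Region Active-Active"
```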
Backup & Restore

In the primary region, EC2 instances, RDS databases, and S3 buckets are backed up continuously: EBS snapshots, RDS snapshots, and S3 Cross-Region Replication all land in backup storage reachable from the DR region.

Recovery process:

1. Restore EBS snapshots to new volumes
2. Restore the RDS snapshot to a new instance
3. Create EC2 instances from the AMI
4. Update DNS to point to the DR region
Pilot Light

The primary region runs the full application tier on active EC2 instances, while the DR region keeps a minimal copy: the same application tier with its EC2 instances stopped, plus an RDS replica kept current through asynchronous replication from the primary.

Failover process:

1. Promote the RDS replica to primary
2. Start the EC2 instances in the DR region
3. Update Route53 to point to the DR region
4. Scale out the application tier as needed
Warm Standby

Warm standby extends pilot light: the DR region runs a scaled-down but fully functional copy of the application tier on active EC2 instances, alongside an asynchronously replicated RDS replica. Because the stack is already running, failover mostly amounts to scaling up.

Failover process:

1. Promote the RDS replica to primary
2. Scale out the EC2 instances in the DR region
3. Update Route53 health checks
4. Traffic automatically routes to DR
Multi-Region Active-Active

Route53 latency-based routing sends each user to the nearest of two fully active regions, each running its own application tier and database tier, with the database layers kept in sync through cross-region replication.

Features:

- Global load balancing with Route53
- Aurora Global Database for relational data across regions
- DynamoDB Global Tables for NoSQL
- S3 Cross-Region Replication
46.5 Database HA & DR

RDS Multi-AZ
```yaml
# RDS Multi-AZ configuration
Resources:
  PrimaryDB:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      Engine: postgres
      EngineVersion: "14.7"
      DBInstanceClass: db.r6g.xlarge
      AllocatedStorage: 100
      StorageType: gp3
      # Multi-AZ deployments place the standby themselves;
      # do not pin an AvailabilityZone here
      MultiAZ: true
      DBSubnetGroupName: !Ref DBSubnetGroup
      VPCSecurityGroups:
        - !Ref DBSecurityGroup
      BackupRetentionPeriod: 7
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  DBSubnetGroup:
    Type: AWS::RDS::DBSubnetGroup
    Properties:
      DBSubnetGroupDescription: DB subnet group
      SubnetIds:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
```
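A drift check can confirm an instance actually runs with settings like these, using the per-instance dict shape that boto3's `describe_db_instances` returns (the finding strings are my own):

```python
# Audit an RDS instance description for HA-relevant settings. The `db`
# dict matches one entry of boto3 rds describe_db_instances
# -> response["DBInstances"].

def multi_az_findings(db: dict, min_backup_days: int = 7) -> list:
    """Return a list of findings; an empty list means HA-ready."""
    findings = []
    if not db.get("MultiAZ"):
        findings.append("MultiAZ is disabled")
    if db.get("BackupRetentionPeriod", 0) < min_backup_days:
        findings.append("backup retention below target")
    if not db.get("StorageEncrypted"):
        findings.append("storage is not encrypted")
    return findings

# Live usage (requires AWS credentials):
#   import boto3
#   resp = boto3.client("rds").describe_db_instances(
#       DBInstanceIdentifier="primary-db")
#   print(multi_az_findings(resp["DBInstances"][0]))
```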
Aurora Global Database

```yaml
# Aurora Global Database (primary region stack, us-east-1)
Resources:
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      DatabaseName: appdb
      MasterUsername: admin
      MasterUserPassword: !Ref DBPassword
      DBClusterParameterGroupName: default.aurora-postgresql14
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroup
      DBSubnetGroupName: !Ref DBSubnetGroup
      EnableCloudwatchLogsExports:
        - postgresql
      BackupRetentionPeriod: 35
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  # Primary region instances
  AuroraPrimaryInstance1:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1a

  AuroraPrimaryInstance2:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1b

  # Global cluster wrapping the primary (encryption is inherited
  # from the source cluster)
  AuroraGlobalCluster:
    Type: AWS::RDS::GlobalCluster
    Properties:
      GlobalClusterIdentifier: app-global-cluster
      SourceDBClusterIdentifier: !Ref AuroraCluster
```

The secondary cluster is deployed from a separate stack in us-west-2; CloudFormation creates resources in the stack's own region, so there is no `Region` property on `AWS::RDS::DBCluster`:

```yaml
# Secondary region stack (us-west-2)
Resources:
  AuroraSecondaryCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      GlobalClusterIdentifier: app-global-cluster
      DBSubnetGroupName: !Ref DBSubnetGroupDR
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroupDR
```
DynamoDB Global Tables

```yaml
# DynamoDB global table replicated across three regions
Resources:
  GlobalTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: ApplicationData
      AttributeDefinitions:
        - AttributeName: PK
          AttributeType: S
        - AttributeName: SK
          AttributeType: S
      KeySchema:
        - AttributeName: PK
          KeyType: HASH
        - AttributeName: SK
          KeyType: RANGE
      BillingMode: PAY_PER_REQUEST
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      Replicas:
        - Region: us-east-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: us-west-2
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: eu-west-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
```
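Global tables replicate asynchronously and resolve concurrent writes to the same item with a last-writer-wins rule. A sketch of that rule over replicated item versions (the `last_updated_at` field is illustrative, not a built-in DynamoDB attribute):

```python
# Last-writer-wins conflict resolution, as DynamoDB global tables
# apply it when two regions update the same item concurrently.
# Each version dict carries a write timestamp under an illustrative key.

def resolve_conflict(versions: list) -> dict:
    """Pick the surviving item version: the one written last."""
    return max(versions, key=lambda v: v["last_updated_at"])
```

Because the later write silently overwrites the earlier one, applications that cannot tolerate lost updates should route writes for a given item to a single region.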
46.6 Application HA & DR

Load Balancer Configuration
```yaml
# Multi-AZ Application Load Balancer
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: web-app-alb
      Type: application
      Scheme: internet-facing
      IpAddressType: ipv4
      SecurityGroups:
        - !Ref ALBSecurityGroup
      Subnets:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: "60"
        - Key: deletion_protection.enabled
          Value: "true"
        - Key: routing.http2.enabled
          Value: "true"

  # Target group with health checks
  WebServerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: web-servers
      Port: 80
      Protocol: HTTP
      VpcId: !Ref VPC
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /health
      HealthCheckPort: "80"
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "30"
        - Key: stickiness.enabled
          Value: "false"

  # HTTP listener forwarding to the target group
  HTTPListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref WebServerTargetGroup
```
Route53 Health Checks & Failover

```yaml
# Route53 failover configuration
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: primary-health-check

  DRHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: dr.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: dr-health-check

  # Primary record set
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt ApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      SetIdentifier: primary

  # DR record set
  DRRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt DRApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt DRApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: SECONDARY
      HealthCheckId: !Ref DRHealthCheck
      SetIdentifier: dr
```
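The failover policy above boils down to a simple rule: answer DNS queries with the primary record while its health check passes, otherwise answer with the secondary. If every record is unhealthy, Route53 "fails open" and answers as if the records were healthy, so the primary is returned anyway. A sketch of that rule:

```python
# The decision rule behind Route53 failover routing, expressed
# directly. Endpoint strings are placeholders.

def failover_answer(primary_healthy: bool, secondary_healthy: bool,
                    primary: str, secondary: str) -> str:
    """Which record a failover policy answers with."""
    if primary_healthy:
        return primary
    if secondary_healthy:
        return secondary
    # All health checks failing: Route53 fails open rather than
    # returning no answer, so the primary wins again.
    return primary
```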
46.7 Backup Strategies

AWS Backup Configuration
```yaml
# AWS Backup configuration
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: ApplicationBackupVault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn

  # Daily backup plan
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: DailyBackupPlan
        BackupPlanRule:
          - RuleName: DailyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * * *)"  # daily at 05:00 UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            Lifecycle:
              # Cold storage requires at least 90 days of retention
              # after the transition, so it is omitted from this
              # 30-day rule
              DeleteAfterDays: 30
            CopyActions:
              - DestinationBackupVaultArn: !GetAtt DRBackupVault.BackupVaultArn
                Lifecycle:
                  DeleteAfterDays: 90

  # Weekly backup plan
  WeeklyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: WeeklyBackupPlan
        BackupPlanRule:
          - RuleName: WeeklyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * SUN *)"  # weekly on Sunday
            StartWindowMinutes: 60
            CompletionWindowMinutes: 240
            Lifecycle:
              DeleteAfterDays: 365
              MoveToColdStorageAfterDays: 30

  # Backup selection (resource identifiers must be ARNs)
  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref DailyBackupPlan
      BackupSelection:
        SelectionName: ApplicationResources
        IamRoleArn: !GetAtt BackupServiceRole.Arn
        Resources:
          - !GetAtt PrimaryDB.DBInstanceArn
          - !Sub "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume1}"
          - !Sub "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume2}"
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: Backup
            ConditionValue: required
```
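AWS Backup applies the `DeleteAfterDays` lifecycle automatically; the same retention rule can be expressed directly, for example when auditing recovery points against policy (the helper name is my own):

```python
# Which recovery points fall outside a DeleteAfterDays retention
# window. Illustrative audit logic; AWS Backup enforces the real
# lifecycle itself.

from datetime import date, timedelta

def expired_recovery_points(created: list, today: date,
                            delete_after_days: int = 30) -> list:
    """Creation dates older than the retention window, oldest logic first."""
    cutoff = today - timedelta(days=delete_after_days)
    return [d for d in created if d < cutoff]
```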
S3 Cross-Region Replication

```yaml
# S3 cross-region replication (the destination bucket lives in the
# DR region, typically created from a separate stack)
Resources:
  PrimaryBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: primary-data-bucket
      VersioningConfiguration:
        Status: Enabled
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1
            Filter:
              Prefix: ""
            DeleteMarkerReplication:
              Status: Enabled
            Destination:
              Bucket: arn:aws:s3:::dr-data-bucket
              StorageClass: STANDARD
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15   # replicate within 15 minutes
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToGlacier
            Status: Enabled
            Transitions:
              - TransitionInDays: 90
                StorageClass: GLACIER

  # Destination bucket (deploy in the DR region)
  DRBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: dr-data-bucket
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
```
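Replication progress for an individual object can be checked through the `ReplicationStatus` field S3 returns on `head_object` for buckets with replication configured (AWS documentation has used both the `COMPLETE` and `COMPLETED` spellings over time, so the sketch accepts either):

```python
# Classify an object's cross-region replication state from the
# ReplicationStatus field in an S3 head_object response.

def replication_ok(head_response: dict) -> bool:
    """True when the object has replicated, or is itself a replica."""
    status = head_response.get("ReplicationStatus")
    # COMPLETE/COMPLETED: source object replicated; REPLICA: this copy
    # was created by replication. PENDING/FAILED mean not done.
    return status in ("COMPLETE", "COMPLETED", "REPLICA")

# Live usage (requires AWS credentials; bucket/key are hypothetical):
#   import boto3
#   resp = boto3.client("s3").head_object(
#       Bucket="primary-data-bucket", Key="example-key")
#   print(replication_ok(resp))
```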
46.8 Failover Automation

Lambda Failover Function
```python
import boto3
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """Automated failover: promote the DR database, scale out the DR
    application tier, and repoint DNS at the DR load balancer."""

    # Parse event
    primary_region = event.get('primary_region', 'us-east-1')
    dr_region = event.get('dr_region', 'us-west-2')
    db_cluster = event.get('db_cluster')
    hosted_zone_id = event.get('hosted_zone_id')
    record_name = event.get('record_name')
    asg_name = event.get('asg_name')
    dr_alb_hosted_zone_id = event.get('dr_alb_hosted_zone_id')
    dr_alb_dns_name = event.get('dr_alb_dns_name')

    # Clients acting on DR resources must target the DR region
    rds = boto3.client('rds', region_name=dr_region)
    autoscaling = boto3.client('autoscaling', region_name=dr_region)
    route53 = boto3.client('route53')  # Route53 is global

    try:
        # 1. Promote the Aurora replica cluster to primary
        logger.info(f"Promoting RDS replica in {dr_region}")
        rds.promote_read_replica_db_cluster(
            DBClusterIdentifier=db_cluster
        )

        # 2. Scale out the DR Auto Scaling group
        logger.info(f"Scaling out ASG {asg_name}")
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=6,
            HonorCooldown=False
        )

        # 3. Repoint the Route53 alias at the DR load balancer
        logger.info(f"Updating Route53 record {record_name}")
        route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                'Changes': [{
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': record_name,
                        'Type': 'A',
                        'AliasTarget': {
                            'HostedZoneId': dr_alb_hosted_zone_id,
                            'DNSName': dr_alb_dns_name,
                            'EvaluateTargetHealth': True
                        }
                    }
                }]
            }
        )

        # 4. Report success
        logger.info("Failover completed successfully")
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Failover completed',
                'primary_region': primary_region,
                'dr_region': dr_region
            })
        }

    except Exception as e:
        logger.error(f"Failover failed: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'message': 'Failover failed',
                'error': str(e)
            })
        }
```
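During failover drills, the achieved RTO and RPO should be measured and compared against the objectives from 46.2. A sketch that derives both from drill timestamps (the milestone field names are my own):

```python
# Compute achieved RTO and RPO from timestamps recorded during a
# failover drill. Milestone names are illustrative conventions.

from datetime import datetime

def measure_drill(events: dict) -> dict:
    """events: ISO-8601 timestamps for last_replicated_write,
    failure_detected, and service_restored."""
    t = {k: datetime.fromisoformat(v) for k, v in events.items()}
    return {
        # RTO: outage detection until service is back
        "rto_seconds": (t["service_restored"] - t["failure_detected"]).total_seconds(),
        # RPO: data written after the last replicated write is lost
        "rpo_seconds": (t["failure_detected"] - t["last_replicated_write"]).total_seconds(),
    }
```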
CloudWatch Events for Failover

```yaml
# EventBridge (CloudWatch Events) rule for automated failover
Resources:
  FailoverTrigger:
    Type: AWS::Events::Rule
    Properties:
      Name: FailoverTrigger
      Description: Trigger failover on primary region failure
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
        detail:
          eventTypeCategory:
            - issue
            - accountNotification
          service:
            - EC2
            - RDS
          statusCode:
            - open
      State: ENABLED
      Targets:
        - Id: FailoverFunction
          Arn: !GetAtt FailoverFunction.Arn
          InputTransformer:
            InputPathsMap:
              region: $.region
              service: $.detail.service
            InputTemplate: |
              {
                "primary_region": "us-east-1",
                "dr_region": "us-west-2",
                "service": "<service>",
                "trigger": "health_event"
              }
```
46.9 Testing & Validation

DR Testing Framework
DR testing progresses through four levels of increasing realism:

1. Tabletop Exercise: walk through DR procedures, identify gaps and issues, update documentation
2. Component Testing: test individual components, verify backup/restore, validate replication
3. Simulation Testing: simulate failure scenarios, test failover procedures, measure RTO/RPO
4. Full-Scale Testing: complete failover to DR under full production traffic, validating all systems
Chaos Engineering with AWS FIS

```yaml
# AWS Fault Injection Simulator experiment
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # EC2 actions need the EC2 FIS policy, not the network one
        - arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access

  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Simulate AZ failure
      RoleArn: !GetAtt FISExperimentRole.Arn
      Targets:
        Instances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            Environment: production
          # Target a specific AZ via a filter path, not a parameter
          Filters:
            - Path: Placement.AvailabilityZone
              Values:
                - us-east-1a
          SelectionMode: COUNT(1)
      Actions:
        TerminateInstance:
          ActionId: aws:ec2:terminate-instances
          Targets:
            Instances: Instances
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: !GetAtt HighErrorRateAlarm.Arn
      Tags:
        Name: AZ-Failure-Test
```
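The stop condition above aborts the experiment when an error-rate alarm fires. The steady-state hypothesis behind such an alarm can be expressed directly (the 5% threshold is illustrative):

```python
# The steady-state check behind a chaos experiment's abort alarm:
# the system is considered healthy while the error rate stays under
# an agreed threshold.

def steady_state_ok(errors: int, requests: int,
                    max_error_rate: float = 0.05) -> bool:
    """True while the error rate stays within the abort threshold."""
    if requests == 0:
        return True  # no traffic, nothing to judge
    return errors / requests <= max_error_rate
```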
46.10 Best Practices

HA Design Principles
1. Design for failure: assume components will fail, implement redundancy at every layer, and prefer managed services with built-in HA.
2. Implement health checks at every level: application, load balancer, Route53, and Auto Scaling.
3. Automate recovery: Auto Scaling for self-healing, automated database failover, and Route53 automatic failover.
4. Test regularly: conduct scheduled DR tests, use chaos engineering, and document improvements.
DR Checklist

# DR Readiness Checklist

## Infrastructure

- [ ] Multi-AZ deployment for all tiers
- [ ] Cross-region replication configured
- [ ] Automated backups enabled
- [ ] Backup retention meets RPO requirements

## Database

- [ ] Multi-AZ RDS or Aurora configured
- [ ] Read replicas in DR region
- [ ] Cross-region replication tested
- [ ] Point-in-time recovery enabled

## Application

- [ ] Stateless application design
- [ ] Auto Scaling configured
- [ ] Load balancer health checks
- [ ] Blue/green deployment capability

## Network

- [ ] Route53 health checks configured
- [ ] Failover routing policies
- [ ] Cross-region VPC peering
- [ ] VPN/Direct Connect redundancy

## Operations

- [ ] DR runbook documented
- [ ] Failover automation tested
- [ ] Communication plan established
- [ ] Regular DR drills scheduled
46.11 Key Takeaways

| Topic | Key Points |
|---|---|
| RTO/RPO | Define recovery objectives before designing |
| Multi-AZ | Deploy across multiple AZs for HA |
| DR Strategies | Choose strategy based on RTO/RPO requirements |
| Automation | Automate failover and recovery processes |
| Testing | Regularly test DR procedures |
| Monitoring | Implement comprehensive health checks |
46.12 References
- AWS Disaster Recovery
- AWS Well-Architected Reliability Pillar
- AWS Backup
- AWS Fault Injection Simulator
Next Chapter: Chapter 47 - Cost Optimization & FinOps