Chapter 46: High Availability & Disaster Recovery Architecture

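High Availability (HA) and Disaster Recovery (DR) are critical components of enterprise architecture. HA keeps a service running through component and Availability Zone failures; DR restores service and data after a larger outage, such as the loss of a region. Together they provide business continuity and keep downtime within agreed limits.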

HA & DR Overview
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | Resilient Design | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | High | | Disaster | | Fault | | Recovery | |
| | Availabi | | Recovery | | Tolerance| | Time | |
| | -lity | | | | | | Objectives| |
| | | | | | | | | |
| | - Uptime | | - DR Plan| | - Redund | | - RTO | |
| | - Redund | | - Backup | | - Failover| | - RPO | |
| | - Load | | - Restore| | - Self-Heal| | - SLA | |
| +----------+ +----------+ +----------+ +----------+ |
| |
+------------------------------------------------------------------+
Concept | Description
--------+---------------------------------------------------------
RTO     | Recovery Time Objective - Maximum acceptable downtime
RPO     | Recovery Point Objective - Maximum acceptable data loss
SLA     | Service Level Agreement - Contractual uptime guarantee
MTTR    | Mean Time To Recovery - Average recovery time

Recovery Objectives
+------------------------------------------------------------------+
| |
| RTO (Recovery Time Objective) |
| +----------------------------------------------------------+ |
| | | |
| | Disaster Recovery Time Service Restored | |
| | +--------+ +----------------+ +----------+ | |
| | | Event | ----> | | --> | Restored | | |
| | +--------+ +----------------+ +----------+ | |
| | |<----- RTO ---->| | |
| | | |
| | Examples: | |
| | - Mission Critical: < 15 minutes | |
| | - Business Critical: < 1 hour | |
| | - Business Operational: < 4 hours | |
| | - Non-Critical: < 24 hours | |
| | | |
| +----------------------------------------------------------+ |
| |
| RPO (Recovery Point Objective) |
| +----------------------------------------------------------+ |
| | | |
| | Last Backup Data Loss Recovery Point | |
| | +--------+ +--------+ +----------+ | |
| | | Backup | ------->| Lost | --------> | Recover | | |
| | +--------+ +--------+ +----------+ | |
| | |<--------------- RPO --------------->| | |
| | | |
| | Examples: | |
| | - Zero Data Loss: RPO = 0 (synchronous replication) | |
| | - Near-Zero: RPO < 1 minute (async replication) | |
| | - Low: RPO < 1 hour (frequent backups) | |
| | - Standard: RPO < 24 hours (daily backups) | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
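The example tiers in the diagram can be expressed as a simple lookup. The following Python sketch maps a target RTO or RPO to the tier names shown above. The function names `rto_tier` and `rpo_tier` are illustrative, and treating every bound as strictly exclusive is an assumption based on the `<` signs in the examples.

```python
from datetime import timedelta

# Upper bounds (exclusive) taken from the RTO examples in the diagram.
RTO_TIERS = [
    ("Mission Critical", timedelta(minutes=15)),
    ("Business Critical", timedelta(hours=1)),
    ("Business Operational", timedelta(hours=4)),
    ("Non-Critical", timedelta(hours=24)),
]

# Upper bounds (exclusive) taken from the RPO examples; RPO = 0 is handled separately.
RPO_TIERS = [
    ("Near-Zero", timedelta(minutes=1)),
    ("Low", timedelta(hours=1)),
    ("Standard", timedelta(hours=24)),
]


def _first_tier(value: timedelta, tiers) -> str:
    """Return the first tier whose bound is strictly greater than value."""
    for name, bound in tiers:
        if value < bound:
            return name
    return "Unclassified"


def rto_tier(rto: timedelta) -> str:
    return _first_tier(rto, RTO_TIERS)


def rpo_tier(rpo: timedelta) -> str:
    if rpo == timedelta(0):
        return "Zero Data Loss"
    return _first_tier(rpo, RPO_TIERS)


print(rto_tier(timedelta(minutes=45)))  # Business Critical
print(rpo_tier(timedelta(minutes=5)))   # Low
```

Classifying each workload this way before choosing an architecture keeps the recovery targets explicit rather than implied by whatever the infrastructure happens to deliver.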
Availability SLA
+------------------------------------------------------------------+
| |
| Availability Percentage | Max Downtime per Year |
| ---------------------------+------------------------------------ |
| 99% (Two 9s) | 3.65 days (87.6 hours) |
| 99.9% (Three 9s) | 8.76 hours |
| 99.95% | 4.38 hours |
| 99.99% (Four 9s) | 52.6 minutes |
| 99.999% (Five 9s) | 5.26 minutes |
| |
| Calculation: |
| Availability = (Total Time - Downtime) / Total Time × 100 |
| |
| Example for 99.99% SLA: |
| - Per Year: 365 × 24 × 60 × (1 - 0.9999) = 52.6 minutes |
| - Per Month: 30 × 24 × 60 × (1 - 0.9999) = 4.3 minutes |
| - Per Week: 7 × 24 × 60 × (1 - 0.9999) = 1 minute |
| |
+------------------------------------------------------------------+
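The downtime figures in the table follow directly from the availability formula. As a minimal sketch, the calculation can be written in Python as shown here; the function names and the 30-day month are assumptions that mirror the worked example above.

```python
# Minutes in each period, using the same 365-day year and 30-day month as the table.
MINUTES_PER = {
    "year": 365 * 24 * 60,
    "month": 30 * 24 * 60,
    "week": 7 * 24 * 60,
}


def downtime_budget_minutes(availability_pct: float, period: str = "year") -> float:
    """Maximum downtime, in minutes, that still meets the availability target."""
    return MINUTES_PER[period] * (1 - availability_pct / 100)


def measured_availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (Total Time - Downtime) / Total Time x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


for period in MINUTES_PER:
    budget = downtime_budget_minutes(99.99, period)
    print(f"99.99% per {period}: {budget:.2f} minutes")
```

For a 99.99% target this gives roughly 52.56 minutes per year, 4.32 per month, and 1.01 per week, which the table rounds to 52.6, 4.3, and 1.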

Multi-AZ Architecture
+------------------------------------------------------------------+
| |
| Region (us-east-1) |
| +----------------------------------------------------------+ |
| | | |
| | +------------------+ +------------------+ | |
| | | Availability | | Availability | | |
| | | Zone A | | Zone B | | |
| | | +--------------+ | | +--------------+ | | |
| | | | | | | | | | | |
| | | | +--------+ | | | | +--------+ | | | |
| | | | | EC2 | | | | | | EC2 | | | | |
| | | | | Active | | | | | | Standby| | | | |
| | | | +--------+ | | | | +--------+ | | | |
| | | | | | | | | | | |
| | | +------+-------+ | | +------+-------+ | | |
| | | | | | | | | |
| | +--------+---------+ +--------+---------+ | |
| | | | | |
| | v v | |
| | +----------------------------------------------------------+ |
| | | Application Load Balancer | |
| | +----------------------------------------------------------+ |
| | | | | |
| | v v | |
| | +------------------+ +------------------+ | |
| | | RDS Primary | | RDS Standby | | |
| | | (Sync Replication)|->| (Multi-AZ) | | |
| | +------------------+ +------------------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Active-Active Architecture
+------------------------------------------------------------------+
| |
| Region (us-east-1) |
| +----------------------------------------------------------+ |
| | | |
| | +------------------+ +------------------+ | |
| | | Availability | | Availability | | |
| | | Zone A | | Zone B | | |
| | | | | | | |
| | | +--------+ | | +--------+ | | |
| | | | EC2 | | | | EC2 | | | |
| | | | Active | | | | Active | | | |
| | | +--------+ | | +--------+ | | |
| | | | | | | | | |
| | | v | | v | | |
| | | +--------+ | | +--------+ | | |
| | | | EC2 | | | | EC2 | | | |
| | | | Active | | | | Active | | | |
| | | +--------+ | | +--------+ | | |
| | | | | | | |
| | +--------+---------+ +---------+--------+ | |
| | | | | |
| | +----------+-------------+ | |
| | | | |
| | v | |
| | +----------------------------------------------------------+ |
| | | Application Load Balancer | |
| | | (Distributes traffic across all AZs) | |
| | +----------------------------------------------------------+ |
| | | | |
| | v | |
| | +----------------------------------------------------------+ |
| | | Aurora Database (Multi-AZ) | |
| | | +--------+ +--------+ +--------+ +--------+ | |
| | | |Primary | |Replica1| |Replica2| |Replica3| | |
| | | +--------+ +--------+ +--------+ +--------+ | |
| | +----------------------------------------------------------+ |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
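Spreading a tier across Availability Zones raises its availability because the tier fails only when every copy fails. The sketch below applies the standard parallel and serial availability formulas. It assumes AZ failures are independent, which is an idealization, and the per-copy availability figures are placeholders rather than AWS commitments.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of redundant copies when any one healthy copy is enough.

    Assumes failures are independent: the tier is down only if all copies are down.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


def serial_availability(*tiers: float) -> float:
    """Availability of a request path that needs every tier to be up."""
    result = 1.0
    for tier in tiers:
        result *= tier
    return result


# Placeholder figures: a web tier and a database tier, each deployed in two AZs.
web = parallel_availability(0.99, copies=2)
database = parallel_availability(0.995, copies=2)
print(f"web tier: {web:.6f}")
print(f"end to end: {serial_availability(web, database):.6f}")
```

Two copies at 99% each reach 99.99% for that tier under the independence assumption, while the end-to-end figure is always lower than the weakest tier in the chain.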
# Auto Scaling Group Configuration
Resources:
  WebServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # Placeholder subnet IDs, one per Availability Zone
      VPCZoneIdentifier:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LaunchTemplate:
        LaunchTemplateId: !Ref WebServerLaunchTemplate
        Version: !GetAtt WebServerLaunchTemplate.LatestVersionNumber
      MinSize: 3
      MaxSize: 12
      DesiredCapacity: 6
      # Replace instances that fail the load balancer health check
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref WebServerTargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true
  # Scale out policy
  ScaleOutPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      # EC2 Auto Scaling target tracking is tuned with an instance warm-up time;
      # ScaleOutCooldown/ScaleInCooldown exist only in Application Auto Scaling
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70
  # Scale on ALB request count
  ScaleOnRequestCount:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>
          ResourceLabel: !Sub "${ApplicationLoadBalancer.LoadBalancerFullName}/${WebServerTargetGroup.TargetGroupFullName}"
        TargetValue: 1000

DR Strategies
+------------------------------------------------------------------+
| |
| Strategy | RTO | RPO | Cost |
| -------------------+------------+------------+------------------ |
| Backup & Restore | Hours-Days | Hours-Days | $ |
| Pilot Light | Hours | Hours | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Region | Seconds | Seconds | $$$$ |
| Active-Active | Real-time | Near-zero | $$$$$ |
| |
+------------------------------------------------------------------+
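The table reads naturally as a selection rule: pick the cheapest strategy whose RTO and RPO fit the targets. The following Python sketch illustrates that rule. The numeric RTO/RPO values are illustrative assumptions chosen only to match the table's order of magnitude; real figures depend on the workload and must come from testing.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    rto: timedelta      # assumed achievable recovery time
    rpo: timedelta      # assumed achievable data-loss window
    cost_rank: int      # 1 = cheapest, matching the number of $ signs


# Illustrative placeholders, not measured or vendor-published values.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=2), timedelta(hours=2), 2),
    Strategy("Warm Standby", timedelta(minutes=10), timedelta(minutes=10), 3),
    Strategy("Multi-Region", timedelta(seconds=30), timedelta(seconds=30), 4),
    Strategy("Active-Active", timedelta(seconds=1), timedelta(seconds=1), 5),
]


def cheapest_strategy(rto_target: timedelta, rpo_target: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, or None."""
    candidates = [
        s for s in STRATEGIES
        if s.rto <= rto_target and s.rpo <= rpo_target
    ]
    return min(candidates, key=lambda s: s.cost_rank) if candidates else None


choice = cheapest_strategy(timedelta(minutes=15), timedelta(minutes=15))
print(choice.name if choice else "no strategy meets the targets")
```

Starting from the targets and working up the cost column avoids paying for an active-active deployment when a warm standby already satisfies the business requirement.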
Backup & Restore Strategy
+------------------------------------------------------------------+
| |
| Primary Region |
| +----------------------------------------------------------+ |
| | | |
| | +----------+ +----------+ +----------+ | |
| | | EC2 | | RDS | | S3 | | |
| | | Instances| | Database | | Buckets | | |
| | +----+-----+ +----+-----+ +----+-----+ | |
| | | | | | |
| | v v v | |
| | +----------------------------------------------------+ | |
| | | Backup Process | | |
| | | - EBS Snapshots | | |
| | | - RDS Snapshots | | |
| | | - S3 Cross-Region Replication | | |
| | +----------------------------------------------------+ | |
| | | | | | |
| +-------+------------+------------+-------------------------+ |
| | | | |
| v v v |
| +----------------------------------------------------------+ |
| | S3 (Backup Storage) | |
| +----------------------------------------------------------+ |
| | | | |
| v v v |
| DR Region |
| +----------------------------------------------------------+ |
| | | |
| | Recovery Process: | |
| | 1. Restore EBS snapshots to new volumes | |
| | 2. Restore RDS snapshot to new instance | |
| | 3. Create EC2 instances from AMI | |
| | 4. Update DNS to point to DR region | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Pilot Light Strategy
+------------------------------------------------------------------+
| |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Stopped| | | |
| | | +--------+ | | | | +--------+ | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Stopped| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | ^ | |
| | v | | | | |
| | +------------------+ | | +------------------+ | |
| | | Database Tier | | | | Database Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | RDS |------|--|------>| | RDS | | | |
| | | | Primary| Async| | Rep | | Replica| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Failover Process: |
| 1. Promote RDS replica to primary |
| 2. Start EC2 instances in DR region |
| 3. Update Route53 to point to DR region |
| 4. Scale out application tier as needed |
| |
+------------------------------------------------------------------+
Warm Standby Strategy
+------------------------------------------------------------------+
| |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | | (Scaled| | | |
| | | +--------+ | | | | | Down) | | | |
| | | | EC2 | | | | | +--------+ | | |
| | | | Active | | | | | +--------+ | | |
| | | +--------+ | | | | | EC2 | | | |
| | +------------------+ | | | | Active | | | |
| | | | | +------------------+ | |
| | v | | ^ | |
| | +------------------+ | | | | |
| | | Database Tier | | | +------------------+ | |
| | | | | | | Database Tier | | |
| | | +--------+ | | | | | | |
| | | | RDS |------|--|------>| | RDS | | | |
| | | | Primary| Async| | Rep | | Replica| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Failover Process: |
| 1. Promote RDS replica to primary |
| 2. Scale out EC2 instances in DR region |
| 3. Update Route53 health checks |
| 4. Traffic automatically routes to DR |
| |
+------------------------------------------------------------------+
Multi-Region Active-Active
+------------------------------------------------------------------+
| |
| Route53 |
| +----------------------------------------------------------+ |
| | Latency-Based Routing | |
| +----------------------------------------------------------+ |
| | | |
| v v |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | +--------+ | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | | | |
| | v | | v | |
| | +------------------+ | | +------------------+ | |
| | | Database Tier | | | | Database Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | Aurora |------|--|------>| | Aurora | | | |
| | | | Primary| Async| | Rep | |Secondary| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Features: |
| - Global load balancing with Route53 |
| - Aurora Global Database for multi-region |
| - DynamoDB Global Tables for NoSQL |
| - S3 Cross-Region Replication |
| |
+------------------------------------------------------------------+

# RDS Multi-AZ Configuration
Resources:
  PrimaryDB:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      Engine: postgres
      EngineVersion: "14.7"
      DBInstanceClass: db.r6g.xlarge
      AllocatedStorage: 100
      StorageType: gp3
      # With MultiAZ enabled, RDS places the primary and its synchronous standby
      # in different AZs itself; AvailabilityZone must not be set on such an instance
      MultiAZ: true
      # Master credentials from assumed template parameters (not defined in this snippet)
      MasterUsername: !Ref DBUsername
      MasterUserPassword: !Ref DBPassword
      DBSubnetGroupName: !Ref DBSubnetGroup
      VPCSecurityGroups:
        - !Ref DBSecurityGroup
      BackupRetentionPeriod: 7
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey
  DBSubnetGroup:
    Type: AWS::RDS::DBSubnetGroup
    Properties:
      DBSubnetGroupDescription: DB subnet group
      # Placeholder subnet IDs, one per Availability Zone
      SubnetIds:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
# Aurora Global Database
Resources:
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      DatabaseName: appdb
      MasterUsername: dbadmin  # "admin" is a reserved word for the PostgreSQL engines
      MasterUserPassword: !Ref DBPassword
      DBClusterParameterGroupName: default.aurora-postgresql14
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroup
      DBSubnetGroupName: !Ref DBSubnetGroup
      EnableCloudwatchLogsExports:
        - postgresql
      BackupRetentionPeriod: 35
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey
  # Primary region instances
  AuroraPrimaryInstance1:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1a
  AuroraPrimaryInstance2:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1b
  # Global cluster; encryption settings are inherited from the source cluster
  AuroraGlobalCluster:
    Type: AWS::RDS::GlobalCluster
    Properties:
      GlobalClusterIdentifier: app-global-cluster
      SourceDBClusterIdentifier: !Ref AuroraCluster
  # Secondary region cluster (in us-west-2). A stack creates resources only in
  # its own region, so this resource belongs in a separate stack deployed to us-west-2.
  AuroraSecondaryCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      # Passed as a literal because !Ref cannot cross stacks or regions
      GlobalClusterIdentifier: app-global-cluster
      DBSubnetGroupName: !Ref DBSubnetGroupDR
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroupDR
      # An encrypted secondary uses a KMS key from its own region (assumed resource)
      KmsKeyId: !Ref DBKMSKeyDR
# DynamoDB Global Table
Resources:
  GlobalTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: ApplicationData
      AttributeDefinitions:
        - AttributeName: PK
          AttributeType: S
        - AttributeName: SK
          AttributeType: S
      KeySchema:
        - AttributeName: PK
          KeyType: HASH
        - AttributeName: SK
          KeyType: RANGE
      BillingMode: PAY_PER_REQUEST
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      Replicas:
        - Region: us-east-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: us-west-2
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: eu-west-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS

# Multi-AZ Load Balancer
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: web-app-alb
      Type: application
      Scheme: internet-facing
      IpAddressType: ipv4
      SecurityGroups:
        - !Ref ALBSecurityGroup
      # Placeholder subnet IDs, one per Availability Zone
      Subnets:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: "60"
        - Key: deletion_protection.enabled
          Value: "true"
        - Key: routing.http2.enabled
          Value: "true"
  # Target group with health checks
  WebServerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: web-servers
      Port: 80
      Protocol: HTTP
      VpcId: !Ref VPC
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /health
      HealthCheckPort: "80"
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "30"
        - Key: stickiness.enabled
          Value: "false"
  # HTTP listener; traffic is forwarded only to targets that pass the health check
  HTTPListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref WebServerTargetGroup
# Route53 Failover Configuration
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: primary-health-check
  DRHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: dr.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: dr-health-check
  # Primary record set
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt ApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      SetIdentifier: primary
  # DR record set
  DRRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt DRApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt DRApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: SECONDARY
      HealthCheckId: !Ref DRHealthCheck
      SetIdentifier: dr
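The health check settings above bound how quickly DNS failover can begin: Route 53 must see `FailureThreshold` consecutive failures spaced `RequestInterval` seconds apart, and resolvers may keep the old answer until the record's TTL expires. The sketch below gives a rough upper estimate from those inputs. The 60-second default TTL is an assumption, and the figure ignores application start-up time in the DR region.

```python
def dns_failover_estimate_seconds(
    request_interval: int = 30,
    failure_threshold: int = 3,
    record_ttl: int = 60,  # assumed TTL; check the actual value for your records
) -> int:
    """Rough estimate of the time until clients resolve to the DR endpoint."""
    if request_interval <= 0 or failure_threshold <= 0 or record_ttl < 0:
        raise ValueError("interval and threshold must be positive, TTL non-negative")
    detection = request_interval * failure_threshold  # consecutive failed checks
    return detection + record_ttl                     # plus cached DNS answers


# Settings from the health checks above: 30-second interval, threshold of 3.
print(dns_failover_estimate_seconds())                      # 150
# A shorter check interval reduces detection time.
print(dns_failover_estimate_seconds(request_interval=10))   # 90
```

Comparing this estimate with the RTO target shows early whether DNS-based failover alone can meet it or whether a faster mechanism is needed.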

# AWS Backup Configuration
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: ApplicationBackupVault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn
  # Daily backup plan
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: DailyBackupPlan
        BackupPlanRule:
          - RuleName: DailyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * * *)"  # Daily at 5 AM UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            # Kept 30 days in warm storage; a cold-storage transition requires
            # at least 90 further days of retention, so none is configured here
            Lifecycle:
              DeleteAfterDays: 30
            CopyActions:
              # Assumed template parameter: ARN of a backup vault in the DR region
              - DestinationBackupVaultArn: !Ref DRBackupVaultArn
                Lifecycle:
                  DeleteAfterDays: 90
  # Weekly backup plan
  WeeklyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: WeeklyBackupPlan
        BackupPlanRule:
          - RuleName: WeeklyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * SUN *)"  # Weekly on Sunday
            StartWindowMinutes: 60
            CompletionWindowMinutes: 240
            Lifecycle:
              DeleteAfterDays: 365
              MoveToColdStorageAfterDays: 30
  # Backup selection
  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref DailyBackupPlan
      BackupSelection:
        SelectionName: ApplicationResources
        IamRoleArn: !GetAtt BackupServiceRole.Arn
        # Resources takes ARNs rather than the IDs returned by !Ref
        Resources:
          - !GetAtt PrimaryDB.DBInstanceArn
          - !Sub "arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume1}"
          - !Sub "arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume2}"
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: Backup
            ConditionValue: required
# S3 Cross-Region Replication
Resources:
  PrimaryBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: primary-data-bucket
      VersioningConfiguration:
        Status: Enabled
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1
            # An empty prefix filter matches every object; a filter is required
            # when Priority and DeleteMarkerReplication are used
            Filter:
              Prefix: ""
            DeleteMarkerReplication:
              Status: Enabled
            Destination:
              # Destination is given as a bucket ARN
              Bucket: arn:aws:s3:::dr-data-bucket
              StorageClass: STANDARD
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15  # Replicate within 15 minutes
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToGlacier
            Status: Enabled
            Transitions:
              - TransitionInDays: 90
                StorageClass: GLACIER
  # For cross-region replication this bucket must be created in the DR region,
  # which means a separate stack deployed there
  DRBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: dr-data-bucket
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
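The AWS Backup lifecycle settings earlier in this section (`DeleteAfterDays`, `MoveToColdStorageAfterDays`) are subject to a constraint that is easy to miss: a recovery point moved to cold storage must stay there for at least 90 days, so retention has to exceed the transition age by 90 days or more. The following sketch checks a rule before deployment. The function name is hypothetical, and the check covers only this one constraint.

```python
from typing import List, Optional

# Minimum time a recovery point must remain in cold storage.
COLD_STORAGE_MIN_DAYS = 90


def lifecycle_problems(
    delete_after_days: int,
    move_to_cold_after_days: Optional[int] = None,
) -> List[str]:
    """Return a list of problems with a backup lifecycle rule (empty if none)."""
    problems = []
    if delete_after_days <= 0:
        problems.append("DeleteAfterDays must be positive")
    if move_to_cold_after_days is not None:
        minimum = move_to_cold_after_days + COLD_STORAGE_MIN_DAYS
        if delete_after_days < minimum:
            problems.append(
                f"DeleteAfterDays must be at least {minimum} "
                f"when MoveToColdStorageAfterDays is {move_to_cold_after_days}"
            )
    return problems


print(lifecycle_problems(365, 30))  # []
print(lifecycle_problems(60, 7))    # one problem: retention shorter than 97 days
```

Running a check like this in CI catches invalid lifecycle combinations before the stack update fails.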

failover_function.py
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """
    Automated failover function
    """
    # Parse event
    primary_region = event.get('primary_region', 'us-east-1')
    dr_region = event.get('dr_region', 'us-west-2')
    db_cluster = event.get('db_cluster')
    hosted_zone_id = event.get('hosted_zone_id')
    record_name = event.get('record_name')
    asg_name = event.get('asg_name')
    # Alias target of the DR load balancer, also read from the event
    dr_alb_hosted_zone_id = event.get('dr_alb_hosted_zone_id')
    dr_alb_dns_name = event.get('dr_alb_dns_name')

    # RDS and Auto Scaling calls must target the DR region; Route53 is global
    rds = boto3.client('rds', region_name=dr_region)
    autoscaling = boto3.client('autoscaling', region_name=dr_region)
    route53 = boto3.client('route53')

    try:
        # 1. Promote RDS replica to primary
        logger.info(f"Promoting RDS replica in {dr_region}")
        rds.promote_read_replica_db_cluster(
            DBClusterIdentifier=db_cluster
        )

        # 2. Scale out DR Auto Scaling Group
        logger.info(f"Scaling out ASG {asg_name}")
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=6,
            HonorCooldown=False
        )

        # 3. Update Route53 record
        logger.info(f"Updating Route53 record {record_name}")
        route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': record_name,
                            'Type': 'A',
                            'AliasTarget': {
                                'HostedZoneId': dr_alb_hosted_zone_id,
                                'DNSName': dr_alb_dns_name,
                                'EvaluateTargetHealth': True
                            }
                        }
                    }
                ]
            }
        )

        # 4. Report the outcome to the caller
        logger.info("Failover completed successfully")
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Failover completed',
                'primary_region': primary_region,
                'dr_region': dr_region
            })
        }
    except Exception as e:
        # Log the full traceback so a failed step can be diagnosed
        logger.exception("Failover failed")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'message': 'Failover failed',
                'error': str(e)
            })
        }
# CloudWatch Events for automated failover
Resources:
  FailoverTrigger:
    Type: AWS::Events::Rule
    Properties:
      Name: FailoverTrigger
      Description: Trigger failover on primary region failure
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
        detail:
          eventTypeCategory:
            - issue
            - accountNotification
          service:
            - EC2
            - RDS
          statusCode:
            - open
      State: ENABLED
      Targets:
        - Id: FailoverFunction
          # Targets take the function ARN; !Ref would return only the function name
          Arn: !GetAtt FailoverFunction.Arn
          InputTransformer:
            InputPathsMap:
              region: $.region
              service: $.detail.service
            InputTemplate: |
              {
                "primary_region": "us-east-1",
                "dr_region": "us-west-2",
                "service": "<service>",
                "trigger": "health_event"
              }
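The handler in failover_function.py reads `db_cluster`, `hosted_zone_id`, `record_name`, and `asg_name` from the event, while the InputTemplate above supplies only region and service fields. A small pre-flight check, sketched below, makes that gap visible before any AWS call is attempted. The helper name and the key list are assumptions drawn from the handler's `event.get(...)` calls; extend the list if the handler needs more inputs.

```python
from typing import List

# Keys the failover handler needs but cannot default.
REQUIRED_KEYS = ("db_cluster", "hosted_zone_id", "record_name", "asg_name")


def missing_failover_keys(event: dict) -> List[str]:
    """Return the required keys that are absent or empty in the event."""
    return [key for key in REQUIRED_KEYS if not event.get(key)]


# Event shaped like the InputTemplate output above.
template_event = {
    "primary_region": "us-east-1",
    "dr_region": "us-west-2",
    "service": "EC2",
    "trigger": "health_event",
}

missing = missing_failover_keys(template_event)
if missing:
    print("Refusing to fail over; missing keys:", ", ".join(missing))
```

Calling this at the top of the handler and returning early on a non-empty result avoids a half-completed failover caused by a `None` identifier.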

DR Testing Framework
+------------------------------------------------------------------+
| |
| Test Types |
| +----------------------------------------------------------+ |
| | | |
| | 1. Tabletop Exercise | |
| | +-------------------------------------------------+ | |
| | | - Walk through DR procedures | | |
| | | - Identify gaps and issues | | |
| | | - Update documentation | | |
| | +-------------------------------------------------+ | |
| | | |
| | 2. Component Testing | |
| | +-------------------------------------------------+ | |
| | | - Test individual components | | |
| | | - Verify backup/restore | | |
| | | - Validate replication | | |
| | +-------------------------------------------------+ | |
| | | |
| | 3. Simulation Testing | |
| | +-------------------------------------------------+ | |
| | | - Simulate failure scenarios | | |
| | | - Test failover procedures | | |
| | | - Measure RTO/RPO | | |
| | +-------------------------------------------------+ | |
| | | |
| | 4. Full-Scale Testing | |
| | +-------------------------------------------------+ | |
| | | - Complete failover to DR | | |
| | | - Full production traffic | | |
| | | - Validate all systems | | |
| | +-------------------------------------------------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
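Simulation and full-scale tests are where RTO and RPO are actually measured. As a minimal sketch, the three timestamps recorded during a drill are enough to compute both and compare them with the targets. The class and field names are hypothetical, and the sample timestamps are invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    failure_at: datetime           # when the failure was injected or declared
    restored_at: datetime          # when service was confirmed healthy in DR
    last_recovery_point: datetime  # timestamp of the newest data present after recovery

    @property
    def achieved_rto(self) -> timedelta:
        """Downtime: from failure to restored service."""
        return self.restored_at - self.failure_at

    @property
    def achieved_rpo(self) -> timedelta:
        """Data-loss window: from the last recovered data point to the failure."""
        return self.failure_at - self.last_recovery_point


def meets_targets(result: DrillResult, rto_target: timedelta, rpo_target: timedelta) -> bool:
    return result.achieved_rto <= rto_target and result.achieved_rpo <= rpo_target


# Invented sample values for a single drill.
drill = DrillResult(
    failure_at=datetime(2024, 1, 10, 9, 0),
    restored_at=datetime(2024, 1, 10, 9, 42),
    last_recovery_point=datetime(2024, 1, 10, 8, 55),
)
print(drill.achieved_rto, drill.achieved_rpo)                           # 0:42:00 0:05:00
print(meets_targets(drill, timedelta(hours=1), timedelta(minutes=15)))  # True
```

Recording these values for every drill turns the RTO/RPO targets into something that is tracked over time rather than asserted once in a design document.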
# AWS Fault Injection Simulator Experiment
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # EC2 actions such as terminate-instances need the EC2 access policy
        - arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access
  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Simulate AZ failure
      RoleArn: !GetAtt FISExperimentRole.Arn
      Targets:
        Instances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            Environment: production
          # Restrict the target set to a single Availability Zone
          Filters:
            - Path: Placement.AvailabilityZone
              Values:
                - us-east-1a
          # COUNT(1) terminates one matching instance; ALL would take out
          # every tagged instance in the zone
          SelectionMode: COUNT(1)
      Actions:
        TerminateInstance:
          ActionId: aws:ec2:terminate-instances
          Targets:
            Instances: Instances
      # Stop the experiment if the alarm fires; the stop condition takes the alarm ARN
      StopConditions:
        - Source: aws:cloudwatch:alarm
          Value: !GetAtt HighErrorRateAlarm.Arn
      Tags:
        Name: AZ-Failure-Test

HA Best Practices
+------------------------------------------------------------------+
| |
| 1. Design for Failure |
| +--------------------------------------------------------+ |
| | - Assume components will fail | |
| | - Implement redundancy at all layers | |
| | - Use managed services with built-in HA | |
| +--------------------------------------------------------+ |
| |
| 2. Implement Health Checks |
| +--------------------------------------------------------+ |
| | - Application-level health checks | |
| | - Load balancer health checks | |
| | - Route53 health checks | |
| | - Auto Scaling health checks | |
| +--------------------------------------------------------+ |
| |
| 3. Automate Recovery |
| +--------------------------------------------------------+ |
| | - Auto Scaling for self-healing | |
| | - Automated failover for databases | |
| | - Route53 automatic failover | |
| +--------------------------------------------------------+ |
| |
| 4. Test Regularly |
| +--------------------------------------------------------+ |
| | - Conduct regular DR tests | |
| | - Use chaos engineering | |
| | - Document and improve | |
| +--------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
# DR Readiness Checklist
## Infrastructure
- [ ] Multi-AZ deployment for all tiers
- [ ] Cross-region replication configured
- [ ] Automated backups enabled
- [ ] Backup retention meets RPO requirements
## Database
- [ ] Multi-AZ RDS or Aurora configured
- [ ] Read replicas in DR region
- [ ] Cross-region replication tested
- [ ] Point-in-time recovery enabled
## Application
- [ ] Stateless application design
- [ ] Auto Scaling configured
- [ ] Load balancer health checks
- [ ] Blue/green deployment capability
## Network
- [ ] Route53 health checks configured
- [ ] Failover routing policies
- [ ] Cross-region VPC peering
- [ ] VPN/Direct Connect redundancy
## Operations
- [ ] DR runbook documented
- [ ] Failover automation tested
- [ ] Communication plan established
- [ ] Regular DR drills scheduled

Topic         | Key Points
--------------+-----------------------------------------------
RTO/RPO       | Define recovery objectives before designing
Multi-AZ      | Deploy across multiple AZs for HA
DR Strategies | Choose strategy based on RTO/RPO requirements
Automation    | Automate failover and recovery processes
Testing       | Regularly test DR procedures
Monitoring    | Implement comprehensive health checks


Next Chapter: Chapter 47 - Cost Optimization & FinOps