High Availability & Disaster Recovery

Chapter 46: High Availability & Disaster Recovery Architecture

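High Availability (HA) and Disaster Recovery (DR) are critical components of enterprise architecture. HA keeps a service running through component or Availability Zone failures; DR restores service and data after a larger outage, such as the loss of a region. Together they ensure business continuity and minimal downtime.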

HA & DR Overview
+------------------------------------------------------------------+
| |
| +------------------------+ |
| | Resilient Design | |
| +------------------------+ |
| | |
| +---------------------+---------------------+ |
| | | | | |
| v v v v |
| +----------+ +----------+ +----------+ +----------+ |
| | High | | Disaster | | Fault | | Recovery | |
| | Availabi | | Recovery | | Tolerance| | Time | |
| | -lity | | | | | | Objectives| |
| | | | | | | | | |
| | - Uptime | | - DR Plan| | - Redund | | - RTO | |
| | - Redund | | - Backup | | - Failover| | - RPO | |
| | - Load | | - Restore| | - Self-Heal| | - SLA | |
| +----------+ +----------+ +----------+ +----------+ |
| |
+------------------------------------------------------------------+
Concept | Description
--------+----------------------------------------------------------
RTO     | Recovery Time Objective - Maximum acceptable downtime
RPO     | Recovery Point Objective - Maximum acceptable data loss
SLA     | Service Level Agreement - Contractual uptime guarantee
MTTR    | Mean Time To Recovery - Average recovery time

Recovery Objectives
+------------------------------------------------------------------+
| |
| RTO (Recovery Time Objective) |
| +----------------------------------------------------------+ |
| | | |
| | Disaster Recovery Time Service Restored | |
| | +--------+ +----------------+ +----------+ | |
| | | Event | ----> | | --> | Restored | | |
| | +--------+ +----------------+ +----------+ | |
| | |<----- RTO ---->| | |
| | | |
| | Examples: | |
| | - Mission Critical: < 15 minutes | |
| | - Business Critical: < 1 hour | |
| | - Business Operational: < 4 hours | |
| | - Non-Critical: < 24 hours | |
| | | |
| +----------------------------------------------------------+ |
| |
| RPO (Recovery Point Objective) |
| +----------------------------------------------------------+ |
| | | |
| | Last Backup Data Loss Recovery Point | |
| | +--------+ +--------+ +----------+ | |
| | | Backup | ------->| Lost | --------> | Recover | | |
| | +--------+ +--------+ +----------+ | |
| | |<--------------- RPO --------------->| | |
| | | |
| | Examples: | |
| | - Zero Data Loss: RPO = 0 (synchronous replication) | |
| | - Near-Zero: RPO < 1 minute (async replication) | |
| | - Low: RPO < 1 hour (frequent backups) | |
| | - Standard: RPO < 24 hours (daily backups) | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
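The RPO examples above tie the objective to how often data is captured. A minimal sketch of that relationship follows; it assumes a backup captures the state as of its start time and only becomes usable once it completes, so the worst case is a failure just before the next backup finishes. The function names are illustrative, not part of any AWS API.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Largest window of data that can be lost with periodic backups.

    Assumption: a backup reflects the data at its start time and is usable
    only after it completes, so a failure just before completion falls back
    to the previous backup.
    """
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta,
              rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    # Daily backups against the "Standard" and "Low" examples
    print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))
    print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))
```

If the check fails, either shorten the backup interval or move to continuous replication, which is the direction the RPO examples above take.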
Availability SLA
+------------------------------------------------------------------+
| |
| Availability Percentage | Max Downtime per Year |
| ---------------------------+------------------------------------ |
| 99% (Two 9s) | 3.65 days (87.6 hours) |
| 99.9% (Three 9s) | 8.76 hours |
| 99.95% | 4.38 hours |
| 99.99% (Four 9s) | 52.6 minutes |
| 99.999% (Five 9s) | 5.26 minutes |
| |
| Calculation: |
| Availability = (Total Time - Downtime) / Total Time × 100 |
| |
| Example for 99.99% SLA: |
| - Per Year: 365 × 24 × 60 × (1 - 0.9999) = 52.6 minutes |
| - Per Month: 30 × 24 × 60 × (1 - 0.9999) = 4.3 minutes |
| - Per Week: 7 × 24 × 60 × (1 - 0.9999) = 1 minute |
| |
+------------------------------------------------------------------+
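The downtime figures in the table follow directly from the availability formula. The sketch below reproduces them; the function name and the choice of a 365-day year and 30-day month mirror the example above and are otherwise arbitrary.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(availability_pct: float, days: float) -> float:
    """Downtime budget in minutes for a period of `days` days."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return days * MINUTES_PER_DAY * (1 - availability_pct / 100)


if __name__ == "__main__":
    for label, days in (("year", 365), ("month", 30), ("week", 7)):
        minutes = allowed_downtime_minutes(99.99, days)
        print(f"99.99% per {label}: {minutes:.2f} minutes")
```

Dividing the yearly result by 60 gives the hour values in the table, for example roughly 8.76 hours for 99.9%.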

Multi-AZ Architecture
+------------------------------------------------------------------+
| |
| Region (us-east-1) |
| +----------------------------------------------------------+ |
| | | |
| | +------------------+ +------------------+ | |
| | | Availability | | Availability | | |
| | | Zone A | | Zone B | | |
| | | +--------------+ | | +--------------+ | | |
| | | | | | | | | | | |
| | | | +--------+ | | | | +--------+ | | | |
| | | | | EC2 | | | | | | EC2 | | | | |
| | | | | Active | | | | | | Standby| | | | |
| | | | +--------+ | | | | +--------+ | | | |
| | | | | | | | | | | |
| | | +------+-------+ | | +------+-------+ | | |
| | | | | | | | | |
| | +--------+---------+ +--------+---------+ | |
| | | | | |
| | v v | |
| | +----------------------------------------------------------+ |
| | | Application Load Balancer | |
| | +----------------------------------------------------------+ |
| | | | | |
| | v v | |
| | +------------------+ +------------------+ | |
| | | RDS Primary | | RDS Standby | | |
| | | (Sync Replication)|->| (Multi-AZ) | | |
| | +------------------+ +------------------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Active-Active Architecture
+------------------------------------------------------------------+
| |
| Region (us-east-1) |
| +----------------------------------------------------------+ |
| | | |
| | +------------------+ +------------------+ | |
| | | Availability | | Availability | | |
| | | Zone A | | Zone B | | |
| | | | | | | |
| | | +--------+ | | +--------+ | | |
| | | | EC2 | | | | EC2 | | | |
| | | | Active | | | | Active | | | |
| | | +--------+ | | +--------+ | | |
| | | | | | | | | |
| | | v | | v | | |
| | | +--------+ | | +--------+ | | |
| | | | EC2 | | | | EC2 | | | |
| | | | Active | | | | Active | | | |
| | | +--------+ | | +--------+ | | |
| | | | | | | |
| | +--------+---------+ +---------+--------+ | |
| | | | | |
| | +----------+-------------+ | |
| | | | |
| | v | |
| | +----------------------------------------------------------+ |
| | | Application Load Balancer | |
| | | (Distributes traffic across all AZs) | |
| | +----------------------------------------------------------+ |
| | | | |
| | v | |
| | +----------------------------------------------------------+ |
| | | Aurora Database (Multi-AZ) | |
| | | +--------+ +--------+ +--------+ +--------+ | |
| | | |Primary | |Replica1| |Replica2| |Replica3| | |
| | | +--------+ +--------+ +--------+ +--------+ | |
| | +----------------------------------------------------------+ |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
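One way to reason about the two layouts above is to estimate how redundancy changes availability. The sketch below uses the standard series/parallel formulas and assumes that failures are independent, which real AZs only approximate; the per-component figures in the example are made up for illustration.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units where any one can serve traffic."""
    if not 0 <= single <= 1:
        raise ValueError("single must be a fraction between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Hypothetical numbers: each AZ's app tier is 99% available on its own
    app_tier = parallel_availability(0.99, copies=2)
    # Load balancer -> app tier -> database, all required
    print(serial_availability(0.9999, app_tier, 0.9995))
```

The chain is only as strong as its weakest tier, which is why the diagrams add redundancy at the load balancer, compute, and database layers rather than at just one of them.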
# Auto Scaling Group Configuration
Resources:
  WebServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # One subnet per AZ so instances are spread across zones
      VPCZoneIdentifier:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LaunchTemplate:
        LaunchTemplateId: !Ref WebServerLaunchTemplate
        Version: !GetAtt WebServerLaunchTemplate.LatestVersionNumber
      MinSize: 3
      MaxSize: 12
      DesiredCapacity: 6
      # Replace instances that fail the load balancer health check
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref WebServerTargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true

  # Target tracking on average CPU
  ScaleOutPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      # EC2 Auto Scaling target tracking has no ScaleOutCooldown/ScaleInCooldown
      # (those belong to Application Auto Scaling); instance warm-up is set instead.
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70

  # Scale on ALB request count
  ScaleOnRequestCount:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          # Format: app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>,
          # built from the "full name" attributes rather than the ARNs
          ResourceLabel: !Sub "${ApplicationLoadBalancer.LoadBalancerFullName}/${WebServerTargetGroup.TargetGroupFullName}"
        TargetValue: 1000
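To build intuition for the target tracking policies above, the sketch below models the capacity a policy would aim for. It is a simplified proportional approximation, not AWS's actual algorithm (which is not spelled out here); the function name and the assumption that load scales linearly with instance count are illustrative.

```python
import math


def target_tracking_capacity(current_capacity: int,
                             metric_value: float,
                             target_value: float,
                             min_size: int,
                             max_size: int) -> int:
    """Approximate desired capacity that brings the metric back to target.

    Assumption: the per-instance metric is inversely proportional to the
    number of instances, so capacity scales by metric_value / target_value.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    proposed = math.ceil(current_capacity * metric_value / target_value)
    # The group never leaves its MinSize..MaxSize bounds
    return max(min_size, min(max_size, proposed))


if __name__ == "__main__":
    # 6 instances at 90% CPU with a 70% target, bounds 3..12
    print(target_tracking_capacity(6, 90, 70, 3, 12))
```

The clamp is the part that matters for HA planning: MinSize should be large enough that losing one AZ still leaves enough capacity to serve traffic while replacements launch.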

DR Strategies
+------------------------------------------------------------------+
| |
| Strategy | RTO | RPO | Cost |
| -------------------+------------+------------+------------------ |
| Backup & Restore | Hours-Days | Hours-Days | $ |
| Pilot Light | Hours | Hours | $$ |
| Warm Standby | Minutes | Minutes | $$$ |
| Multi-Region | Seconds | Seconds | $$$$ |
| Active-Active | Real-time | Zero | $$$$$ |
| |
+------------------------------------------------------------------+
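The table can be read as a lookup: pick the cheapest strategy whose recovery characteristics satisfy the business targets. The sketch below encodes that idea. The numeric RTO/RPO bounds are illustrative assumptions chosen to match the qualitative ranges in the table (hours, minutes, seconds); they are not measured or vendor-published figures.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    rto: timedelta   # assumed achievable recovery time
    rpo: timedelta   # assumed achievable data-loss window
    cost_rank: int   # 1 = "$" ... 5 = "$$$$$"


# Illustrative bounds only; calibrate against your own DR test results
STRATEGIES = (
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=4), timedelta(hours=1), 2),
    Strategy("Warm Standby", timedelta(minutes=30), timedelta(minutes=5), 3),
    Strategy("Multi-Region", timedelta(seconds=60), timedelta(seconds=30), 4),
    Strategy("Active-Active", timedelta(0), timedelta(0), 5),
)


def cheapest_strategy(required_rto: timedelta,
                      required_rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, if any."""
    candidates = [s for s in STRATEGIES
                  if s.rto <= required_rto and s.rpo <= required_rpo]
    return min(candidates, key=lambda s: s.cost_rank) if candidates else None


if __name__ == "__main__":
    choice = cheapest_strategy(timedelta(hours=1), timedelta(minutes=15))
    print(choice.name if choice else "no strategy fits")
```

Both targets must hold at once, so a tight RPO can force a more expensive strategy even when the RTO is relaxed.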
Backup & Restore Strategy
+------------------------------------------------------------------+
| |
| Primary Region |
| +----------------------------------------------------------+ |
| | | |
| | +----------+ +----------+ +----------+ | |
| | | EC2 | | RDS | | S3 | | |
| | | Instances| | Database | | Buckets | | |
| | +----+-----+ +----+-----+ +----+-----+ | |
| | | | | | |
| | v v v | |
| | +----------------------------------------------------+ | |
| | | Backup Process | | |
| | | - EBS Snapshots | | |
| | | - RDS Snapshots | | |
| | | - S3 Cross-Region Replication | | |
| | +----------------------------------------------------+ | |
| | | | | | |
| +-------+------------+------------+-------------------------+ |
| | | | |
| v v v |
| +----------------------------------------------------------+ |
| | S3 (Backup Storage) | |
| +----------------------------------------------------------+ |
| | | | |
| v v v |
| DR Region |
| +----------------------------------------------------------+ |
| | | |
| | Recovery Process: | |
| | 1. Restore EBS snapshots to new volumes | |
| | 2. Restore RDS snapshot to new instance | |
| | 3. Create EC2 instances from AMI | |
| | 4. Update DNS to point to DR region | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
Pilot Light Strategy
+------------------------------------------------------------------+
| |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Stopped| | | |
| | | +--------+ | | | | +--------+ | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Stopped| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | ^ | |
| | v | | | | |
| | +------------------+ | | +------------------+ | |
| | | Database Tier | | | | Database Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | RDS |------|--|------>| | RDS | | | |
| | | | Primary| Async| | Rep | | Replica| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Failover Process: |
| 1. Promote RDS replica to primary |
| 2. Start EC2 instances in DR region |
| 3. Update Route53 to point to DR region |
| 4. Scale out application tier as needed |
| |
+------------------------------------------------------------------+
Warm Standby Strategy
+------------------------------------------------------------------+
| |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | | (Scaled| | | |
| | | +--------+ | | | | | Down) | | | |
| | | | EC2 | | | | | +--------+ | | |
| | | | Active | | | | | +--------+ | | |
| | | +--------+ | | | | | EC2 | | | |
| | +------------------+ | | | | Active | | | |
| | | | | +------------------+ | |
| | v | | ^ | |
| | +------------------+ | | | | |
| | | Database Tier | | | +------------------+ | |
| | | | | | | Database Tier | | |
| | | +--------+ | | | | | | |
| | | | RDS |------|--|------>| | RDS | | | |
| | | | Primary| Async| | Rep | | Replica| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Failover Process: |
| 1. Promote RDS replica to primary |
| 2. Scale out EC2 instances in DR region |
| 3. Update Route53 health checks |
| 4. Traffic automatically routes to DR |
| |
+------------------------------------------------------------------+
Multi-Region Active-Active
+------------------------------------------------------------------+
| |
| Route53 |
| +----------------------------------------------------------+ |
| | Latency-Based Routing | |
| +----------------------------------------------------------+ |
| | | |
| v v |
| Primary Region DR Region |
| +------------------------+ +------------------------+ |
| | | | | |
| | +------------------+ | | +------------------+ | |
| | | Application Tier | | | | Application Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | +--------+ | | |
| | | +--------+ | | | | +--------+ | | |
| | | | EC2 | | | | | | EC2 | | | |
| | | | Active | | | | | | Active | | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | | | |
| | v | | v | |
| | +------------------+ | | +------------------+ | |
| | | Database Tier | | | | Database Tier | | |
| | | | | | | | | |
| | | +--------+ | | | | +--------+ | | |
| | | | Aurora |------|--|------>| | Aurora | | | |
| | | | Primary| Async| | Rep | | Replica| | | |
| | | +--------+ | | | | +--------+ | | |
| | +------------------+ | | +------------------+ | |
| | | | | |
| +------------------------+ +------------------------+ |
| |
| Features: |
| - Global load balancing with Route53 |
| - Aurora Global Database for multi-region |
| - DynamoDB Global Tables for NoSQL |
| - S3 Cross-Region Replication |
| |
+------------------------------------------------------------------+

# RDS Multi-AZ Configuration
Resources:
  PrimaryDB:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      Engine: postgres
      EngineVersion: "14.7"
      DBInstanceClass: db.r6g.xlarge
      AllocatedStorage: 100
      StorageType: gp3
      # MultiAZ creates a synchronous standby in a second AZ. RDS chooses the
      # zones from the subnet group, so AvailabilityZone is not set here.
      MultiAZ: true
      # Credentials are required for a new instance; these two lines are an
      # assumption (RDS-managed password stored in Secrets Manager).
      MasterUsername: dbadmin
      ManageMasterUserPassword: true
      DBSubnetGroupName: !Ref DBSubnetGroup
      VPCSecurityGroups:
        - !Ref DBSecurityGroup
      BackupRetentionPeriod: 7
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  DBSubnetGroup:
    Type: AWS::RDS::DBSubnetGroup
    Properties:
      DBSubnetGroupDescription: DB subnet group
      SubnetIds:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
# Aurora Global Database
Resources:
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      DatabaseName: appdb
      MasterUsername: dbadmin  # "admin" is a reserved word for PostgreSQL engines
      MasterUserPassword: !Ref DBPassword
      DBClusterParameterGroupName: default.aurora-postgresql14
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroup
      DBSubnetGroupName: !Ref DBSubnetGroup
      EnableCloudwatchLogsExports:
        - postgresql
      BackupRetentionPeriod: 35
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  # Primary region instances
  AuroraPrimaryInstance1:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1a

  AuroraPrimaryInstance2:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1b

  # Global cluster
  AuroraGlobalCluster:
    Type: AWS::RDS::GlobalCluster
    Properties:
      GlobalClusterIdentifier: app-global-cluster
      # Encryption settings are inherited from the source cluster
      SourceDBClusterIdentifier: !Ref AuroraCluster

# Secondary region cluster (in us-west-2)
# A CloudFormation stack is bound to one region and DBCluster has no Region
# property, so this resource belongs in a separate stack deployed to us-west-2.
# An encrypted global database also needs a KmsKeyId for a key in that region.
Resources:
  AuroraSecondaryCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      GlobalClusterIdentifier: app-global-cluster
      DBSubnetGroupName: !Ref DBSubnetGroupDR
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroupDR
# DynamoDB Global Table
Resources:
  GlobalTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: ApplicationData
      AttributeDefinitions:
        - AttributeName: PK
          AttributeType: S
        - AttributeName: SK
          AttributeType: S
      KeySchema:
        - AttributeName: PK
          KeyType: HASH
        - AttributeName: SK
          KeyType: RANGE
      BillingMode: PAY_PER_REQUEST
      # Streams are required for replication between regions
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      Replicas:
        - Region: us-east-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: us-west-2
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: eu-west-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS

# Multi-AZ Load Balancer
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: web-app-alb
      Type: application
      Scheme: internet-facing
      IpAddressType: ipv4
      SecurityGroups:
        - !Ref ALBSecurityGroup
      # One subnet per AZ
      Subnets:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: "60"
        - Key: deletion_protection.enabled
          Value: "true"
        - Key: routing.http2.enabled
          Value: "true"

  # Target group with health checks
  WebServerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: web-servers
      Port: 80
      Protocol: HTTP
      VpcId: !Ref VPC
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /health
      HealthCheckPort: 80
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      # 3 consecutive results are needed to change a target's state
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "30"
        - Key: stickiness.enabled
          Value: "false"

  # Listener forwarding to the target group
  HTTPListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref WebServerTargetGroup
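The healthy and unhealthy thresholds in the target group mean that a target changes state only after several consecutive matching results. The sketch below simulates that behaviour as a small state machine; it is a simplified model written for illustration and not the load balancer's actual implementation.

```python
from typing import Iterable, List


def simulate_target_health(results: Iterable[bool],
                           healthy_threshold: int = 3,
                           unhealthy_threshold: int = 3,
                           initially_healthy: bool = True) -> List[bool]:
    """Return the target's state (True = healthy) after each check result."""
    healthy = initially_healthy
    streak = 0  # consecutive results that contradict the current state
    states = []
    for check_passed in results:
        if check_passed == healthy:
            streak = 0
        else:
            streak += 1
            threshold = healthy_threshold if check_passed else unhealthy_threshold
            if streak >= threshold:
                healthy = check_passed
                streak = 0
        states.append(healthy)
    return states


if __name__ == "__main__":
    # Two failures, a recovery, then three failures in a row
    checks = [False, False, True, False, False, False]
    print(simulate_target_health(checks))
```

With a 30-second interval and a threshold of 3, a failed target keeps receiving traffic for roughly 90 seconds under this model, which is worth including when estimating RTO.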
# Route53 Failover Configuration
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: primary-health-check

  DRHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: dr.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: dr-health-check

  # Primary record set
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt ApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      SetIdentifier: primary

  # DR record set
  # DRApplicationLoadBalancer lives in the DR region; if it is created in a
  # separate stack, pass its DNS name and hosted zone ID in as parameters.
  DRRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt DRApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt DRApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: SECONDARY
      HealthCheckId: !Ref DRHealthCheck
      SetIdentifier: dr
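The health check settings above put a floor under how fast DNS failover can happen. A rough estimate, sketched below, is the detection time (failure threshold times request interval) plus the time clients may keep a cached answer. The 60-second TTL in the example is an assumption, and the model ignores checker aggregation delays and resolvers that do not honour TTLs.

```python
def estimated_dns_failover_seconds(failure_threshold: int,
                                   request_interval_s: int,
                                   record_ttl_s: int) -> int:
    """Rough upper bound on the time until clients resolve to the DR endpoint."""
    if min(failure_threshold, request_interval_s) < 1 or record_ttl_s < 0:
        raise ValueError("threshold and interval must be >= 1, TTL >= 0")
    detection = failure_threshold * request_interval_s
    return detection + record_ttl_s


def fits_rto(failover_seconds: int, rto_seconds: int) -> bool:
    """True if DNS failover alone leaves room inside the RTO."""
    return failover_seconds <= rto_seconds


if __name__ == "__main__":
    # FailureThreshold 3, RequestInterval 30, assumed 60 s TTL
    seconds = estimated_dns_failover_seconds(3, 30, 60)
    print(seconds, fits_rto(seconds, rto_seconds=15 * 60))
```

If the estimate consumes too much of the RTO, a shorter request interval or a lower failure threshold reduces detection time, at the cost of more sensitivity to transient errors.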

# AWS Backup Configuration
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: ApplicationBackupVault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn  # Ref would return the key ID

  # Daily backup plan
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: DailyBackupPlan
        BackupPlanRule:
          - RuleName: DailyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * * *)" # Daily at 5 AM UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
            Lifecycle:
              # No cold-storage transition here: AWS Backup requires
              # DeleteAfterDays to be at least 90 days later than
              # MoveToColdStorageAfterDays.
              DeleteAfterDays: 30
            CopyActions:
              # Copy to the DR region; DRBackupVaultArn is an assumed
              # parameter holding the destination vault ARN
              - DestinationBackupVaultArn: !Ref DRBackupVaultArn
                Lifecycle:
                  DeleteAfterDays: 90

  # Weekly backup plan
  WeeklyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: WeeklyBackupPlan
        BackupPlanRule:
          - RuleName: WeeklyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * SUN *)" # Weekly on Sunday
            StartWindowMinutes: 60
            CompletionWindowMinutes: 240
            Lifecycle:
              DeleteAfterDays: 365
              MoveToColdStorageAfterDays: 30

  # Backup selection
  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref DailyBackupPlan
      BackupSelection:
        SelectionName: ApplicationResources
        IamRoleArn: !GetAtt BackupServiceRole.Arn
        # Resources must be ARNs; Ref returns IDs for these resource types
        Resources:
          - !GetAtt PrimaryDB.DBInstanceArn
          - !Sub "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume1}"
          - !Sub "arn:aws:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume2}"
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: Backup
            ConditionValue: required
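Backup lifecycle rules have one constraint that is easy to miss: backups moved to cold storage must stay there for at least 90 days, so DeleteAfterDays has to be at least 90 days greater than MoveToColdStorageAfterDays. The helper below is a hypothetical pre-deployment check for that rule; the function name and message wording are illustrative.

```python
from typing import List, Optional

COLD_STORAGE_MIN_DAYS = 90


def lifecycle_problems(delete_after_days: Optional[int],
                       move_to_cold_after_days: Optional[int]) -> List[str]:
    """Return human-readable problems with a backup lifecycle, if any."""
    problems = []
    if delete_after_days is not None and delete_after_days < 1:
        problems.append("DeleteAfterDays must be at least 1")
    if delete_after_days is not None and move_to_cold_after_days is not None:
        earliest_delete = move_to_cold_after_days + COLD_STORAGE_MIN_DAYS
        if delete_after_days < earliest_delete:
            problems.append(
                f"DeleteAfterDays ({delete_after_days}) must be >= "
                f"MoveToColdStorageAfterDays + {COLD_STORAGE_MIN_DAYS} "
                f"({earliest_delete})"
            )
    return problems


if __name__ == "__main__":
    print(lifecycle_problems(365, 30))  # valid combination
    print(lifecycle_problems(30, 7))    # deleted before the cold-storage minimum
```

Running a check like this in CI is cheaper than discovering the error when the stack update fails.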
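# S3 Cross-Region Replication
Resources:
  PrimaryBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: primary-data-bucket
      # Versioning is required on both source and destination buckets
      VersioningConfiguration:
        Status: Enabled
      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn  # role ARN, not the role name
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1
            # Rules that set Priority/DeleteMarkerReplication need a Filter;
            # an empty prefix matches every object
            Filter:
              Prefix: ""
            DeleteMarkerReplication:
              Status: Enabled
            Destination:
              Bucket: arn:aws:s3:::dr-data-bucket  # destination is given as an ARN
              StorageClass: STANDARD
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15 # Replicate within 15 minutes
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToGlacier
            Status: Enabled
            Transitions:
              - TransitionInDays: 90
                StorageClass: GLACIER

  # For cross-region replication this bucket must be created in the DR
  # region, i.e. in a stack deployed there.
  DRBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: dr-data-bucket
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true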

failover_function.py
import json
import logging

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """
    Automated failover function
    """
    # Parse event
    primary_region = event.get('primary_region', 'us-east-1')
    dr_region = event.get('dr_region', 'us-west-2')
    db_cluster = event.get('db_cluster')
    hosted_zone_id = event.get('hosted_zone_id')
    record_name = event.get('record_name')
    asg_name = event.get('asg_name')
    # Alias target of the DR load balancer (assumed event fields)
    dr_alb_hosted_zone_id = event.get('dr_alb_hosted_zone_id')
    dr_alb_dns_name = event.get('dr_alb_dns_name')

    # RDS and Auto Scaling are regional, so the clients target the DR region.
    # Route 53 is a global service.
    rds = boto3.client('rds', region_name=dr_region)
    autoscaling = boto3.client('autoscaling', region_name=dr_region)
    route53 = boto3.client('route53')

    try:
        # 1. Promote RDS replica to primary
        logger.info(f"Promoting RDS replica in {dr_region}")
        rds.promote_read_replica_db_cluster(
            DBClusterIdentifier=db_cluster
        )

        # 2. Scale out DR Auto Scaling Group
        logger.info(f"Scaling out ASG {asg_name}")
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=6,
            HonorCooldown=False
        )

        # 3. Update Route53 record
        logger.info(f"Updating Route53 record {record_name}")
        route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': record_name,
                            'Type': 'A',
                            'AliasTarget': {
                                'HostedZoneId': dr_alb_hosted_zone_id,
                                'DNSName': dr_alb_dns_name,
                                'EvaluateTargetHealth': True
                            }
                        }
                    }
                ]
            }
        )

        # 4. Log completion (publish a notification here if one is required)
        logger.info("Failover completed successfully")
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Failover completed',
                'primary_region': primary_region,
                'dr_region': dr_region
            })
        }
    except Exception as e:
        # logger.exception records the traceback along with the message
        logger.exception("Failover failed")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'message': 'Failover failed',
                'error': str(e)
            })
        }
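The handler reads several fields from the event without defaults, so a missing field only surfaces as an API error part-way through a failover. A small validation step, sketched below, can reject incomplete events before any change is made. The helper name is hypothetical, and the default field list covers the four fields the handler looks up without a fallback; extend it with any other fields your handler reads.

```python
from typing import Iterable, List, Mapping

# Fields the failover handler looks up without a default value
REQUIRED_FIELDS = ("db_cluster", "hosted_zone_id", "record_name", "asg_name")


def missing_failover_fields(event: Mapping[str, object],
                            required: Iterable[str] = REQUIRED_FIELDS) -> List[str]:
    """Return the names of required fields that are absent or empty."""
    return [name for name in required if not event.get(name)]


if __name__ == "__main__":
    sample_event = {
        "db_cluster": "my-cluster",
        "record_name": "www.example.com",
    }
    missing = missing_failover_fields(sample_event)
    if missing:
        print("Refusing to fail over, missing:", ", ".join(missing))
```

Calling this at the top of the handler and returning early keeps a malformed trigger from promoting the database while leaving DNS untouched.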
# CloudWatch Events for automated failover
Resources:
  FailoverTrigger:
    Type: AWS::Events::Rule
    Properties:
      Name: FailoverTrigger
      Description: Trigger failover on primary region failure
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
        detail:
          # AWS Health events carry the category in eventTypeCategory
          eventTypeCategory:
            - issue
            - accountNotification
          service:
            - EC2
            - RDS
          statusCode:
            - open
      State: ENABLED
      Targets:
        - Id: FailoverFunction
          Arn: !GetAtt FailoverFunction.Arn  # Ref would return the function name
          InputTransformer:
            InputPathsMap:
              region: $.region
              service: $.detail.service
            # The function also expects db_cluster, hosted_zone_id, record_name
            # and asg_name; add them here or give the function defaults.
            InputTemplate: |
              {
                "primary_region": "us-east-1",
                "dr_region": "us-west-2",
                "service": "<service>",
                "trigger": "health_event"
              }

DR Testing Framework
+------------------------------------------------------------------+
| |
| Test Types |
| +----------------------------------------------------------+ |
| | | |
| | 1. Tabletop Exercise | |
| | +-------------------------------------------------+ | |
| | | - Walk through DR procedures | | |
| | | - Identify gaps and issues | | |
| | | - Update documentation | | |
| | +-------------------------------------------------+ | |
| | | |
| | 2. Component Testing | |
| | +-------------------------------------------------+ | |
| | | - Test individual components | | |
| | | - Verify backup/restore | | |
| | | - Validate replication | | |
| | +-------------------------------------------------+ | |
| | | |
| | 3. Simulation Testing | |
| | +-------------------------------------------------+ | |
| | | - Simulate failure scenarios | | |
| | | - Test failover procedures | | |
| | | - Measure RTO/RPO | | |
| | +-------------------------------------------------+ | |
| | | |
| | 4. Full-Scale Testing | |
| | +-------------------------------------------------+ | |
| | | - Complete failover to DR | | |
| | | - Full production traffic | | |
| | | - Validate all systems | | |
| | +-------------------------------------------------+ | |
| | | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
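The simulation and full-scale tests above call for measuring RTO and RPO. One way to do that is to record three timestamps during a drill and compare the resulting intervals with the targets. The sketch below assumes those timestamps are collected by hand or from logs; the class and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillTimeline:
    last_recoverable_point: datetime  # newest data present after recovery
    failure_injected: datetime        # when the simulated failure started
    service_restored: datetime        # when the service was serving again

    @property
    def measured_rpo(self) -> timedelta:
        return self.failure_injected - self.last_recoverable_point

    @property
    def measured_rto(self) -> timedelta:
        return self.service_restored - self.failure_injected


def evaluate_drill(timeline: DrillTimeline,
                   rto_target: timedelta,
                   rpo_target: timedelta) -> dict:
    """Compare measured recovery figures against the targets."""
    return {
        "measured_rto": timeline.measured_rto,
        "measured_rpo": timeline.measured_rpo,
        "rto_met": timeline.measured_rto <= rto_target,
        "rpo_met": timeline.measured_rpo <= rpo_target,
    }


if __name__ == "__main__":
    drill = DrillTimeline(
        last_recoverable_point=datetime(2024, 1, 10, 9, 58),
        failure_injected=datetime(2024, 1, 10, 10, 0),
        service_restored=datetime(2024, 1, 10, 10, 12),
    )
    print(evaluate_drill(drill, timedelta(minutes=15), timedelta(minutes=1)))
```

Keeping the results from each drill makes it possible to see whether recovery times are trending toward or away from the targets.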
# AWS Fault Injection Simulator Experiment
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # Terminating instances needs the EC2 access policy
        - arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access

  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Simulate AZ failure
      RoleArn: !GetAtt FISExperimentRole.Arn
      Targets:
        Instances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            Environment: production
          # Restrict the target set to a single AZ
          Filters:
            - Path: Placement.AvailabilityZone
              Values:
                - us-east-1a
          # One instance; ALL would take out every matching instance in the AZ
          SelectionMode: COUNT(1)
      Actions:
        TerminateInstance:
          ActionId: aws:ec2:terminate-instances
          Targets:
            Instances: Instances
      StopConditions:
        # Stop conditions take the alarm ARN
        - Source: aws:cloudwatch:alarm
          Value: !GetAtt HighErrorRateAlarm.Arn
      Tags:
        Name: AZ-Failure-Test

HA/DR is the backbone of reliability engineering. SREs must design systems that survive failures and recover quickly to meet SLOs.

HA/DR in DevOps/SRE
+------------------------------------------------------------------+
| |
| SRE Reliability Engineering: |
| |
| 1. Error Budget Protection |
| +----------------------------------------------------------+ |
| | - HA reduces unplanned downtime | |
| | - DR minimizes recovery time | |
| | - Both protect error budgets | |
| +----------------------------------------------------------+ |
| |
| 2. Incident Response |
| +----------------------------------------------------------+ |
| | - Runbooks for failover procedures | |
| | - Automated detection and recovery | |
| | - Clear RTO/RPO targets | |
| +----------------------------------------------------------+ |
| |
| 3. Chaos Engineering |
| +----------------------------------------------------------+ |
| | - Regularly test failover procedures | |
| | - Game days for team training | |
| | - Build confidence in recovery | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
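Error budget protection can be made concrete with a little arithmetic: the SLO defines a downtime allowance for a window, and each incident spends part of it. The sketch below assumes a 30-day window and a time-based SLO; the function names are illustrative.

```python
from typing import Iterable


def error_budget_minutes(slo_pct: float, window_days: int = 30) -> float:
    """Total downtime allowed by the SLO over the window, in minutes."""
    return window_days * 24 * 60 * (1 - slo_pct / 100)


def remaining_budget_fraction(slo_pct: float,
                              incident_minutes: Iterable[float],
                              window_days: int = 30) -> float:
    """Fraction of the error budget left; negative means the SLO was missed."""
    budget = error_budget_minutes(slo_pct, window_days)
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - sum(incident_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO and two incidents in the current 30-day window
    print(f"{error_budget_minutes(99.9):.1f} minutes of budget")
    print(f"{remaining_budget_fraction(99.9, [6.0, 4.8]):.0%} remaining")
```

A single slow failover can spend most of a month's budget, which is the practical argument for the automated detection and recovery listed above.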

~/bin/test-failover.sh
#!/bin/bash
# Test failover
set -euo pipefail

# Check current primary
PRIMARY=$(aws rds describe-db-instances \
  --db-instance-identifier my-db \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text)
echo "Primary status: $PRIMARY"

# Trigger failover
aws rds failover-db-cluster \
  --db-cluster-identifier my-cluster \
  --target-db-instance-identifier my-replica

HA/DR Anti-Patterns
+------------------------------------------------------------------+
| |
| ❌ Mistake 1: Not Testing DR Procedures |
| +----------------------------------------------------------+ |
| | Problem: Assumes failover works without testing | |
| | Impact: Real disasters expose untested plans | |
| | Fix: Regular game days and chaos experiments | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 2: Single AZ Deployments |
| +----------------------------------------------------------+ |
| | Problem: All resources in one AZ | |
| | Impact: AZ failure takes everything down | |
| | Fix: Deploy across multiple AZs | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 3: Not Backing Up Data |
| +----------------------------------------------------------+ |
| | Problem: No backups or infrequent backups | |
| | Impact: Data loss on failure | |
| | Fix: Automated backups with tested restores | |
| +----------------------------------------------------------+ |
| |
| ❌ Mistake 4: Setting Unrealistic RTO/RPO |
| +----------------------------------------------------------+ |
| | Problem: Targets not matching business needs | |
| | Impact: Either overspend or miss recovery targets | |
| | Fix: Align with business requirements and budget | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

  1. Q: Explain RTO vs RPO.

    • A: RTO (Recovery Time Objective) is how long you can tolerate downtime. RPO (Recovery Point Objective) is how much data loss you can tolerate, measured as a window of time. Together they drive your DR strategy and architecture choices.
  2. Q: What’s the difference between active-passive and active-active DR?

    • A: Active-passive: the secondary site is idle or scaled down until failover. It is cheaper but recovers more slowly. Active-active: both sites serve traffic simultaneously. It is more expensive but failover is near-instant, because traffic only needs to be routed away from the failed site.
  3. Q: Design a multi-region disaster recovery strategy.

    • A: Use primary/secondary: (1) Primary in one region, DR in another, (2) Async replication for databases, (3) Automated failover with Route53 health checks, (4) Regular DR drills, (5) Documented runbooks.

HA Best Practices
+------------------------------------------------------------------+
| |
| 1. Design for Failure |
| +--------------------------------------------------------+ |
| | - Assume components will fail | |
| | - Implement redundancy at all layers | |
| | - Use managed services with built-in HA | |
| +--------------------------------------------------------+ |
| |
| 2. Implement Health Checks |
| +--------------------------------------------------------+ |
| | - Application-level health checks | |
| | - Load balancer health checks | |
| | - Route53 health checks | |
| | - Auto Scaling health checks | |
| +--------------------------------------------------------+ |
| |
| 3. Automate Recovery |
| +--------------------------------------------------------+ |
| | - Auto Scaling for self-healing | |
| | - Automated failover for databases | |
| | - Route53 automatic failover | |
| +--------------------------------------------------------+ |
| |
| 4. Test Regularly |
| +--------------------------------------------------------+ |
| | - Conduct regular DR tests | |
| | - Use chaos engineering | |
| | - Document and improve | |
| +--------------------------------------------------------+ |
| |
+------------------------------------------------------------------+
# DR Readiness Checklist
## Infrastructure
- [ ] Multi-AZ deployment for all tiers
- [ ] Cross-region replication configured
- [ ] Automated backups enabled
- [ ] Backup retention meets RPO requirements
## Database
- [ ] Multi-AZ RDS or Aurora configured
- [ ] Read replicas in DR region
- [ ] Cross-region replication tested
- [ ] Point-in-time recovery enabled
## Application
- [ ] Stateless application design
- [ ] Auto Scaling configured
- [ ] Load balancer health checks
- [ ] Blue/green deployment capability
## Network
- [ ] Route53 health checks configured
- [ ] Failover routing policies
- [ ] Cross-region VPC peering
- [ ] VPN/Direct Connect redundancy
## Operations
- [ ] DR runbook documented
- [ ] Failover automation tested
- [ ] Communication plan established
- [ ] Regular DR drills scheduled

Topic         | Key Points
--------------+----------------------------------------------------
RTO/RPO       | Define recovery objectives before designing
Multi-AZ      | Deploy across multiple AZs for HA
DR Strategies | Choose strategy based on RTO/RPO requirements
Automation    | Automate failover and recovery processes
Testing       | Regularly test DR procedures
Monitoring    | Implement comprehensive health checks


Exam Tip

Key Exam Points
+------------------------------------------------------------------+
| |
| 1. RTO (Recovery Time Objective): Max acceptable downtime |
| |
| 2. RPO (Recovery Point Objective): Max acceptable data loss |
| |
| 3. DR Strategies: Backup & Restore, Pilot Light, Warm Standby, |
| Active-Active |
| |
| 4. Multi-AZ vs Multi-Region: AZ for HA, Region for DR |
| |
| 5. AWS Services: RDS Multi-AZ, Aurora Global, Route 53, ELB |
| |
| 6. Failover: Automatic vs Manual trigger |
| |
| 7. Data Replication: Synchronous vs Asynchronous |
| |
| 8. Testing: Use AWS FIS for chaos engineering |
| |
| 9. AWS Backup: Centralized backup management |
| |
| 10. Cost vs Recovery: Balance requirements with budget |
| |
+------------------------------------------------------------------+

Chapter 46 Summary
+------------------------------------------------------------------+
| |
| High Availability & Disaster Recovery |
| +------------------------------------------------------------+ |
| | - RTO/RPO: Define recovery objectives | |
| | - Multi-AZ: High availability within region | |
| | - Multi-Region: Disaster recovery | |
| | - Automation: Automated failover and recovery | |
| +------------------------------------------------------------+ |
| |
| DR Strategies |
| +------------------------------------------------------------+ |
| | - Backup & Restore: Low cost, high RTO | |
| | - Pilot Light: Minimal core services | |
| | - Warm Standby: Scaled-down production | |
| | - Active-Active: Full production in multiple regions | |
| +------------------------------------------------------------+ |
| |
| Best Practices |
| +------------------------------------------------------------+ |
| | - Define RTO/RPO before design | |
| | - Test failover regularly | |
| | - Automate recovery processes | |
| | - Document runbooks | |
| +------------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Next Chapter: Chapter 47 - Cost Optimization & FinOps