High Availability & Disaster Recovery

Chapter 46: High Availability & Disaster Recovery Architecture

Building Resilient AWS Architectures

46.1 Overview

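High Availability (HA) and Disaster Recovery (DR) are core concerns of enterprise architecture. HA keeps a workload running through the failure of individual components, such as an instance or an Availability Zone. DR restores service and data after a larger event, such as the loss of an entire Region. Together they protect business continuity and keep downtime to a minimum.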

                    HA & DR Overview
+----------------------------------------------------------------------+
|                                                                      |
|                      +------------------------+                      |
|                      |    Resilient Design    |                      |
|                      +------------------------+                      |
|                                   |                                  |
|          +----------------+-------+--------+----------------+        |
|          |                |                |                |        |
|          v                v                v                v        |
|  +--------------+ +--------------+ +--------------+ +--------------+ |
|  | High         | | Disaster     | | Fault        | | Recovery     | |
|  | Availability | | Recovery     | | Tolerance    | | Objectives   | |
|  |              | |              | |              | |              | |
|  | - Uptime     | | - DR Plan    | | - Redundancy | | - RTO        | |
|  | - Redundancy | | - Backup     | | - Failover   | | - RPO        | |
|  | - Load Bal.  | | - Restore    | | - Self-Heal  | | - SLA        | |
|  +--------------+ +--------------+ +--------------+ +--------------+ |
|                                                                      |
+----------------------------------------------------------------------+

Key Concepts

Concept	Description
RTO	Recovery Time Objective - Maximum acceptable time to restore service after an outage
RPO	Recovery Point Objective - Maximum acceptable data loss, measured as a span of time before the outage
SLA	Service Level Agreement - Contractual uptime commitment
MTTR	Mean Time To Recovery - Average time taken to restore service after a failure

46.2 Recovery Objectives

RTO and RPO

                    Recovery Objectives
+------------------------------------------------------------------+
|                                                                   |
|  RTO (Recovery Time Objective)                                   |
|  +----------------------------------------------------------+   |
|  |                                                           |   |
|  |  Disaster         Recovery Time         Service Restored  |   |
|  |  +--------+       +----------------+     +----------+     |   |
|  |  | Event  | ----> |                | --> | Restored |     |   |
|  |  +--------+       +----------------+     +----------+     |   |
|  |                    |<----- RTO ---->|                      |   |
|  |                                                           |   |
|  |  Examples:                                                |   |
|  |  - Mission Critical: < 15 minutes                         |   |
|  |  - Business Critical: < 1 hour                            |   |
|  |  - Business Operational: < 4 hours                        |   |
|  |  - Non-Critical: < 24 hours                               |   |
|  |                                                           |   |
|  +----------------------------------------------------------+   |
|                                                                   |
|  RPO (Recovery Point Objective)                                  |
|  +----------------------------------------------------------+   |
|  |                                                           |   |
|  |  Last Backup         Data Loss           Recovery Point   |   |
|  |  +--------+         +--------+           +----------+     |   |
|  |  | Backup | ------->|  Lost  | --------> | Recover  |     |   |
|  |  +--------+         +--------+           +----------+     |   |
|  |       |<--------------- RPO --------------->|              |   |
|  |                                                           |   |
|  |  Examples:                                                |   |
|  |  - Zero Data Loss: RPO = 0 (synchronous replication)     |   |
|  |  - Near-Zero: RPO < 1 minute (async replication)         |   |
|  |  - Low: RPO < 1 hour (frequent backups)                  |   |
|  |  - Standard: RPO < 24 hours (daily backups)              |   |
|  |                                                           |   |
|  +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
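
The relationship between backup frequency and RPO can be checked with a few lines of Python. This is a minimal sketch: the tier names and limits are copied from the examples in the diagram, while the function names are illustrative and not part of any AWS API. It assumes periodic backups, where the worst case is a failure just before the next backup runs.

```python
from datetime import timedelta

# Example RTO limits per tier, copied from the diagram above.
RTO_LIMITS = {
    "mission-critical": timedelta(minutes=15),
    "business-critical": timedelta(hours=1),
    "business-operational": timedelta(hours=4),
    "non-critical": timedelta(hours=24),
}


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """A failure just before the next backup loses one full interval of data."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(measured_recovery: timedelta, tier: str) -> bool:
    """True if a measured recovery (for example from a DR drill) fits the tier's RTO."""
    return measured_recovery <= RTO_LIMITS[tier]


# Daily backups satisfy a 24-hour RPO but not a 1-hour RPO.
print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))
```

Feeding the recovery time measured in a drill into meets_rto is one way to confirm that a tier's target is actually achievable rather than only documented.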

SLA Calculations

                    Availability SLA
+------------------------------------------------------------------+
|                                                                   |
|  Availability Percentage    | Max Downtime per Year              |
| ---------------------------+------------------------------------  |
|  99% (Two 9s)              | 3.65 days (87.6 hours)             |
|  99.9% (Three 9s)         | 8.76 hours                         |
|  99.95%                   | 4.38 hours                         |
|  99.99% (Four 9s)         | 52.6 minutes                       |
|  99.999% (Five 9s)        | 5.26 minutes                       |
|                                                                   |
|  Calculation:                                                    |
|  Availability = (Total Time - Downtime) / Total Time × 100       |
|                                                                   |
|  Example for 99.99% SLA:                                        |
|  - Per Year: 365 × 24 × 60 × (1 - 0.9999) = 52.6 minutes        |
|  - Per Month: 30 × 24 × 60 × (1 - 0.9999) = 4.3 minutes         |
|  - Per Week: 7 × 24 × 60 × (1 - 0.9999) = 1 minute              |
|                                                                   |
+------------------------------------------------------------------+
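
The downtime figures in the table follow directly from the formula. The sketch below reproduces them; the function names are illustrative, and a year is taken as 365 days, as in the worked example.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability_pct: float, days: float = 365) -> float:
    """Maximum downtime in minutes allowed over `days` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return days * MINUTES_PER_DAY * (1 - availability_pct / 100)


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability = (Total Time - Downtime) / Total Time x 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


for pct in (99, 99.9, 99.95, 99.99, 99.999):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.2f} minutes per year")
```

Passing days=30 or days=7 gives the monthly and weekly budgets from the example.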

46.3 High Availability Patterns

Multi-AZ Architecture

                    Multi-AZ Architecture
+------------------------------------------------------------------+
|                                                                   |
|  Region (us-east-1)                                              |
|  +----------------------------------------------------------+    |
|  |                                                          |    |
|  |  +----------------------------------------------------+  |    |
|  |  |             Application Load Balancer              |  |    |
|  |  +----------------------------------------------------+  |    |
|  |             |                             |              |    |
|  |             v                             v              |    |
|  |    +------------------+          +------------------+    |    |
|  |    | Availability     |          | Availability     |    |    |
|  |    | Zone A           |          | Zone B           |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    |   | EC2      |   |          |   | EC2      |   |    |    |
|  |    |   | Active   |   |          |   | Standby  |   |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    +------------------+          +------------------+    |    |
|  |             |                             |              |    |
|  |             |<----------------------------+              |    |
|  |             v                                            |    |
|  |    +------------------+          +------------------+    |    |
|  |    | RDS Primary      |---sync-->| RDS Standby      |    |    |
|  |    | (Multi-AZ)       |          | (Multi-AZ)       |    |    |
|  |    +------------------+          +------------------+    |    |
|  |                                                          |    |
|  +----------------------------------------------------------+    |
|                                                                   |
+------------------------------------------------------------------+
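
The benefit of running redundant copies across Availability Zones can be estimated with basic probability. The sketch below is a simplification that assumes failures are independent, which real zones only approximate; the availability figures used are illustrative inputs, not published AWS SLA values.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components when any one of them suffices."""
    return 1 - (1 - single) ** copies


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


# Two independent copies at 99% each give roughly 99.99% combined.
print(f"{parallel_availability(0.99, 2):.4%}")
# A redundant tier in front of a single database is limited by the database.
print(f"{serial_availability(parallel_availability(0.99, 2), 0.9995):.4%}")
```

The second line shows why each tier needs its own redundancy: the least available component in the chain caps the availability of the whole request path.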

Active-Active Architecture

                    Active-Active Architecture
+------------------------------------------------------------------+
|                                                                   |
|  Region (us-east-1)                                              |
|  +----------------------------------------------------------+    |
|  |                                                          |    |
|  |  +----------------------------------------------------+  |    |
|  |  |             Application Load Balancer              |  |    |
|  |  |        (Distributes traffic across all AZs)        |  |    |
|  |  +----------------------------------------------------+  |    |
|  |             |                             |              |    |
|  |             v                             v              |    |
|  |    +------------------+          +------------------+    |    |
|  |    | Availability     |          | Availability     |    |    |
|  |    | Zone A           |          | Zone B           |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    |   | EC2      |   |          |   | EC2      |   |    |    |
|  |    |   | Active   |   |          |   | Active   |   |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    |   | EC2      |   |          |   | EC2      |   |    |    |
|  |    |   | Active   |   |          |   | Active   |   |    |    |
|  |    |   +----------+   |          |   +----------+   |    |    |
|  |    +------------------+          +------------------+    |    |
|  |             |                             |              |    |
|  |             +--------------+--------------+              |    |
|  |                            |                             |    |
|  |                            v                             |    |
|  |  +----------------------------------------------------+  |    |
|  |  |             Aurora Database (Multi-AZ)             |  |    |
|  |  |   +--------+  +--------+  +--------+  +--------+   |  |    |
|  |  |   |Primary |  |Replica1|  |Replica2|  |Replica3|   |  |    |
|  |  |   +--------+  +--------+  +--------+  +--------+   |  |    |
|  |  +----------------------------------------------------+  |    |
|  |                                                          |    |
|  +----------------------------------------------------------+    |
|                                                                   |
+------------------------------------------------------------------+

Auto Scaling for HA

# Auto Scaling Group Configuration
Resources:
  WebServerASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LaunchTemplate:
        LaunchTemplateId: !Ref WebServerLaunchTemplate
        Version: !GetAtt WebServerLaunchTemplate.LatestVersionNumber
      MinSize: 3
      MaxSize: 12
      DesiredCapacity: 6
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref WebServerTargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true

  # Target tracking on average CPU (adds and removes capacity as needed)
  ScaleOutPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      # Seconds before a new instance's metrics count toward the group average
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 70

  # Scale on ALB request count
  ScaleOnRequestCount:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebServerASG
      PolicyType: TargetTrackingScaling
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          # Format: app/<alb-name>/<alb-id>/targetgroup/<tg-name>/<tg-id>
          ResourceLabel: !Sub "${ApplicationLoadBalancer.LoadBalancerFullName}/${WebServerTargetGroup.TargetGroupFullName}"
        TargetValue: 1000

46.4 Disaster Recovery Strategies

DR Strategy Comparison

                    DR Strategies
+------------------------------------------------------------------+
|                                                                   |
|  Strategy          | RTO        | RPO        | Cost             |
| -------------------+------------+------------+------------------ |
|  Backup & Restore  | Hours-Days | Hours-Days | $                |
|  Pilot Light       | 10s of min | Minutes    | $$               |
|  Warm Standby      | Minutes    | Minutes    | $$$              |
|  Multi-Region      | Seconds    | Seconds    | $$$$             |
|  Active-Active     | Real-time  | Near-zero  | $$$$$            |
|                                                                   |
+------------------------------------------------------------------+

Backup & Restore

                    Backup & Restore Strategy
+------------------------------------------------------------------+
|                                                                   |
|  Primary Region                                                  |
|  +----------------------------------------------------------+   |
|  |                                                           |   |
|  |  +----------+  +----------+  +----------+               |   |
|  |  | EC2      |  | RDS      |  | S3       |               |   |
|  |  | Instances|  | Database |  | Buckets  |               |   |
|  |  +----+-----+  +----+-----+  +----+-----+               |   |
|  |       |            |            |                         |   |
|  |       v            v            v                         |   |
|  |  +----------------------------------------------------+ |   |
|  |  |              Backup Process                        | |   |
|  |  | - EBS Snapshots                                   | |   |
|  |  | - RDS Snapshots                                   | |   |
|  |  | - S3 Cross-Region Replication                      | |   |
|  |  +----------------------------------------------------+ |   |
|  |       |            |            |                         |   |
|  +-------+------------+------------+-------------------------+   |
|          |            |            |                            |
|          v            v            v                            |
|  +----------------------------------------------------------+   |
|  |                    S3 (Backup Storage)                    |   |
|  +----------------------------------------------------------+   |
|          |            |            |                            |
|          v            v            v                            |
|  DR Region                                                       |
|  +----------------------------------------------------------+   |
|  |                                                           |   |
|  |  Recovery Process:                                        |   |
|  |  1. Restore EBS snapshots to new volumes                  |   |
|  |  2. Restore RDS snapshot to new instance                  |   |
|  |  3. Create EC2 instances from AMI                         |   |
|  |  4. Update DNS to point to DR region                      |   |
|  |                                                           |   |
|  +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

Pilot Light

                    Pilot Light Strategy
+------------------------------------------------------------------+
|                                                                   |
|  Primary Region                    DR Region                     |
|  +------------------------+       +------------------------+     |
|  |                        |       |                        |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Application Tier |  |       |  | Application Tier |  |     |
|  |  |                  |  |       |  |                  |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Stopped|      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Stopped|      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |           |            |       |           ^            |     |
|  |           v            |       |           |            |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Database Tier    |  |       |  | Database Tier    |  |     |
|  |  |                  |  |       |  |                  |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | RDS    |------|--|------>|  |  | RDS    |      |  |     |
|  |  |  | Primary| Async|  | Rep   |  |  | Replica|      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |                        |       |                        |     |
|  +------------------------+       +------------------------+     |
|                                                                   |
|  Failover Process:                                               |
|  1. Promote RDS replica to primary                              |
|  2. Start EC2 instances in DR region                            |
|  3. Update Route53 to point to DR region                        |
|  4. Scale out application tier as needed                        |
|                                                                   |
+------------------------------------------------------------------+
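
The failover steps above can be sketched as an ordered runbook that stops at the first failure and compares the elapsed time with the RTO. This is a minimal sketch: run_failover and the step names are hypothetical, and the actions are placeholders where a real runbook would call the AWS APIs.

```python
import time
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_failover(steps: list[Step], rto_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Run steps in order, stop at the first failure, and compare elapsed time to the RTO."""
    start = clock()
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:  # a failed step halts the runbook for operator review
            return {"ok": False, "failed_step": name, "error": str(exc),
                    "completed": completed}
        completed.append(name)
    elapsed = clock() - start
    return {"ok": True, "completed": completed, "elapsed": elapsed,
            "within_rto": elapsed <= rto_seconds}


# Placeholder actions only; each lambda stands in for a real API call.
steps: list[Step] = [
    ("promote RDS replica", lambda: None),
    ("start EC2 instances in DR region", lambda: None),
    ("update Route53 to point to DR region", lambda: None),
    ("scale out application tier", lambda: None),
]
print(run_failover(steps, rto_seconds=3600))
```

Running the same runbook in regular drills gives a measured recovery time to compare against the RTO, instead of an estimate.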

Warm Standby

                    Warm Standby Strategy
+------------------------------------------------------------------+
|                                                                   |
|  Primary Region                    DR Region                     |
|  +------------------------+       +------------------------+     |
|  |                        |       |                        |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Application Tier |  |       |  | Application Tier |  |     |
|  |  |                  |  |       |  | (Scaled Down)    |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Active |      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Active |      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |           |            |       |           ^            |     |
|  |           v            |       |           |            |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Database Tier    |  |       |  | Database Tier    |  |     |
|  |  |                  |  |       |  |                  |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | RDS    |------|--|------>|  |  | RDS    |      |  |     |
|  |  |  | Primary| Async|  | Rep   |  |  | Replica|      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |                        |       |                        |     |
|  +------------------------+       +------------------------+     |
|                                                                   |
|  Failover Process:                                               |
|  1. Promote RDS replica to primary                              |
|  2. Scale out EC2 instances in DR region                        |
|  3. Verify Route53 health checks report DR as healthy            |
|  4. Route53 failover routing sends traffic to DR                 |
|                                                                   |
+------------------------------------------------------------------+

Multi-Region Active-Active

                    Multi-Region Active-Active
+------------------------------------------------------------------+
|                                                                   |
|  Route53                                                         |
|  +----------------------------------------------------------+   |
|  |              Latency-Based Routing                        |   |
|  +----------------------------------------------------------+   |
|           |                        |                            |
|           v                        v                            |
|  Primary Region                    DR Region                     |
|  +------------------------+       +------------------------+     |
|  |                        |       |                        |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Application Tier |  |       |  | Application Tier |  |     |
|  |  |                  |  |       |  |                  |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Active |      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | EC2    |      |  |       |  |  | EC2    |      |  |     |
|  |  |  | Active |      |  |       |  |  | Active |      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |           |            |       |           |            |     |
|  |           v            |       |           v            |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |  | Database Tier    |  |       |  | Database Tier    |  |     |
|  |  |                  |  |       |  |                  |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  |  | Aurora |------|--|------>|  |  | Aurora |      |  |     |
|  |  |  | Primary| Async|  | Rep   |  |  | Replica|      |  |     |
|  |  |  +--------+      |  |       |  |  +--------+      |  |     |
|  |  +------------------+  |       |  +------------------+  |     |
|  |                        |       |                        |     |
|  +------------------------+       +------------------------+     |
|                                                                   |
|  Features:                                                       |
|  - Global load balancing with Route53                           |
|  - Aurora Global Database (single writer, async replication)     |
|  - DynamoDB Global Tables for NoSQL                             |
|  - S3 Cross-Region Replication                                  |
|                                                                   |
+------------------------------------------------------------------+

46.5 Database HA & DR

RDS Multi-AZ

# RDS Multi-AZ Configuration
Resources:
  PrimaryDB:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      Engine: postgres
      EngineVersion: "14.7"
      DBInstanceClass: db.r6g.xlarge
      AllocatedStorage: 100
      StorageType: gp3
      MultiAZ: true  # RDS places the standby in another AZ; do not set AvailabilityZone
      MasterUsername: dbadmin  # example name
      ManageMasterUserPassword: true  # RDS keeps the password in Secrets Manager
      DBSubnetGroupName: !Ref DBSubnetGroup
      VPCSecurityGroups:
        - !Ref DBSecurityGroup
      BackupRetentionPeriod: 7
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  DBSubnetGroup:
    Type: AWS::RDS::DBSubnetGroup
    Properties:
      DBSubnetGroupDescription: DB subnet group
      SubnetIds:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
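
During a Multi-AZ failover the database endpoint is briefly unreachable while it switches to the standby, typically for one to two minutes, so clients need to reconnect rather than fail outright. The sketch below shows one common approach, capped exponential backoff. It is a generic helper under stated assumptions: call_with_retry is a hypothetical name, and the operation is assumed to raise ConnectionError while the endpoint is down.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T], attempts: int = 8,
                    base_delay: float = 1.0, max_delay: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `operation` on ConnectionError with capped exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Delays grow 1, 2, 4, ... seconds and are capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")
```

With the defaults the helper waits 1 + 2 + 4 + 8 + 16 + 30 + 30 = 91 seconds in total before giving up, which roughly covers a typical failover window. The operation passed in should open a fresh connection on each call so that the endpoint is resolved again.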

Aurora Global Database

# Aurora Global Database
Resources:
  AuroraCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      DatabaseName: appdb
      MasterUsername: dbadmin  # "admin" is a reserved word for PostgreSQL engines
      MasterUserPassword: !Ref DBPassword
      DBClusterParameterGroupName: default.aurora-postgresql14
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroup
      DBSubnetGroupName: !Ref DBSubnetGroup
      EnableCloudwatchLogsExports:
        - postgresql
      BackupRetentionPeriod: 35
      PreferredBackupWindow: "03:00-04:00"
      PreferredMaintenanceWindow: "sun:04:00-sun:05:00"
      DeletionProtection: true
      StorageEncrypted: true
      KmsKeyId: !Ref DBKMSKey

  # Primary region instances
  AuroraPrimaryInstance1:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1a

  AuroraPrimaryInstance2:
    Type: AWS::RDS::DBInstance
    Properties:
      DBClusterIdentifier: !Ref AuroraCluster
      Engine: aurora-postgresql
      DBInstanceClass: db.r6g.xlarge
      AvailabilityZone: us-east-1b

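  # Global cluster (engine and encryption settings are inherited from the source cluster)
  AuroraGlobalCluster:
    Type: AWS::RDS::GlobalCluster
    Properties:
      GlobalClusterIdentifier: app-global-cluster
      SourceDBClusterIdentifier: !Ref AuroraCluster

  # Secondary cluster. A CloudFormation stack is tied to one Region, so this
  # resource belongs in a separate stack deployed to us-west-2.
  AuroraSecondaryCluster:
    Type: AWS::RDS::DBCluster
    Properties:
      Engine: aurora-postgresql
      EngineVersion: "14.7"
      GlobalClusterIdentifier: app-global-cluster  # literal value: !Ref cannot cross stacks
      DBSubnetGroupName: !Ref DBSubnetGroupDR
      VpcSecurityGroupIds:
        - !Ref DBSecurityGroupDR
      # Assumed KMS key in us-west-2; an encrypted secondary needs a key in its own Region
      KmsKeyId: !Ref DBKMSKeyDR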

DynamoDB Global Tables

# DynamoDB Global Table
Resources:
  GlobalTable:
    Type: AWS::DynamoDB::GlobalTable
    Properties:
      TableName: ApplicationData
      AttributeDefinitions:
        - AttributeName: PK
          AttributeType: S
        - AttributeName: SK
          AttributeType: S
      KeySchema:
        - AttributeName: PK
          KeyType: HASH
        - AttributeName: SK
          KeyType: RANGE
      BillingMode: PAY_PER_REQUEST
      StreamSpecification:
        StreamViewType: NEW_AND_OLD_IMAGES
      Replicas:
        - Region: us-east-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: us-west-2
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
        - Region: eu-west-1
          PointInTimeRecoverySpecification:
            PointInTimeRecoveryEnabled: true
      SSESpecification:
        SSEEnabled: true
        SSEType: KMS
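
Because every replica of a global table accepts writes, two Regions can update the same item at nearly the same time. DynamoDB global tables reconcile such conflicts with a last-writer-wins policy. The sketch below illustrates the idea only and is not DynamoDB's internal implementation: the Version type, its fields, and the region-name tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    updated_at: float  # write timestamp in seconds (hypothetical field)
    region: str


def last_writer_wins(a: Version, b: Version) -> Version:
    """Keep the later write; break ties on region name so every replica agrees."""
    return max(a, b, key=lambda v: (v.updated_at, v.region))


east = Version("shipped", updated_at=100.0, region="us-east-1")
west = Version("cancelled", updated_at=100.5, region="us-west-2")
print(last_writer_wins(east, west).value)
```

The earlier write is silently discarded, so applications that cannot tolerate lost updates usually route all writes for a given item to a single Region.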

46.6 Application HA & DR

Load Balancer Configuration

# Multi-AZ Load Balancer
Resources:
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Name: web-app-alb
      Type: application
      Scheme: internet-facing
      IpAddressType: ipv4
      SecurityGroups:
        - !Ref ALBSecurityGroup
      Subnets:
        - subnet-az-a
        - subnet-az-b
        - subnet-az-c
      LoadBalancerAttributes:
        - Key: idle_timeout.timeout_seconds
          Value: "60"
        - Key: deletion_protection.enabled
          Value: "true"
        - Key: routing.http2.enabled
          Value: "true"

  # Target group with health checks
  WebServerTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: web-servers
      Port: 80
      Protocol: HTTP
      VpcId: !Ref VPC
      HealthCheckEnabled: true
      HealthCheckIntervalSeconds: 30
      HealthCheckPath: /health
      HealthCheckPort: 80
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 10
      HealthyThresholdCount: 3
      UnhealthyThresholdCount: 3
      TargetGroupAttributes:
        - Key: deregistration_delay.timeout_seconds
          Value: "30"
        - Key: stickiness.enabled
          Value: "false"

  # HTTP listener that forwards requests to the target group
  HTTPListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref WebServerTargetGroup
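
The health check settings determine how quickly the load balancer stops sending traffic to a failed instance. The sketch below estimates that window from the interval and threshold values used in the target group above. The HealthCheck class is a hypothetical helper, and the validation assumes the timeout must be shorter than the interval.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    interval_s: int
    timeout_s: int
    healthy_threshold: int
    unhealthy_threshold: int

    def validate(self) -> None:
        """Reject settings where a check could still be pending when the next one starts."""
        if self.timeout_s >= self.interval_s:
            raise ValueError("timeout must be shorter than the interval")

    def time_to_unhealthy_s(self) -> int:
        """Approximate seconds of consecutive failures before a target is marked unhealthy."""
        return self.interval_s * self.unhealthy_threshold

    def time_to_healthy_s(self) -> int:
        """Approximate seconds of consecutive successes before a target receives traffic."""
        return self.interval_s * self.healthy_threshold


check = HealthCheck(interval_s=30, timeout_s=10, healthy_threshold=3, unhealthy_threshold=3)
check.validate()
print(check.time_to_unhealthy_s())  # 30 s x 3 checks = 90 s
```

A 90-second detection window consumes a large share of a 15-minute RTO, so mission-critical tiers often use a shorter interval or a lower threshold.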

Route53 Health Checks & Failover

# Route53 Failover Configuration
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: primary.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: primary-health-check

  DRHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        FullyQualifiedDomainName: dr.example.com
        Port: 443
        ResourcePath: /health
        FailureThreshold: 3
        RequestInterval: 30
        MeasureLatency: true
      HealthCheckTags:
        - Key: Name
          Value: dr-health-check

  # Primary record set
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt ApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt ApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
      SetIdentifier: primary

  # DR record set
  DRRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: www.example.com
      Type: A
      AliasTarget:
        HostedZoneId: !GetAtt DRApplicationLoadBalancer.CanonicalHostedZoneID
        DNSName: !GetAtt DRApplicationLoadBalancer.DNSName
        EvaluateTargetHealth: true
      Failover: SECONDARY
      HealthCheckId: !Ref DRHealthCheck
      SetIdentifier: dr
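
The failover routing decision and the approximate time to fail over can be sketched as follows. The function names are illustrative. The both-unhealthy case reflects the documented Route 53 behavior of answering with the primary record when no record is healthy, and the 60-second TTL is an assumption for alias records that point at a load balancer.

```python
def resolve_failover(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Return which record a failover routing policy would answer with."""
    if primary_healthy:
        return "primary"
    if secondary_healthy:
        return "secondary"
    # Neither record is healthy: fall back to the primary.
    return "primary"


def estimated_failover_seconds(request_interval: int, failure_threshold: int,
                               dns_ttl: int) -> int:
    """Detection time plus the time for cached DNS answers to expire."""
    return request_interval * failure_threshold + dns_ttl


print(resolve_failover(primary_healthy=False, secondary_healthy=True))
# RequestInterval 30 and FailureThreshold 3 from the template, plus a 60 s TTL.
print(estimated_failover_seconds(30, 3, dns_ttl=60))
```

The estimate covers only DNS. Promoting the database and scaling the DR application tier add their own time on top of it.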

46.7 Backup Strategies

AWS Backup Configuration

# AWS Backup Configuration
Resources:
  BackupVault:
    Type: AWS::Backup::BackupVault
    Properties:
      BackupVaultName: ApplicationBackupVault
      EncryptionKeyArn: !GetAtt BackupKMSKey.Arn

  # Daily backup plan
  DailyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: DailyBackupPlan
        BackupPlanRule:
          - RuleName: DailyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * * *)"  # Daily at 5 AM UTC
            StartWindowMinutes: 60
            CompletionWindowMinutes: 120
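            Lifecycle:
              # No cold-storage transition here: AWS Backup requires DeleteAfterDays
              # to be at least 90 days greater than MoveToColdStorageAfterDays
              DeleteAfterDays: 30
            CopyActions:
              # Must resolve to the ARN of a backup vault in the DR region
              - DestinationBackupVaultArn: !Ref DRBackupVault
                Lifecycle:
                  DeleteAfterDays: 90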

  # Weekly backup plan
  WeeklyBackupPlan:
    Type: AWS::Backup::BackupPlan
    Properties:
      BackupPlan:
        BackupPlanName: WeeklyBackupPlan
        BackupPlanRule:
          - RuleName: WeeklyBackupRule
            TargetBackupVault: !Ref BackupVault
            ScheduleExpression: "cron(0 5 ? * SUN *)"  # Weekly on Sunday
            StartWindowMinutes: 60
            CompletionWindowMinutes: 240
            Lifecycle:
              DeleteAfterDays: 365
              MoveToColdStorageAfterDays: 30

  # Backup selection
  BackupSelection:
    Type: AWS::Backup::BackupSelection
    Properties:
      BackupPlanId: !Ref DailyBackupPlan
      BackupSelection:
        SelectionName: ApplicationResources
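        IamRoleArn: !GetAtt BackupServiceRole.Arn
        # Resources expects ARNs; !Ref on these resource types returns an ID, not an ARN.
        # Assumes PrimaryDB is an AWS::RDS::DBInstance.
        Resources:
          - !Sub "arn:${AWS::Partition}:rds:${AWS::Region}:${AWS::AccountId}:db:${PrimaryDB}"
          - !Sub "arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume1}"
          - !Sub "arn:${AWS::Partition}:ec2:${AWS::Region}:${AWS::AccountId}:volume/${EBSVolume2}"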
        ListOfTags:
          - ConditionType: STRINGEQUALS
            ConditionKey: Backup
            ConditionValue: required
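
AWS Backup rejects a lifecycle whose retention is less than 90 days longer than its cold-storage transition. The sketch below checks that constraint before a template is deployed. The function and constant names are hypothetical, and the rule is stated as understood from the AWS Backup lifecycle documentation.

```python
from typing import Optional

# Minimum number of days a recovery point must remain in cold storage
MIN_COLD_STORAGE_DAYS = 90


def lifecycle_is_valid(delete_after_days: Optional[int],
                       move_to_cold_after_days: Optional[int]) -> bool:
    """Return True if the lifecycle satisfies the cold-storage retention rule."""
    # The rule only applies when both settings are present
    if delete_after_days is None or move_to_cold_after_days is None:
        return True
    return delete_after_days >= move_to_cold_after_days + MIN_COLD_STORAGE_DAYS


print(lifecycle_is_valid(365, 30))  # True: 365 >= 30 + 90
print(lifecycle_is_valid(30, 7))    # False: 30 < 7 + 90
```

A check like this fits in a CI step that lints templates, so an invalid retention setting is caught before the stack update fails.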

S3 Cross-Region Replication

# S3 Cross-Region Replication
Resources:
  PrimaryBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: primary-data-bucket
      VersioningConfiguration:
        Status: Enabled
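      ReplicationConfiguration:
        Role: !GetAtt ReplicationRole.Arn
        Rules:
          - Id: ReplicateAll
            Status: Enabled
            Priority: 1
            Filter:
              Prefix: ''  # Empty prefix matches every object; a Filter is required alongside Priority
            DeleteMarkerReplication:
              Status: Enabled
            Destination:
              Bucket: !GetAtt DRBucket.Arn  # Destination must be a bucket ARN, not a bucket name
              StorageClass: STANDARD
              ReplicationTime:
                Status: Enabled
                Time:
                  Minutes: 15  # Replication Time Control target of 15 minutes
              Metrics:
                Status: Enabled
                EventThreshold:
                  Minutes: 15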
      LifecycleConfiguration:
        Rules:
          - Id: TransitionToGlacier
            Status: Enabled
            Transitions:
              - TransitionInDays: 90
                StorageClass: GLACIER

  DRBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: dr-data-bucket
      VersioningConfiguration:
        Status: Enabled
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true

46.8 Failover Automation

Lambda Failover Function

import boto3
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """
    Automated failover function
    """

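    # Parse event
    primary_region = event.get('primary_region', 'us-east-1')
    dr_region = event.get('dr_region', 'us-west-2')
    db_cluster = event.get('db_cluster')
    hosted_zone_id = event.get('hosted_zone_id')
    record_name = event.get('record_name')
    asg_name = event.get('asg_name')
    # Alias target of the DR load balancer; these event keys are an assumed convention
    dr_alb_hosted_zone_id = event.get('dr_alb_hosted_zone_id')
    dr_alb_dns_name = event.get('dr_alb_dns_name')

    # RDS and Auto Scaling are regional, so target the DR region explicitly.
    # Route 53 is a global service.
    rds = boto3.client('rds', region_name=dr_region)
    route53 = boto3.client('route53')
    autoscaling = boto3.client('autoscaling', region_name=dr_region)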

    try:
        # 1. Promote RDS replica to primary
        logger.info(f"Promoting RDS replica in {dr_region}")
        rds.promote_read_replica_db_cluster(
            DBClusterIdentifier=db_cluster
        )

        # 2. Scale out DR Auto Scaling Group
        logger.info(f"Scaling out ASG {asg_name}")
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=6,
            HonorCooldown=False
        )

        # 3. Update Route53 record
        logger.info(f"Updating Route53 record {record_name}")
        route53.change_resource_record_sets(
            HostedZoneId=hosted_zone_id,
            ChangeBatch={
                'Changes': [
                    {
                        'Action': 'UPSERT',
                        'ResourceRecordSet': {
                            'Name': record_name,
                            'Type': 'A',
                            'AliasTarget': {
                                'HostedZoneId': dr_alb_hosted_zone_id,
                                'DNSName': dr_alb_dns_name,
                                'EvaluateTargetHealth': True
                            }
                        }
                    }
                ]
            }
        )

        # 4. Report success to the caller
        logger.info("Failover completed successfully")
        return {
            'statusCode': 200,
            'body': json.dumps({
                'message': 'Failover completed',
                'primary_region': primary_region,
                'dr_region': dr_region
            })
        }

    except Exception as e:
        # logger.exception records the stack trace along with the message
        logger.exception(f"Failover failed: {str(e)}")
        return {
            'statusCode': 500,
            'body': json.dumps({
                'message': 'Failover failed',
                'error': str(e)
            })
        }
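
Promoting a replica cluster is asynchronous, so repointing DNS immediately after the promote call can send traffic to a database that is not yet writable. One option is to poll the cluster before the Route 53 step, as sketched below. The helper name, the timeout defaults, and the readiness test (status available and no replication source) are assumptions, not part of the function above.

```python
import time


def wait_for_cluster_available(rds_client, cluster_id: str,
                               timeout_seconds: int = 600,
                               poll_seconds: int = 15,
                               sleep=time.sleep) -> bool:
    """Poll until the promoted cluster is standalone and available.

    Returns True when ready, False if the timeout elapses first.
    """
    waited = 0
    while True:
        response = rds_client.describe_db_clusters(DBClusterIdentifier=cluster_id)
        cluster = response["DBClusters"][0]
        # A promoted cluster no longer reports a replication source
        standalone = not cluster.get("ReplicationSourceIdentifier")
        if cluster["Status"] == "available" and standalone:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Inside a Lambda function the timeout has to stay below the function's own timeout, so a longer wait is usually better handled by a Step Functions state machine that retries the check.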

CloudWatch Events for Failover

# CloudWatch Events for automated failover
Resources:
  FailoverTrigger:
    Type: AWS::Events::Rule
    Properties:
      Name: FailoverTrigger
      Description: Trigger failover on primary region failure
      EventPattern:
        source:
          - aws.health
        detail-type:
          - AWS Health Event
        detail:
          eventTypeCategory:
            - issue
            - accountNotification
          service:
            - EC2
            - RDS
          statusCode:
            - open
      State: ENABLED
      Targets:
        - Id: FailoverFunction
          Arn: !GetAtt FailoverFunction.Arn  # !Ref on a Lambda function returns its name, not its ARN
          InputTransformer:
            InputPathsMap:
              region: $.region
              service: $.detail.service
            InputTemplate: |
              {
                "primary_region": "us-east-1",
                "dr_region": "us-west-2",
                "service": "<service>",
                "trigger": "health_event"
              }

46.9 Testing & Validation

DR Testing Framework

                    DR Testing Framework
+------------------------------------------------------------------+
|                                                                   |
|  Test Types                                                      |
|  +----------------------------------------------------------+   |
|  |                                                           |   |
|  |  1. Tabletop Exercise                                     |   |
|  |     +-------------------------------------------------+  |   |
|  |     | - Walk through DR procedures                      |  |   |
|  |     | - Identify gaps and issues                        |  |   |
|  |     | - Update documentation                            |  |   |
|  |     +-------------------------------------------------+  |   |
|  |                                                           |   |
|  |  2. Component Testing                                     |   |
|  |     +-------------------------------------------------+  |   |
|  |     | - Test individual components                       |  |   |
|  |     | - Verify backup/restore                            |  |   |
|  |     | - Validate replication                             |  |   |
|  |     +-------------------------------------------------+  |   |
|  |                                                           |   |
|  |  3. Simulation Testing                                   |   |
|  |     +-------------------------------------------------+  |   |
|  |     | - Simulate failure scenarios                       |  |   |
|  |     | - Test failover procedures                         |  |   |
|  |     | - Measure RTO/RPO                                 |  |   |
|  |     +-------------------------------------------------+  |   |
|  |                                                           |   |
|  |  4. Full-Scale Testing                                   |   |
|  |     +-------------------------------------------------+  |   |
|  |     | - Complete failover to DR                          |  |   |
|  |     | - Full production traffic                          |  |   |
|  |     | - Validate all systems                             |  |   |
|  |     +-------------------------------------------------+  |   |
|  |                                                           |   |
|  +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
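
Measuring RTO and RPO during a simulation comes down to three timestamps: when the failure began, when service was restored, and when the last recoverable write was committed. The sketch below computes both and compares them with targets. The function names, timestamps, and targets are made-up drill values for illustration.

```python
from datetime import datetime, timedelta


def achieved_rto(failure_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime actually experienced: failure start to service restored."""
    return restored_at - failure_at


def achieved_rpo(failure_at: datetime, last_recoverable_write_at: datetime) -> timedelta:
    """Data actually lost: window between the last recoverable write and the failure."""
    return failure_at - last_recoverable_write_at


def meets_target(achieved: timedelta, target: timedelta) -> bool:
    return achieved <= target


# Hypothetical timestamps recorded during a drill
failure = datetime(2024, 1, 15, 10, 0, 0)
restored = datetime(2024, 1, 15, 10, 42, 0)
last_write = datetime(2024, 1, 15, 9, 55, 0)

rto = achieved_rto(failure, restored)
rpo = achieved_rpo(failure, last_write)
print(rto, meets_target(rto, timedelta(hours=1)))    # 0:42:00 True
print(rpo, meets_target(rpo, timedelta(minutes=1)))  # 0:05:00 False
```

Recording these values for every drill gives a trend line, which is more useful than a single pass or fail result.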

Chaos Engineering with AWS FIS

# AWS Fault Injection Simulator Experiment
Resources:
  FISExperimentRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: fis.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        # EC2 actions such as terminate-instances need the EC2 access policy
        - arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access

  AZFailureExperiment:
    Type: AWS::FIS::ExperimentTemplate
    Properties:
      Description: Simulate AZ failure
      RoleArn: !GetAtt FISExperimentRole.Arn
      Targets:
        Instances:
          ResourceType: aws:ec2:instance
          ResourceTags:
            Environment: production
          # Restrict the target set to a single Availability Zone
          Filters:
            - Path: Placement.AvailabilityZone
              Values:
                - us-east-1a
          # Terminates one instance; use ALL to take out every matching instance in the AZ
          SelectionMode: COUNT(1)
      Actions:
        TerminateInstance:
          ActionId: aws:ec2:terminate-instances
          Targets:
            Instances: Instances
      StopConditions:
        - Source: aws:cloudwatch:alarm
          # Stop conditions take the alarm ARN; !Ref on an alarm returns its name
          Value: !GetAtt HighErrorRateAlarm.Arn
      Tags:
        Name: AZ-Failure-Test
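
Once the template exists, an experiment can be started and watched from a script, for instance as part of a scheduled game day. The sketch below uses the FIS start_experiment and get_experiment calls through a client that is passed in. The function name, polling defaults, and the set of terminal states are assumptions based on the FIS API as documented for boto3.

```python
import time
import uuid

# States after which the experiment will not change again
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis_client, template_id: str,
                   poll_seconds: int = 10,
                   max_polls: int = 180,
                   sleep=time.sleep) -> str:
    """Start an experiment from a template and return its final status."""
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # Idempotency token
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]

    for _ in range(max_polls):
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Experiment {experiment_id} did not finish in time")
```

In real use the client would come from `boto3.client("fis")`. A status of stopped typically means a stop condition fired, which is itself a finding worth recording.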

46.10 Why This Matters in DevOps/SRE

HA/DR is the backbone of reliability engineering. SREs must design systems that survive failures and recover quickly to meet SLOs.

                    HA/DR in DevOps/SRE
+------------------------------------------------------------------+
|                                                                   |
|    SRE Reliability Engineering:                                    |
|                                                                   |
|    1. Error Budget Protection                                      |
|    +----------------------------------------------------------+   |
|    |  - HA reduces unplanned downtime                         |   |
|    |  - DR minimizes recovery time                           |   |
|    |  - Both protect error budgets                           |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    2. Incident Response                                            |
|    +----------------------------------------------------------+   |
|    |  - Runbooks for failover procedures                      |   |
|    |  - Automated detection and recovery                      |   |
|    |  - Clear RTO/RPO targets                                |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    3. Chaos Engineering                                           |
|    +----------------------------------------------------------+   |
|    |  - Regularly test failover procedures                   |   |
|    |  - Game days for team training                          |   |
|    |  - Build confidence in recovery                         |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+
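
An error budget is simply the downtime an SLO permits over a window, so the link between HA/DR and the budget can be made concrete with a little arithmetic. The sketch below converts an availability SLO into allowed minutes and subtracts downtime already spent. The function names, the 30-day window, and the sample figures are assumptions for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO permits over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining_minutes(slo_percent: float,
                             downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Error budget left after the downtime already incurred (negative if overspent)."""
    return allowed_downtime_minutes(slo_percent, window_days) - downtime_minutes


# A 99.9% SLO over 30 days allows about 43.2 minutes of downtime
print(f"{allowed_downtime_minutes(99.9):.1f}")

# A single hypothetical 42-minute failover would consume nearly all of it
print(f"{budget_remaining_minutes(99.9, 42):.1f}")
```

This is why a recovery that takes tens of minutes is hard to reconcile with a 99.9% monthly SLO: one incident can exhaust the whole budget.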

46.11 Linux Systems Perspective

HA/DR Automation

#!/bin/bash
# Test failover
set -euo pipefail

# Check the status of the current primary instance
PRIMARY=$(aws rds describe-db-instances \
    --db-instance-identifier my-db \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text)

echo "Primary status: $PRIMARY"

# Trigger failover to the replica
aws rds failover-db-cluster \
    --db-cluster-identifier my-cluster \
    --target-db-instance-identifier my-replica

46.12 Common Mistakes & Anti-Patterns

                    HA/DR Anti-Patterns
+------------------------------------------------------------------+
|                                                                   |
|    ❌ Mistake 1: Not Testing DR Procedures                           |
|    +----------------------------------------------------------+   |
|    |  Problem: Assumes failover works without testing          |   |
|    |  Impact: Real disasters expose untested plans            |   |
|    |  Fix: Regular game days and chaos experiments          |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 2: Single AZ Deployments                                |
|    +----------------------------------------------------------+   |
|    |  Problem: All resources in one AZ                        |   |
|    |  Impact: AZ failure takes everything down               |   |
|    |  Fix: Deploy across multiple AZs                        |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 3: Not Backing Up Data                                     |
|    +----------------------------------------------------------+   |
|    |  Problem: No backups or infrequent backups               |   |
|    |  Impact: Data loss on failure                           |   |
|    |  Fix: Automated backups with tested restores            |   |
|    +----------------------------------------------------------+   |
|                                                                   |
|    ❌ Mistake 4: Setting Unrealistic RTO/RPO                         |
|    +----------------------------------------------------------+   |
|    |  Problem: Targets not matching business needs           |   |
|    |  Impact: Either overspend or miss recovery targets      |   |
|    |  Fix: Align with business requirements and budget       |   |
|    +----------------------------------------------------------+   |
|                                                                   |
+------------------------------------------------------------------+

46.13 Interview Questions

Conceptual Questions

Q: Explain RTO vs RPO.
- A: RTO (Recovery Time Objective) is the maximum downtime you can tolerate. RPO (Recovery Point Objective) is the maximum data loss you can tolerate, measured as a window of time before the failure. Together they drive your DR strategy and architecture choices.

Q: What’s the difference between active-passive and active-active DR?
- A: Active-passive: the secondary site stays idle or scaled down until failover. It is cheaper, but recovery is slower. Active-active: both sites serve traffic simultaneously. It is more expensive, but failover is near-instant because the surviving site is already serving traffic.

Scenario-Based Questions

Q: Design a multi-region disaster recovery strategy.
- A: Use primary/secondary: (1) Primary in one region, DR in another, (2) Async replication for databases, (3) Automated failover with Route53 health checks, (4) Regular DR drills, (5) Documented runbooks.

46.14 Best Practices & DR Checklist

HA Design Principles

                    HA Best Practices
+------------------------------------------------------------------+
|                                                                   |
|  1. Design for Failure                                           |
|     +--------------------------------------------------------+    |
|     | - Assume components will fail                          |    |
|     | - Implement redundancy at all layers                   |    |
|     | - Use managed services with built-in HA                |    |
|     +--------------------------------------------------------+    |
|                                                                   |
|  2. Implement Health Checks                                      |
|     +--------------------------------------------------------+    |
|     | - Application-level health checks                      |    |
|     | - Load balancer health checks                          |    |
|     | - Route53 health checks                                |    |
|     | - Auto Scaling health checks                           |    |
|     +--------------------------------------------------------+    |
|                                                                   |
|  3. Automate Recovery                                            |
|     +--------------------------------------------------------+    |
|     | - Auto Scaling for self-healing                        |    |
|     | - Automated failover for databases                     |    |
|     | - Route53 automatic failover                           |    |
|     +--------------------------------------------------------+    |
|                                                                   |
|  4. Test Regularly                                               |
|     +--------------------------------------------------------+    |
|     | - Conduct regular DR tests                             |    |
|     | - Use chaos engineering                                |    |
|     | - Document and improve                                 |    |
|     +--------------------------------------------------------+    |
|                                                                   |
+------------------------------------------------------------------+

DR Checklist

# DR Readiness Checklist

## Infrastructure
- [ ] Multi-AZ deployment for all tiers
- [ ] Cross-region replication configured
- [ ] Automated backups enabled
- [ ] Backup retention meets RPO requirements

## Database
- [ ] Multi-AZ RDS or Aurora configured
- [ ] Read replicas in DR region
- [ ] Cross-region replication tested
- [ ] Point-in-time recovery enabled

## Application
- [ ] Stateless application design
- [ ] Auto Scaling configured
- [ ] Load balancer health checks
- [ ] Blue/green deployment capability

## Network
- [ ] Route53 health checks configured
- [ ] Failover routing policies
- [ ] Cross-region VPC peering
- [ ] VPN/Direct Connect redundancy

## Operations
- [ ] DR runbook documented
- [ ] Failover automation tested
- [ ] Communication plan established
- [ ] Regular DR drills scheduled

46.15 Key Takeaways

Topic	Key Points
RTO/RPO	Define recovery objectives before designing
Multi-AZ	Deploy across multiple AZs for HA
DR Strategies	Choose strategy based on RTO/RPO requirements
Automation	Automate failover and recovery processes
Testing	Regularly test DR procedures
Monitoring	Implement comprehensive health checks

46.16 References

46.17 Exam Tips

                    Key Exam Points
+------------------------------------------------------------------+
|                                                                   |
|  1. RTO (Recovery Time Objective): Max acceptable downtime       |
|                                                                   |
|  2. RPO (Recovery Point Objective): Max acceptable data loss    |
|                                                                   |
|  3. DR Strategies: Backup & Restore, Pilot Light, Warm Standby, |
|     Active-Active                                                |
|                                                                   |
|  4. Multi-AZ vs Multi-Region: AZ for HA, Region for DR          |
|                                                                   |
|  5. AWS Services: RDS Multi-AZ, Aurora Global, Route 53, ELB  |
|                                                                   |
|  6. Failover: Automatic vs Manual trigger                       |
|                                                                   |
|  7. Data Replication: Synchronous vs Asynchronous              |
|                                                                   |
|  8. Testing: Use AWS FIS for chaos engineering                  |
|                                                                   |
|  9. AWS Backup: Centralized backup management                   |
|                                                                   |
|  10. Cost vs Recovery: Balance requirements with budget        |
|                                                                   |
+------------------------------------------------------------------+

46.18 Summary

                    Chapter 46 Summary
+------------------------------------------------------------------+
|                                                                   |
|  High Availability & Disaster Recovery                            |
|  +------------------------------------------------------------+  |
|  | - RTO/RPO: Define recovery objectives                      |  |
|  | - Multi-AZ: High availability within region                |  |
|  | - Multi-Region: Disaster recovery                          |  |
|  | - Automation: Automated failover and recovery               |  |
|  +------------------------------------------------------------+  |
|                                                                   |
|  DR Strategies                                                    |
|  +------------------------------------------------------------+  |
|  | - Backup & Restore: Low cost, high RTO                     |  |
|  | - Pilot Light: Minimal core services                       |  |
|  | - Warm Standby: Scaled-down production                     |  |
|  | - Active-Active: Full production in multiple regions       |  |
|  +------------------------------------------------------------+  |
|                                                                   |
|  Best Practices                                                    |
|  +------------------------------------------------------------+  |
|  | - Define RTO/RPO before design                              |  |
|  | - Test failover regularly                                  |  |
|  | - Automate recovery processes                              |  |
|  | - Document runbooks                                        |  |
|  +------------------------------------------------------------+  |
|                                                                   |
+------------------------------------------------------------------+

Next Chapter: Chapter 47 - Cost Optimization & FinOps