
Disaster Recovery Planning

Comprehensive Disaster Recovery Strategies for Linux Systems


Disaster recovery (DR) is the process of restoring IT infrastructure and operations after a catastrophic event. Unlike a routine restore from backup, DR addresses site-wide failures and aims to minimize business downtime.

DISASTER RECOVERY CONCEPTS
+------------------------------------------------------------+
|                                                            |
|  DISASTER TYPES                                            |
|                                                            |
|  +---------------+     +---------------+                   |
|  | NATURAL       |     | HUMAN         |                   |
|  +---------------+     +---------------+                   |
|  | - Earthquake  |     | - Cyberattack |                   |
|  | - Flood       |     | - Terrorism   |                   |
|  | - Fire        |     | - Sabotage    |                   |
|  | - Lightning   |     | - Human error |                   |
|  | - Storm       |     | - War         |                   |
|  +---------------+     +---------------+                   |
|                                                            |
|  +---------------+     +-----------------+                 |
|  | TECHNICAL     |     | FACILITY        |                 |
|  +---------------+     +-----------------+                 |
|  | - Hardware    |     | - Power outage  |                 |
|  |   failure     |     | - HVAC failure  |                 |
|  | - Ransomware  |     | - Building      |                 |
|  | - Software    |     |   damage        |                 |
|  |   corruption  |     | - Access denial |                 |
|  +---------------+     +-----------------+                 |
|                                                            |
+------------------------------------------------------------+
DISASTER RECOVERY METRICS
+------------------------------------------------------------+
|                                                            |
|  RTO - RECOVERY TIME OBJECTIVE                             |
|  -----------------------------                             |
|                                                            |
|  Disaster ---------------------------> System Online       |
|      |                                      |              |
|      +---------------- RTO ----------------+               |
|                                                            |
|  Maximum acceptable downtime                               |
|                                                            |
|  RPO - RECOVERY POINT OBJECTIVE                            |
|  ------------------------------                            |
|                                                            |
|  Last Backup ---------> Disaster ----> Recovery            |
|       |                    |                               |
|       +-------- RPO -------+                               |
|                                                            |
|  Maximum acceptable data loss (time)                       |
|                                                            |
|  TIER CLASSIFICATION                                       |
|  -------------------                                       |
|                                                            |
|  Tier 0: No DR          - RTO: Days,    RPO: Days          |
|  Tier 1: Cold Site      - RTO: 24h,     RPO: Days          |
|  Tier 2: Warm Site      - RTO: 4-24h,   RPO: Hours         |
|  Tier 3: Hot Site       - RTO: Minutes, RPO: Minutes       |
|  Tier 4: Active-Active  - RTO: Zero,    RPO: Zero          |
|                                                            |
+------------------------------------------------------------+
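RTO and RPO are only meaningful if they are measured. Below is a minimal sketch of an automated RPO check, assuming backups land in a hypothetical /backup directory and a one-hour RPO; adjust both to your environment.

Terminal window
# Warn when the newest backup is older than the RPO (1-hour RPO assumed)
RPO_SECONDS=3600
NEWEST=$(find /backup -type f -printf '%T@\n' | sort -n | tail -1)
if [ -z "$NEWEST" ]; then
    echo "CRITICAL: no backups found"
    exit 1
fi
AGE=$(( $(date +%s) - ${NEWEST%.*} ))
if [ "$AGE" -gt "$RPO_SECONDS" ]; then
    echo "RPO violated: newest backup is ${AGE}s old"
fi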

DR SITE ARCHITECTURES
+------------------------------------------------------------+
|                                                            |
|  COLD SITE                                                 |
|  ---------                                                 |
|                                                            |
|  Primary Site              Cold Site                       |
|  +----------+              +----------+                    |
|  | Active   |              | Empty    |                    |
|  | Systems  |              | Building |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Deploy hardware      |                             |
|       | Restore from backup  |                             |
|       | Bring online         |                             |
|       +----------------------+                             |
|  RTO: 24-72 hours                                          |
|                                                            |
|  WARM SITE                                                 |
|  ---------                                                 |
|                                                            |
|  Primary Site              Warm Site                       |
|  +----------+              +----------+                    |
|  | Active   |<------------>| Partial  |                    |
|  | Systems  |   Replica    | Systems  |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Promote to primary  |                              |
|       | Restore recent data |                              |
|       +----------------------+                             |
|  RTO: 1-4 hours                                            |
|                                                            |
|  HOT SITE                                                  |
|  --------                                                  |
|                                                            |
|  Primary Site              Hot Site                        |
|  +----------+              +----------+                    |
|  | Active   |====Sync=====>| Mirrored |                    |
|  | Systems  |              | Systems  |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Automatic or         |                             |
|       | one-click failover   |                             |
|       +----------------------+                             |
|  RTO: Minutes                                              |
|                                                            |
+------------------------------------------------------------+
Terminal window
# =============================================================================
# BLOCK-LEVEL REPLICATION (DRBD)
# =============================================================================

# Install DRBD
apt-get install drbd-utils

# Configure DRBD (/etc/drbd.d/r0.res)
resource r0 {
    on server1 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.10:7788;
        meta-disk internal;
    }
    on server2 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.20:7788;
        meta-disk internal;
    }
}

# Initialize DRBD metadata and bring the resource up (run on both nodes)
drbdadm create-md r0
drbdadm up r0

# Promote one node (on the primary site only)
drbdadm primary --force r0

# Verify status
cat /proc/drbd
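
# -----------------------------------------------------------------------------
# DRBD FAILOVER (sketch)
# -----------------------------------------------------------------------------
# If the primary site is lost, the DR node can promote its replica and serve
# the data. A minimal sketch -- the mount point and service name below are
# illustrative assumptions, not part of the DRBD setup above.
drbdadm primary r0            # promote the surviving node
mount /dev/drbd0 /data       # mount the replicated volume
systemctl start myapp        # start services that depend on the data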
# =============================================================================
# DATABASE REPLICATION
# =============================================================================

# MySQL/MariaDB master-slave replication

# On Master (/etc/mysql/my.cnf):
[mysqld]
server-id    = 1
log-bin      = mysql-bin
binlog-do-db = myapp

# On Slave (/etc/mysql/my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
read-only = 1

# Commands on Master (MariaDB / MySQL <= 5.7 syntax; MySQL 8 uses CREATE USER):
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
# Note the current binlog file and position for the slave:
SHOW MASTER STATUS;

# Commands on Slave (use the file/position reported by SHOW MASTER STATUS):
CHANGE MASTER TO
    MASTER_HOST='master_ip',
    MASTER_USER='repl',
    MASTER_PASSWORD='password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=0;
START SLAVE;
SHOW SLAVE STATUS\G
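
# -----------------------------------------------------------------------------
# REPLICATION LAG CHECK (sketch)
# -----------------------------------------------------------------------------
# Replication only meets the RPO while it keeps up. A hedged sketch that
# alerts when the replica falls behind; it assumes credentials in ~/.my.cnf
# and a 15-minute (900 s) RPO -- both are assumptions, not part of the setup.
RPO_SECONDS=900
LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
    echo "CRITICAL: replication is not running"
elif [ "$LAG" -gt "$RPO_SECONDS" ]; then
    echo "WARNING: replica is ${LAG}s behind (RPO is ${RPO_SECONDS}s)"
fi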
# =============================================================================
# FILE-LEVEL REPLICATION (rsync + inotify)
# =============================================================================
#!/bin/bash
# rsync-based file replication: push changes to the DR site as they happen
# (requires the inotify-tools package)
SOURCE="/data"
DEST="backup@dr-site:/backup/data"

# Near-real-time sync: re-run rsync whenever files change
inotifywait -mrq -e create,modify,delete,move "$SOURCE" | while read -r event; do
    rsync -avz --delete -e ssh "$SOURCE/" "$DEST/"
done
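
The inotify loop above only replicates while it is running. One way to keep it alive across crashes and reboots is a systemd service; a minimal sketch, assuming the script is saved as /usr/local/bin/dr-file-sync.sh (a hypothetical path and unit name):

Terminal window
# /etc/systemd/system/dr-file-sync.service (hypothetical unit)
[Unit]
Description=Continuous file replication to DR site
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/dr-file-sync.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable it:
# systemctl daemon-reload && systemctl enable --now dr-file-sync.service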

DISASTER RECOVERY PLAN STRUCTURE
+------------------------------------------------------------+
|                                                            |
|  DR PLAN SECTIONS                                          |
|                                                            |
|  +--------------+  +--------------+  +--------------+      |
|  | 1. RISK      |  | 2. TEAM      |  | 3. DOCUMENT  |      |
|  |  ASSESSMENT  |  |  ROLES       |  |  INVENTORY   |      |
|  |              |  |              |  |              |      |
|  | Identify     |  | DR manager   |  | Systems      |      |
|  | threats      |  | Technical    |  | Applications |      |
|  |              |  | support      |  | Data         |      |
|  | Evaluate     |  | Communica-   |  | Dependencies |      |
|  | likelihood   |  | tions        |  |              |      |
|  +--------------+  +--------------+  +--------------+      |
|                                                            |
|  +--------------+  +--------------+  +--------------+      |
|  | 4. BACKUP    |  | 5. FAILOVER  |  | 6. RETURN    |      |
|  |  STRATEGY    |  |  PROCEDURES  |  |  PROCEDURES  |      |
|  |              |  |              |  |              |      |
|  | Replication  |  | Step-by-step |  | Step-by-step |      |
|  | Schedule     |  | failover     |  | failback     |      |
|  | Retention    |  | Runbooks     |  | Validation   |      |
|  | Testing      |  | Scripts      |  | Timeline     |      |
|  +--------------+  +--------------+  +--------------+      |
|                                                            |
|  +--------------------------------------------------+      |
|  | 7. TESTING & MAINTENANCE                         |      |
|  +--------------------------------------------------+      |
|  | - Monthly tabletop exercises                     |      |
|  | - Quarterly partial failover tests               |      |
|  | - Annual full failover tests                     |      |
|  | - Document lessons learned                       |      |
|  | - Update plan based on changes                   |      |
|  +--------------------------------------------------+      |
|                                                            |
+------------------------------------------------------------+
Terminal window
# =============================================================================
# DOCUMENTATION TEMPLATE
# =============================================================================
# Create DR plan document: /root/dr-plan.md
# Document structure:
# 1. Executive Summary
# 2. Scope and Objectives
# 3. Risk Assessment
# 4. Contact Information
# 5. System Inventory
# 6. Recovery Procedures
# 7. Runbooks
# 8. Testing Schedule
# Example system inventory format:
cat > /root/dr-inventory.csv <<'EOF'
System,Application,Owner,RTO,RPO,Critical,Dependencies
web01,nginx,admin@company.com,1h,15min,Yes,DB01
db01,mysql,admin@company.com,2h,1h,Yes,-
app01,nodejs,admin@company.com,1h,15min,Yes,DB01
EOF
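
An inventory is only trustworthy if it stays complete. Below is a small hedged sketch that checks every row has RTO and RPO values, assuming the CSV layout created above:

Terminal window
# Fail if any system in the inventory is missing its RTO or RPO column
awk -F, 'NR > 1 && ($4 == "" || $5 == "") {
    printf "Missing RTO/RPO for %s\n", $1; bad = 1
} END { exit bad }' /root/dr-inventory.csv && echo "Inventory OK"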

Terminal window
# =============================================================================
# KEEPALIVED SETUP FOR HA
# =============================================================================

# Install keepalived
apt-get install keepalived

# /etc/keepalived/keepalived.conf on PRIMARY
global_defs {
    router_id LVS_DEVEL
    notification_email {
        admin@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server localhost
    smtp_connect_timeout 30
}

# Define the health check before the instance that tracks it
vrrp_script check_nginx {
    script "/usr/local/bin/check_nginx.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret123
    }
    virtual_ipaddress {
        192.168.1.100 dev eth0
    }
    track_script {
        check_nginx
    }
}

# /etc/keepalived/keepalived.conf on BACKUP
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    # ... rest same as master
}

# Nginx check script
cat > /usr/local/bin/check_nginx.sh <<'EOF'
#!/bin/bash
# Exit non-zero if nginx is not running so keepalived lowers priority
if ! pgrep nginx > /dev/null; then
    exit 1
fi
exit 0
EOF
chmod +x /usr/local/bin/check_nginx.sh

# Start keepalived
systemctl enable keepalived
systemctl start keepalived
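
Before trusting keepalived with a real failover, verify that the VIP actually moves. A quick manual test, assuming the eth0 interface and VIP from the configuration above:

Terminal window
# On the MASTER: confirm it currently holds the VIP
ip addr show dev eth0 | grep 192.168.1.100

# Simulate a service failure and watch the transition
systemctl stop nginx
journalctl -u keepalived -f    # expect the MASTER to step down

# On the BACKUP: the VIP should now be present
ip addr show dev eth0 | grep 192.168.1.100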

Terminal window
#!/bin/bash
#===============================================================================
# Manual Failover Script
# Run this on the DR site when the primary fails
#===============================================================================
set -euo pipefail

# Configuration
PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_USER="appuser"
APP_DIR="/opt/application"
DB_NAME="myapp"
DB_USER="dbuser"

log() {
    echo "[$(date)] $1"
}

# Step 1: Verify primary is unreachable
check_primary_down() {
    log "Checking if primary is down..."
    if ping -c 3 -W 2 "$PRIMARY_IP" &>/dev/null; then
        log "WARNING: Primary is still reachable!"
        read -p "Continue anyway? (yes/no): " confirm
        if [ "$confirm" != "yes" ]; then
            exit 1
        fi
    fi
    log "Primary confirmed down"
}

# Step 2: Stop replication
stop_replication() {
    log "Stopping replication..."
    mysql -u root -e "STOP SLAVE;"
    mysql -u root -e "RESET SLAVE ALL;"
}

# Step 3: Promote DR server
promote_server() {
    log "Promoting DR server..."
    # Make the database writable as the new master
    mysql -u root -e "RESET MASTER;"
    # Start application services
    systemctl start nginx
    systemctl start myapp
    # Configure the VIP on the DR node
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Update DNS
update_dns() {
    log "Updating DNS..."
    nsupdate -k /etc/bind/ddns.key <<EOF
server ns1.example.com
update delete app.example.com A
update add app.example.com 300 A $DR_IP
send
EOF
}

# Main
main() {
    log "Starting manual failover to DR site"
    check_primary_down
    stop_replication
    promote_server
    update_dns
    log "Failover completed successfully"
}
main "$@"

Terminal window
#!/bin/bash
#===============================================================================
# Failback Script
# Run this to return to the primary site after repairs
#===============================================================================
set -euo pipefail

PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_DIR="/opt/application"

log() {
    echo "[$(date)] $1"
}

# Step 1: Sync data from DR to Primary
sync_data() {
    log "Syncing data from DR to Primary..."
    # Sync database
    mysqldump -u root myapp | mysql -h "$PRIMARY_IP" myapp
    # Sync files
    rsync -avz -e ssh "$DR_IP:$APP_DIR/" "$APP_DIR/"
}

# Step 2: Stop services on DR
stop_dr_services() {
    log "Stopping DR services..."
    ssh "$DR_IP" "systemctl stop nginx"
    ssh "$DR_IP" "systemctl stop myapp"
    ssh "$DR_IP" "ip addr del $VIP/24 dev eth0"
}

# Step 3: Start services on Primary
start_primary() {
    log "Starting Primary services..."
    systemctl start nginx
    systemctl start myapp
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Verify
verify() {
    log "Verifying Primary..."
    if ping -c 3 "$VIP" &>/dev/null; then
        log "Failback completed successfully"
    else
        log "ERROR: Failback verification failed"
        exit 1
    fi
}

main() {
    log "Starting failback to Primary site"
    sync_data
    stop_dr_services
    start_primary
    verify
}
main "$@"
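
The ping in verify() only proves the VIP answers ICMP; an application-level smoke test gives more confidence before declaring failback complete. A hedged sketch, assuming the application serves HTTP on the VIP and exposes a hypothetical /health endpoint:

Terminal window
# Application-level smoke test after failback (endpoint is illustrative)
if curl -fsS --max-time 5 "http://192.168.1.100/health" > /dev/null; then
    echo "Smoke test passed: application is serving traffic"
else
    echo "Smoke test FAILED: investigate before closing the incident"
    exit 1
fi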

DR TESTING TIERS
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 1: TABLETOP                                     |  |
|  |                                                      |  |
|  | Frequency: Monthly                                   |  |
|  | Duration: 2-4 hours                                  |  |
|  |                                                      |  |
|  | - Walk through DR procedures                         |  |
|  | - Discuss scenarios                                  |  |
|  | - Identify gaps                                      |  |
|  | - No actual failover                                 |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 2: COMPONENT TEST                               |  |
|  |                                                      |  |
|  | Frequency: Quarterly                                 |  |
|  | Duration: 1-2 days                                   |  |
|  |                                                      |  |
|  | - Test individual components                         |  |
|  | - Verify backup restoration                          |  |
|  | - Test failover of single system                     |  |
|  | - Validate monitoring                                |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 3: FULL FAILOVER                                |  |
|  |                                                      |  |
|  | Frequency: Annually                                  |  |
|  | Duration: 1-2 days                                   |  |
|  |                                                      |  |
|  | - Complete failover to DR site                       |  |
|  | - Run production workload                            |  |
|  | - Test all systems                                   |  |
|  | - Test failback                                      |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+
DR TEST CHECKLIST
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | PRE-TEST                                             |  |
|  +------------------------------------------------------+  |
|  | □ Schedule maintenance window                        |  |
|  | □ Notify all stakeholders                            |  |
|  | □ Verify backups are current                         |  |
|  | □ Document baseline performance                      |  |
|  | □ Prepare test scripts                               |  |
|  | □ Verify DR site is ready                            |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | DURING TEST                                          |  |
|  +------------------------------------------------------+  |
|  | □ Execute failover procedure                         |  |
|  | □ Verify all services start                          |  |
|  | □ Test application functionality                     |  |
|  | □ Test user access                                   |  |
|  | □ Verify data integrity                              |  |
|  | □ Document any issues                                |  |
|  | □ Time each step                                     |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | POST-TEST                                            |  |
|  +------------------------------------------------------+  |
|  | □ Execute failback procedure                         |  |
|  | □ Verify primary is restored                         |  |
|  | □ Compare performance to baseline                    |  |
|  | □ Document lessons learned                           |  |
|  | □ Update DR plan if needed                           |  |
|  | □ Send summary to stakeholders                       |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+
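The "Time each step" item is easy to automate. A minimal sketch of a wrapper that records how long each runbook step takes, assuming the failover functions from the script above are sourced into the same shell:

Terminal window
# Wrap each runbook step and log its duration for the test report
timed_step() {
    local name="$1"; shift
    local start
    start=$(date +%s)
    "$@"    # run the actual step
    echo "$name took $(( $(date +%s) - start ))s" >> /root/dr-test-timings.log
}

# Example usage:
# timed_step "stop_replication" stop_replication
# timed_step "promote_server"   promote_server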

DISASTER RECOVERY BEST PRACTICES
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | PLANNING                                             |  |
|  +------------------------------------------------------+  |
|  | □ Define RTO and RPO for each system                 |  |
|  | □ Document all dependencies                          |  |
|  | □ Maintain up-to-date system inventory               |  |
|  | □ Regular DR plan reviews (quarterly)                |  |
|  | □ Document contact information                       |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | INFRASTRUCTURE                                       |  |
|  +------------------------------------------------------+  |
|  | □ Geographic separation of DR site                   |  |
|  | □ Redundant network connectivity                     |  |
|  | □ Regular testing of DR site                         |  |
|  | □ Monitor replication status                         |  |
|  | □ Keep DR site powered and cooled                    |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | PROCEDURES                                           |  |
|  +------------------------------------------------------+  |
|  | □ Document clear runbooks                            |  |
|  | □ Automate where possible                            |  |
|  | □ Regular testing (monthly tabletop)                 |  |
|  | □ Annual full failover test                          |  |
|  | □ Clear escalation procedures                        |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | COMMUNICATION                                        |  |
|  +------------------------------------------------------+  |
|  | □ Stakeholder notification procedures                |  |
|  | □ Status update templates                            |  |
|  | □ Post-incident communication                        |  |
|  | □ Public relations plan (if needed)                  |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+

Disaster recovery planning is essential for business continuity:

Disaster Recovery in DevOps/SRE
+------------------------------------------------------------------+
| |
| RTO/RPO Tiers: |
| +----------------------------------------------------------+ |
| | Tier 1 -> RTO < 4h, RPO < 15min (Mission Critical) | |
| | Tier 2 -> RTO < 24h, RPO < 1h (Business Critical) | |
| | Tier 3 -> RTO < 72h, RPO < 24h (Standard) | |
| +----------------------------------------------------------+ |
| |
| Multi-Region Strategies: |
| +----------------------------------------------------------+ |
| | Active-Active -> Both regions serve traffic | |
| | Active-Passive -> Standby until failover | |
| | Pilot Light -> Minimal version of core services | |
| +----------------------------------------------------------+ |
| |
| SRE Integration: |
| +----------------------------------------------------------+ |
| | Error budgets -> Don't over-invest in DR | |
| | Chaos engineering -> Test DR regularly | |
| | Post-Mortem -> Learn from failures | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Practical Impact:

  • Meet business continuity requirements
  • Comply with regulations
  • Minimize financial impact of disasters

Terminal window
# WRONG: DR plan never tested
# Assumes everything will work
# First test during actual disaster = failure

# CORRECT: Regular DR tests
# Monthly tabletop exercises
# Quarterly failover tests
# Annual full DR drill
Terminal window
# WRONG: All systems in one region
# Regional outage = complete outage

# CORRECT: Multi-region architecture
# Primary/secondary regions
# Cross-region replication
Terminal window
# WRONG: Treating all systems equally
# Over-investing in non-critical systems
# Under-investing in critical systems

# CORRECT: Tier-based DR
# Critical systems first
# Cost-effective solutions

Review Questions

  1. What is the difference between RTO and RPO?
  2. Explain different DR strategies.
  3. What is a disaster recovery runbook?
  4. How often should you test DR plans?
  5. What is chaos engineering?


End of Chapter 46: Disaster Recovery Planning