Chapter 46: Disaster Recovery Planning
Comprehensive Disaster Recovery Strategies for Linux Systems

46.1 Disaster Recovery Fundamentals

Understanding Disaster Recovery

Disaster recovery (DR) is the process of restoring IT infrastructure and operations after a catastrophic event. Unlike regular backup recovery, DR deals with site-wide failures and aims to minimize business downtime.
DISASTER RECOVERY CONCEPTS

DISASTER TYPES

+------------------------+------------------------+
| NATURAL                | HUMAN                  |
+------------------------+------------------------+
| - Earthquake           | - Cyberattack          |
| - Flood                | - Terrorism            |
| - Fire                 | - Sabotage             |
| - Lightning            | - Human error          |
| - Storm                | - War                  |
+------------------------+------------------------+
| TECHNICAL              | FACILITY               |
+------------------------+------------------------+
| - Hardware failure     | - Power outage         |
| - Ransomware           | - HVAC failure         |
| - Software corruption  | - Building damage      |
|                        | - Access denial        |
+------------------------+------------------------+

Key DR Metrics
DISASTER RECOVERY METRICS

RTO - RECOVERY TIME OBJECTIVE
─────────────────────────────

  Disaster ─────────────────────→ System Online
     │                                 │
     └────────────── RTO ──────────────┘

  Maximum acceptable downtime

RPO - RECOVERY POINT OBJECTIVE
──────────────────────────────

  Last Backup ─────────→ Disaster ──→ Recovery
       │                    │
       └─────── RPO ────────┘

  Maximum acceptable data loss (time)

TIER CLASSIFICATION (typical targets; actual values depend on the environment)
───────────────────

  Tier 0: No DR          - RTO: Days,      RPO: Days
  Tier 1: Cold Site      - RTO: 24h,       RPO: Days
  Tier 2: Warm Site      - RTO: 4-24h,     RPO: Hours
  Tier 3: Hot Site       - RTO: Minutes,   RPO: Minutes
  Tier 4: Active-Active  - RTO: Near zero, RPO: Near zero

46.2 Disaster Recovery Strategies
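The RPO defined in Section 46.1 can be turned into a simple check: compare the age of the most recent backup with the allowed data-loss window. The following is a minimal sketch rather than a complete monitoring solution; the function name `rpo_status` and the backup path in the usage note are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report whether the newest backup still satisfies the RPO.
# Arguments: <last_backup_epoch> <now_epoch> <rpo_seconds>
rpo_status() {
    local last_backup="$1" now="$2" rpo="$3"
    local age=$(( now - last_backup ))
    if (( age <= rpo )); then
        echo "OK: backup age ${age}s is within RPO ${rpo}s"
    else
        echo "VIOLATION: backup age ${age}s exceeds RPO ${rpo}s"
        return 1
    fi
}
```

With GNU coreutils, a call for a one-hour RPO might look like `rpo_status "$(stat -c %Y /backup/latest.tar.gz)" "$(date +%s)" 3600`, where `/backup/latest.tar.gz` stands in for whatever file the backup job actually produces.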
Site Architectures

DR SITE ARCHITECTURES

COLD SITE
─────────

  Primary Site            Cold Site
  ┌──────────┐            ┌──────────┐
  │ Active   │            │ Empty    │
  │ Systems  │            │ Building │
  └────┬─────┘            └────┬─────┘
       │        Disaster       │
       └───────────×───────────┘
                   │
                   ▼
        ┌─────────────────────┐
        │ Deploy hardware     │
        │ Restore from backup │
        │ Bring online        │
        └─────────────────────┘
        RTO: 24-72 hours

WARM SITE
─────────

  Primary Site            Warm Site
  ┌──────────┐            ┌──────────┐
  │ Active   │◄──────────→│ Partial  │
  │ Systems  │  Replica   │ Systems  │
  └────┬─────┘            └────┬─────┘
       │        Disaster       │
       └───────────×───────────┘
                   │
                   ▼
        ┌─────────────────────┐
        │ Promote to primary  │
        │ Restore recent data │
        └─────────────────────┘
        RTO: 1-4 hours

HOT SITE
────────

  Primary Site            Hot Site
  ┌──────────┐            ┌──────────┐
  │ Active   │════Sync═══→│ Mirrored │
  │ Systems  │            │ Systems  │
  └────┬─────┘            └────┬─────┘
       │        Disaster       │
       └───────────×───────────┘
                   │
                   ▼
        ┌─────────────────────┐
        │ Failover automatic  │
        │ or one-click        │
        └─────────────────────┘
        RTO: Minutes

Data Replication Strategies
# =============================================================================
# BLOCK-LEVEL REPLICATION (DRBD)
# =============================================================================

# Install DRBD
apt-get install drbd-utils

# Configure DRBD (/etc/drbd.d/r0.res, identical on both nodes)
resource r0 {
    on server1 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.10:7788;
        meta-disk internal;
    }
    on server2 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.20:7788;
        meta-disk internal;
    }
}

# Initialize DRBD (run on both nodes)
drbdadm create-md r0
drbdadm up r0

# Primary (on primary site only; --force is for the initial promotion,
# when neither node has up-to-date data yet)
drbdadm primary --force r0

# Verify status (DRBD 8.x; on DRBD 9 use "drbdadm status r0" instead)
cat /proc/drbd
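For unattended monitoring, the status check above can be reduced to a single verdict. The sketch below assumes the two-node resource `r0` from this section and classifies the strings printed by `drbdadm cstate` and `drbdadm dstate`; the function name `classify_drbd_state` is hypothetical, and setups with more than one peer may print several lines that this sketch does not handle.

```shell
#!/bin/bash
# Hypothetical helper: classify a DRBD resource from its connection and disk states.
# Arguments: <cstate> <dstate>, e.g. "Connected" "UpToDate/UpToDate"
classify_drbd_state() {
    local cstate="$1" dstate="$2"
    if [[ "$cstate" == "Connected" && "$dstate" == "UpToDate/UpToDate" ]]; then
        echo "OK"
    elif [[ "$cstate" == Sync* ]]; then
        # SyncSource / SyncTarget: a resynchronization is in progress
        echo "SYNCING"
    else
        echo "DEGRADED"
        return 1
    fi
}
```

On a node with DRBD installed, this could be called as `classify_drbd_state "$(drbdadm cstate r0)" "$(drbdadm dstate r0)"`.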
# =============================================================================
# DATABASE REPLICATION
# =============================================================================

# MySQL/MariaDB Master-Slave Replication
# On Master (/etc/mysql/my.cnf):
[mysqld]
server-id    = 1
log-bin      = mysql-bin
binlog-do-db = myapp

# On Slave (/etc/mysql/my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
read-only = 1

# Commands on Master (CREATE USER + GRANT also works on MySQL 8, where
# GRANT ... IDENTIFIED BY was removed; 'password' is a placeholder):
CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

# Commands on Slave (replace MASTER_LOG_FILE and MASTER_LOG_POS with the
# values reported by SHOW MASTER STATUS on the master):
CHANGE MASTER TO
    MASTER_HOST='master_ip',
    MASTER_USER='repl',
    MASTER_PASSWORD='password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=0;
START SLAVE;
SHOW SLAVE STATUS\G
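The output of `SHOW SLAVE STATUS\G` includes a `Seconds_Behind_Master` field, which gives a rough measure of how far the replica lags and therefore how much data could be lost on failover. The following sketch reads that output on standard input and compares the lag with a threshold; the function name `check_replication_lag` and the 60-second limit in the usage note are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read "SHOW SLAVE STATUS\G" output on stdin and check the lag.
# Argument: <max_lag_seconds>
check_replication_lag() {
    local max_lag="$1" lag
    # Extract the value after "Seconds_Behind_Master: "
    lag="$(awk -F': ' '/Seconds_Behind_Master:/ {print $2}')"
    if [[ -z "$lag" || "$lag" == "NULL" ]]; then
        # NULL (or a missing field) means replication is not running
        echo "CRITICAL: replication is not running"
        return 2
    elif (( lag > max_lag )); then
        echo "WARNING: replica is ${lag}s behind (limit ${max_lag}s)"
        return 1
    fi
    echo "OK: replica is ${lag}s behind"
}
```

On the replica this might be run as `mysql -u root -e 'SHOW SLAVE STATUS\G' | check_replication_lag 60`, for example from cron or a monitoring agent.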
# =============================================================================
# FILE-LEVEL REPLICATION (rsync + inotify)
# =============================================================================

# Replication script
#!/bin/bash
# rsync-based file replication (requires inotify-tools and key-based SSH access)

SOURCE="/data"
DEST="backup@dr-site:/backup/data"

# Near-real-time sync: run rsync whenever inotify reports a change under $SOURCE
inotifywait -mrq -e create,modify,delete,move "$SOURCE" | while read -r _event; do
    rsync -avz --delete -e ssh "$SOURCE/" "$DEST/"
done

46.3 Disaster Recovery Planning
DR Plan Components

DISASTER RECOVERY PLAN STRUCTURE

DR PLAN SECTIONS

1. RISK ASSESSMENT
   - Identify threats
   - Evaluate likelihood

2. TEAM ROLES
   - DR manager
   - Technical support
   - Communications

3. DOCUMENT INVENTORY
   - Systems
   - Applications
   - Data
   - Dependencies

4. BACKUP STRATEGY
   - Replication
   - Schedule
   - Retention
   - Testing

5. FAILOVER PROCEDURES
   - Step-by-step failover
   - Runbooks
   - Scripts

6. RETURN PROCEDURES
   - Step-by-step failback
   - Validation
   - Timeline

7. TESTING & MAINTENANCE
   - Monthly tabletop exercises
   - Quarterly partial failover tests
   - Annual full failover tests
   - Document lessons learned
   - Update plan based on changes

Creating a DR Plan
# =============================================================================
# DOCUMENTATION TEMPLATE
# =============================================================================

# Create DR plan document: /root/dr-plan.md

# Document structure:
# 1. Executive Summary
# 2. Scope and Objectives
# 3. Risk Assessment
# 4. Contact Information
# 5. System Inventory
# 6. Recovery Procedures
# 7. Runbooks
# 8. Testing Schedule
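The document structure above can be generated as an empty skeleton so that every plan starts with the same eight sections. This is a minimal sketch; the function name `create_dr_plan_skeleton`, the Markdown heading style, and the `TODO` placeholders are assumptions, not part of any standard template.

```shell
#!/bin/bash
# Hypothetical helper: write an empty DR plan skeleton with the eight sections above.
# Argument: <output_file> (the text above uses /root/dr-plan.md)
create_dr_plan_skeleton() {
    local out="$1" i=1 section
    local sections=(
        "Executive Summary"
        "Scope and Objectives"
        "Risk Assessment"
        "Contact Information"
        "System Inventory"
        "Recovery Procedures"
        "Runbooks"
        "Testing Schedule"
    )
    : > "$out"    # create or truncate the output file
    for section in "${sections[@]}"; do
        printf '## %d. %s\n\nTODO\n\n' "$i" "$section" >> "$out"
        i=$(( i + 1 ))
    done
}
```

Calling `create_dr_plan_skeleton /root/dr-plan.md` overwrites any existing file at that path, so it is best run once when the plan is first created.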
# Example system inventory format:
cat > /root/dr-inventory.csv <<'EOF'
System,Application,Owner,RTO,RPO,Critical,Dependencies
web01,nginx,admin@company.com,1h,15min,Yes,db01
db01,mysql,admin@company.com,2h,1h,Yes,-
app01,nodejs,admin@company.com,1h,15min,Yes,db01
EOF

46.4 Failover and Failback Procedures
Automated Failover with Keepalived

# =============================================================================
# KEEPALIVED SETUP FOR HA
# =============================================================================

# Install keepalived
apt-get install keepalived

# /etc/keepalived/keepalived.conf on PRIMARY
global_defs {
    router_id LVS_DEVEL
    notification_email {
        admin@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server localhost
    smtp_connect_timeout 30
}

# Define the check before the vrrp_instance that tracks it
vrrp_script check_nginx {
    script "/usr/local/bin/check_nginx.sh"
    interval 2
    weight -20    # on failure, priority drops to 80, below the BACKUP node's 90
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret123    # placeholder; only the first 8 characters are used
    }
    virtual_ipaddress {
        192.168.1.100 dev eth0
    }
    track_script {
        check_nginx
    }
}

# /etc/keepalived/keepalived.conf on BACKUP
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    # ... rest same as master
}

# Nginx check script
cat > /usr/local/bin/check_nginx.sh <<'EOF'
#!/bin/bash
# Exit non-zero when no nginx process is running.
# -x matches the exact process name, so check_nginx.sh itself is not counted.
if ! pgrep -x nginx > /dev/null; then
    exit 1
fi
exit 0
EOF

chmod +x /usr/local/bin/check_nginx.sh
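After keepalived has been started on both nodes (see the start commands that follow), it is useful to confirm which node currently holds the virtual IP. The sketch below reads the output of `ip -o -4 addr show` on standard input and reports whether a given address is configured; the function name `holds_vip` is hypothetical, and the address and interface in the usage note are the ones used in this section.

```shell
#!/bin/bash
# Hypothetical helper: read "ip -o -4 addr show" output on stdin and succeed
# if the given virtual IP is configured on this node.
# Argument: <vip>, e.g. 192.168.1.100
holds_vip() {
    local vip="$1"
    # The leading space and trailing "/" anchor the match, so a search for
    # 192.168.1.10 does not match a line containing 192.168.1.100.
    grep -qF " ${vip}/"
}
```

A quick check on either node might be `ip -o -4 addr show dev eth0 | holds_vip 192.168.1.100 && echo "VIP is on this node"`. Stopping nginx on the primary and repeating the check on both nodes is a simple way to see the failover happen.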
# Start keepalived
systemctl enable keepalived
systemctl start keepalived

Manual Failover Procedure
#!/bin/bash
#===============================================================================
# Manual Failover Script
# Run this on DR site when primary fails
#===============================================================================

set -euo pipefail

# Configuration
PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_USER="appuser"
APP_DIR="/opt/application"
DB_NAME="myapp"
DB_USER="dbuser"

log() {
    echo "[$(date)] $1"
}

# Step 1: Verify primary is unreachable
check_primary_down() {
    log "Checking if primary is down..."
    if ping -c 3 -W 2 "$PRIMARY_IP" &>/dev/null; then
        log "WARNING: Primary is still reachable!"
        read -r -p "Continue anyway? (yes/no): " confirm
        if [ "$confirm" != "yes" ]; then
            exit 1
        fi
        log "Continuing at operator request"
    else
        log "Primary confirmed down"
    fi
}

# Step 2: Stop replication
stop_replication() {
    log "Stopping replication..."
    # Database commands to stop replication
    mysql -u root -e "STOP SLAVE;"
    mysql -u root -e "RESET SLAVE ALL;"
}

# Step 3: Promote DR server
promote_server() {
    log "Promoting DR server..."
    # Make database writable (the replica was configured with read-only = 1)
    mysql -u root -e "SET GLOBAL read_only = 0;"
    # Clear existing binary logs on the new primary
    mysql -u root -e "RESET MASTER;"

    # Start application services
    systemctl start nginx
    systemctl start myapp

    # Configure VIP on DR
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Update DNS
update_dns() {
    log "Updating DNS..."
    # Update DNS records (the heredoc is unquoted so that $DR_IP is expanded)
    nsupdate -k /etc/bind/ddns.key <<EOF
server ns1.example.com
update delete app.example.com A
update add app.example.com 300 A $DR_IP
send
EOF
}

# Main
main() {
    log "Starting manual failover to DR site"
    check_primary_down
    stop_replication
    promote_server
    update_dns
    log "Failover completed successfully"
}

main "$@"

Failback Procedure
#!/bin/bash
#===============================================================================
# Failback Script
# Run this on the primary site to return to it after repairs
#===============================================================================

set -euo pipefail

PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_DIR="/opt/application"    # same path as in the failover script

log() {
    echo "[$(date)] $1"
}

# Step 1: Stop services on DR (done first so no new writes arrive during the sync)
stop_dr_services() {
    log "Stopping DR services..."
    ssh "$DR_IP" "systemctl stop nginx"
    ssh "$DR_IP" "systemctl stop myapp"
    ssh "$DR_IP" "ip addr del $VIP/24 dev eth0"
}

# Step 2: Sync data from DR to Primary
sync_data() {
    log "Syncing data from DR to Primary..."
    # Sync database: dump from the DR server, load into the local (primary) server
    mysqldump -h "$DR_IP" -u root myapp | mysql -u root myapp

    # Sync files
    rsync -avz -e ssh "$DR_IP:$APP_DIR/" "$APP_DIR/"
}

# Step 3: Start services on Primary
start_primary() {
    log "Starting Primary services..."
    systemctl start nginx
    systemctl start myapp
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Verify
verify() {
    log "Verifying Primary..."
    if ping -c 3 "$VIP" &>/dev/null; then
        log "Failback completed successfully"
    else
        log "ERROR: Failback verification failed"
        exit 1
    fi
}

main() {
    log "Starting failback to Primary site"
    stop_dr_services
    sync_data
    start_primary
    verify
}

main "$@"

46.5 DR Testing
Testing Procedures

DR TESTING TIERS

TIER 1: TABLETOP
  Frequency: Monthly
  Duration:  2-4 hours
  - Walk through DR procedures
  - Discuss scenarios
  - Identify gaps
  - No actual failover

TIER 2: COMPONENT TEST
  Frequency: Quarterly
  Duration:  1-2 days
  - Test individual components
  - Verify backup restoration
  - Test failover of single system
  - Validate monitoring

TIER 3: FULL FAILOVER
  Frequency: Annually
  Duration:  1-2 days
  - Complete failover to DR site
  - Run production workload
  - Test all systems
  - Test failback

DR Test Checklist
DR TEST CHECKLIST

PRE-TEST
  □ Schedule maintenance window
  □ Notify all stakeholders
  □ Verify backups are current
  □ Document baseline performance
  □ Prepare test scripts
  □ Verify DR site is ready

DURING TEST
  □ Execute failover procedure
  □ Verify all services start
  □ Test application functionality
  □ Test user access
  □ Verify data integrity
  □ Document any issues
  □ Time each step

POST-TEST
  □ Execute failback procedure
  □ Verify primary is restored
  □ Compare performance to baseline
  □ Document lessons learned
  □ Update DR plan if needed
  □ Send summary to stakeholders

46.6 DR Best Practices
Summary Checklist

DISASTER RECOVERY BEST PRACTICES

PLANNING
  □ Define RTO and RPO for each system
  □ Document all dependencies
  □ Maintain up-to-date system inventory
  □ Regular DR plan reviews (quarterly)
  □ Document contact information

INFRASTRUCTURE
  □ Geographic separation of DR site
  □ Redundant network connectivity
  □ Regular testing of DR site
  □ Monitor replication status
  □ Keep DR site powered and cooled

PROCEDURES
  □ Document clear runbooks
  □ Automate where possible
  □ Regular testing (monthly tabletop)
  □ Annual full failover test
  □ Clear escalation procedures

COMMUNICATION
  □ Stakeholder notification procedures
  □ Status update templates
  □ Post-incident communication
  □ Public relations plan (if needed)

End of Chapter 46: Disaster Recovery Planning