
Disaster Recovery Planning

Comprehensive Disaster Recovery Strategies for Linux Systems


Disaster recovery (DR) is the process of restoring IT infrastructure and operations after a catastrophic event. Unlike a routine restore from backup, DR addresses site-wide failures and aims to minimize business downtime.

DISASTER RECOVERY CONCEPTS
+------------------------------------------------------------+
|                                                            |
|  DISASTER TYPES                                            |
|                                                            |
|  +---------------+     +---------------+                   |
|  | NATURAL       |     | HUMAN         |                   |
|  +---------------+     +---------------+                   |
|  | - Earthquake  |     | - Cyberattack |                   |
|  | - Flood       |     | - Terrorism   |                   |
|  | - Fire        |     | - Sabotage    |                   |
|  | - Lightning   |     | - Human error |                   |
|  | - Storm       |     | - War         |                   |
|  +---------------+     +---------------+                   |
|                                                            |
|  +---------------+     +-----------------+                 |
|  | TECHNICAL     |     | FACILITY        |                 |
|  +---------------+     +-----------------+                 |
|  | - Hardware    |     | - Power outage  |                 |
|  |   failure     |     | - HVAC failure  |                 |
|  | - Ransomware  |     | - Building      |                 |
|  | - Software    |     |   damage        |                 |
|  |   corruption  |     | - Access denial |                 |
|  +---------------+     +-----------------+                 |
|                                                            |
+------------------------------------------------------------+
DISASTER RECOVERY METRICS
+------------------------------------------------------------+
|                                                            |
|  RTO - RECOVERY TIME OBJECTIVE                             |
|  -----------------------------                             |
|                                                            |
|  Disaster ---------------------------> System Online       |
|      |                                      |              |
|      +---------------- RTO ----------------+               |
|                                                            |
|  Maximum acceptable downtime                               |
|                                                            |
|  RPO - RECOVERY POINT OBJECTIVE                            |
|  ------------------------------                            |
|                                                            |
|  Last Backup ---------> Disaster ----> Recovery            |
|       |                    |                               |
|       +-------- RPO -------+                               |
|                                                            |
|  Maximum acceptable data loss (time)                       |
|                                                            |
|  TIER CLASSIFICATION                                       |
|  -------------------                                       |
|                                                            |
|  Tier 0: No DR          - RTO: Days,    RPO: Days          |
|  Tier 1: Cold Site      - RTO: 24h,     RPO: Days          |
|  Tier 2: Warm Site      - RTO: 4-24h,   RPO: Hours         |
|  Tier 3: Hot Site       - RTO: Minutes, RPO: Minutes       |
|  Tier 4: Active-Active  - RTO: Zero,    RPO: Zero          |
|                                                            |
+------------------------------------------------------------+
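RTO and RPO are only meaningful if they are measured. Below is a minimal sketch of an automated RPO check, assuming backups land in a hypothetical /backup directory and a one-hour RPO; adjust both to your environment.

Terminal window
# Warn when the newest backup is older than the RPO (1-hour RPO assumed)
RPO_SECONDS=3600
NEWEST=$(find /backup -type f -printf '%T@\n' | sort -n | tail -1)
if [ -z "$NEWEST" ]; then
    echo "CRITICAL: no backups found"
    exit 1
fi
AGE=$(( $(date +%s) - ${NEWEST%.*} ))
if [ "$AGE" -gt "$RPO_SECONDS" ]; then
    echo "RPO violated: newest backup is ${AGE}s old"
fi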

DR SITE ARCHITECTURES
+------------------------------------------------------------+
|                                                            |
|  COLD SITE                                                 |
|  ---------                                                 |
|                                                            |
|  Primary Site              Cold Site                       |
|  +----------+              +----------+                    |
|  | Active   |              | Empty    |                    |
|  | Systems  |              | Building |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Deploy hardware      |                             |
|       | Restore from backup  |                             |
|       | Bring online         |                             |
|       +----------------------+                             |
|  RTO: 24-72 hours                                          |
|                                                            |
|  WARM SITE                                                 |
|  ---------                                                 |
|                                                            |
|  Primary Site              Warm Site                       |
|  +----------+              +----------+                    |
|  | Active   |<------------>| Partial  |                    |
|  | Systems  |   Replica    | Systems  |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Promote to primary  |                              |
|       | Restore recent data |                              |
|       +----------------------+                             |
|  RTO: 1-4 hours                                            |
|                                                            |
|  HOT SITE                                                  |
|  --------                                                  |
|                                                            |
|  Primary Site              Hot Site                        |
|  +----------+              +----------+                    |
|  | Active   |====Sync=====>| Mirrored |                    |
|  | Systems  |              | Systems  |                    |
|  +----+-----+              +----+-----+                    |
|       |                         |                          |
|       |        Disaster         |                          |
|       +------------x------------+                          |
|                    |                                       |
|                    v                                       |
|       +----------------------+                             |
|       | Automatic or         |                             |
|       | one-click failover   |                             |
|       +----------------------+                             |
|  RTO: Minutes                                              |
|                                                            |
+------------------------------------------------------------+
Terminal window
# =============================================================================
# BLOCK-LEVEL REPLICATION (DRBD)
# =============================================================================

# Install DRBD
apt-get install drbd-utils

# Configure DRBD (/etc/drbd.d/r0.res)
resource r0 {
    on server1 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.10:7788;
        meta-disk internal;
    }
    on server2 {
        device    /dev/drbd0;
        disk      /dev/sda1;
        address   192.168.1.20:7788;
        meta-disk internal;
    }
}

# Initialize DRBD metadata and bring the resource up (run on both nodes)
drbdadm create-md r0
drbdadm up r0

# Promote one node (on the primary site only)
drbdadm primary --force r0

# Verify status
cat /proc/drbd
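
# -----------------------------------------------------------------------------
# DRBD FAILOVER (sketch)
# -----------------------------------------------------------------------------
# If the primary site is lost, the DR node can promote its replica and serve
# the data. A minimal sketch -- the mount point and service name below are
# illustrative assumptions, not part of the DRBD setup above.
drbdadm primary r0            # promote the surviving node
mount /dev/drbd0 /data       # mount the replicated volume
systemctl start myapp        # start services that depend on the data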
# =============================================================================
# DATABASE REPLICATION
# =============================================================================

# MySQL/MariaDB master-slave replication

# On Master (/etc/mysql/my.cnf):
[mysqld]
server-id    = 1
log-bin      = mysql-bin
binlog-do-db = myapp

# On Slave (/etc/mysql/my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
read-only = 1

# Commands on Master (MariaDB / MySQL <= 5.7 syntax; MySQL 8 uses CREATE USER):
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
# Note the current binlog file and position for the slave:
SHOW MASTER STATUS;

# Commands on Slave (use the file/position reported by SHOW MASTER STATUS):
CHANGE MASTER TO
    MASTER_HOST='master_ip',
    MASTER_USER='repl',
    MASTER_PASSWORD='password',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=0;
START SLAVE;
SHOW SLAVE STATUS\G
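
# -----------------------------------------------------------------------------
# REPLICATION LAG CHECK (sketch)
# -----------------------------------------------------------------------------
# Replication only meets the RPO while it keeps up. A hedged sketch that
# alerts when the replica falls behind; it assumes credentials in ~/.my.cnf
# and a 15-minute (900 s) RPO -- both are assumptions, not part of the setup.
RPO_SECONDS=900
LAG=$(mysql -e "SHOW SLAVE STATUS\G" | awk '/Seconds_Behind_Master/ {print $2}')
if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
    echo "CRITICAL: replication is not running"
elif [ "$LAG" -gt "$RPO_SECONDS" ]; then
    echo "WARNING: replica is ${LAG}s behind (RPO is ${RPO_SECONDS}s)"
fi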
# =============================================================================
# FILE-LEVEL REPLICATION (rsync + inotify)
# =============================================================================
#!/bin/bash
# rsync-based file replication: push changes to the DR site as they happen
# (requires the inotify-tools package)
SOURCE="/data"
DEST="backup@dr-site:/backup/data"

# Near-real-time sync: re-run rsync whenever files change
inotifywait -mrq -e create,modify,delete,move "$SOURCE" | while read -r event; do
    rsync -avz --delete -e ssh "$SOURCE/" "$DEST/"
done
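
The inotify loop above only replicates while it is running. One way to keep it alive across crashes and reboots is a systemd service; a minimal sketch, assuming the script is saved as /usr/local/bin/dr-file-sync.sh (a hypothetical path and unit name):

Terminal window
# /etc/systemd/system/dr-file-sync.service (hypothetical unit)
[Unit]
Description=Continuous file replication to DR site
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/local/bin/dr-file-sync.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable it:
# systemctl daemon-reload && systemctl enable --now dr-file-sync.service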

DISASTER RECOVERY PLAN STRUCTURE
+------------------------------------------------------------+
|                                                            |
|  DR PLAN SECTIONS                                          |
|                                                            |
|  +--------------+  +--------------+  +--------------+      |
|  | 1. RISK      |  | 2. TEAM      |  | 3. DOCUMENT  |      |
|  |  ASSESSMENT  |  |  ROLES       |  |  INVENTORY   |      |
|  |              |  |              |  |              |      |
|  | Identify     |  | DR manager   |  | Systems      |      |
|  | threats      |  | Technical    |  | Applications |      |
|  |              |  | support      |  | Data         |      |
|  | Evaluate     |  | Communica-   |  | Dependencies |      |
|  | likelihood   |  | tions        |  |              |      |
|  +--------------+  +--------------+  +--------------+      |
|                                                            |
|  +--------------+  +--------------+  +--------------+      |
|  | 4. BACKUP    |  | 5. FAILOVER  |  | 6. RETURN    |      |
|  |  STRATEGY    |  |  PROCEDURES  |  |  PROCEDURES  |      |
|  |              |  |              |  |              |      |
|  | Replication  |  | Step-by-step |  | Step-by-step |      |
|  | Schedule     |  | failover     |  | failback     |      |
|  | Retention    |  | Runbooks     |  | Validation   |      |
|  | Testing      |  | Scripts      |  | Timeline     |      |
|  +--------------+  +--------------+  +--------------+      |
|                                                            |
|  +--------------------------------------------------+      |
|  | 7. TESTING & MAINTENANCE                         |      |
|  +--------------------------------------------------+      |
|  | - Monthly tabletop exercises                     |      |
|  | - Quarterly partial failover tests               |      |
|  | - Annual full failover tests                     |      |
|  | - Document lessons learned                       |      |
|  | - Update plan based on changes                   |      |
|  +--------------------------------------------------+      |
|                                                            |
+------------------------------------------------------------+
Terminal window
# =============================================================================
# DOCUMENTATION TEMPLATE
# =============================================================================
# Create DR plan document: /root/dr-plan.md
# Document structure:
# 1. Executive Summary
# 2. Scope and Objectives
# 3. Risk Assessment
# 4. Contact Information
# 5. System Inventory
# 6. Recovery Procedures
# 7. Runbooks
# 8. Testing Schedule
# Example system inventory format:
cat > /root/dr-inventory.csv <<'EOF'
System,Application,Owner,RTO,RPO,Critical,Dependencies
web01,nginx,admin@company.com,1h,15min,Yes,DB01
db01,mysql,admin@company.com,2h,1h,Yes,-
app01,nodejs,admin@company.com,1h,15min,Yes,DB01
EOF
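
An inventory is only trustworthy if it stays complete. Below is a small hedged sketch that checks every row has RTO and RPO values, assuming the CSV layout created above:

Terminal window
# Fail if any system in the inventory is missing its RTO or RPO column
awk -F, 'NR > 1 && ($4 == "" || $5 == "") {
    printf "Missing RTO/RPO for %s\n", $1; bad = 1
} END { exit bad }' /root/dr-inventory.csv && echo "Inventory OK"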

Terminal window
# =============================================================================
# KEEPALIVED SETUP FOR HA
# =============================================================================

# Install keepalived
apt-get install keepalived

# /etc/keepalived/keepalived.conf on PRIMARY
global_defs {
    router_id LVS_DEVEL
    notification_email {
        admin@example.com
    }
    notification_email_from keepalived@example.com
    smtp_server localhost
    smtp_connect_timeout 30
}

# Define the health check before the instance that tracks it
vrrp_script check_nginx {
    script "/usr/local/bin/check_nginx.sh"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret123
    }
    virtual_ipaddress {
        192.168.1.100 dev eth0
    }
    track_script {
        check_nginx
    }
}

# /etc/keepalived/keepalived.conf on BACKUP
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    # ... rest same as master
}

# Nginx check script
cat > /usr/local/bin/check_nginx.sh <<'EOF'
#!/bin/bash
# Exit non-zero if nginx is not running so keepalived lowers priority
if ! pgrep nginx > /dev/null; then
    exit 1
fi
exit 0
EOF
chmod +x /usr/local/bin/check_nginx.sh

# Start keepalived
systemctl enable keepalived
systemctl start keepalived
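
Before trusting keepalived with a real failover, verify that the VIP actually moves. A quick manual test, assuming the eth0 interface and VIP from the configuration above:

Terminal window
# On the MASTER: confirm it currently holds the VIP
ip addr show dev eth0 | grep 192.168.1.100

# Simulate a service failure and watch the transition
systemctl stop nginx
journalctl -u keepalived -f    # expect the MASTER to step down

# On the BACKUP: the VIP should now be present
ip addr show dev eth0 | grep 192.168.1.100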

Terminal window
#!/bin/bash
#===============================================================================
# Manual Failover Script
# Run this on the DR site when the primary fails
#===============================================================================
set -euo pipefail

# Configuration
PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_USER="appuser"
APP_DIR="/opt/application"
DB_NAME="myapp"
DB_USER="dbuser"

log() {
    echo "[$(date)] $1"
}

# Step 1: Verify primary is unreachable
check_primary_down() {
    log "Checking if primary is down..."
    if ping -c 3 -W 2 "$PRIMARY_IP" &>/dev/null; then
        log "WARNING: Primary is still reachable!"
        read -p "Continue anyway? (yes/no): " confirm
        if [ "$confirm" != "yes" ]; then
            exit 1
        fi
    fi
    log "Primary confirmed down"
}

# Step 2: Stop replication
stop_replication() {
    log "Stopping replication..."
    mysql -u root -e "STOP SLAVE;"
    mysql -u root -e "RESET SLAVE ALL;"
}

# Step 3: Promote DR server
promote_server() {
    log "Promoting DR server..."
    # Make the database writable as the new master
    mysql -u root -e "RESET MASTER;"
    # Start application services
    systemctl start nginx
    systemctl start myapp
    # Configure the VIP on the DR node
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Update DNS
update_dns() {
    log "Updating DNS..."
    nsupdate -k /etc/bind/ddns.key <<EOF
server ns1.example.com
update delete app.example.com A
update add app.example.com 300 A $DR_IP
send
EOF
}

# Main
main() {
    log "Starting manual failover to DR site"
    check_primary_down
    stop_replication
    promote_server
    update_dns
    log "Failover completed successfully"
}
main "$@"

Terminal window
#!/bin/bash
#===============================================================================
# Failback Script
# Run this to return to the primary site after repairs
#===============================================================================
set -euo pipefail

PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_DIR="/opt/application"

log() {
    echo "[$(date)] $1"
}

# Step 1: Sync data from DR to Primary
sync_data() {
    log "Syncing data from DR to Primary..."
    # Sync database
    mysqldump -u root myapp | mysql -h "$PRIMARY_IP" myapp
    # Sync files
    rsync -avz -e ssh "$DR_IP:$APP_DIR/" "$APP_DIR/"
}

# Step 2: Stop services on DR
stop_dr_services() {
    log "Stopping DR services..."
    ssh "$DR_IP" "systemctl stop nginx"
    ssh "$DR_IP" "systemctl stop myapp"
    ssh "$DR_IP" "ip addr del $VIP/24 dev eth0"
}

# Step 3: Start services on Primary
start_primary() {
    log "Starting Primary services..."
    systemctl start nginx
    systemctl start myapp
    ip addr add "$VIP/24" dev eth0
}

# Step 4: Verify
verify() {
    log "Verifying Primary..."
    if ping -c 3 "$VIP" &>/dev/null; then
        log "Failback completed successfully"
    else
        log "ERROR: Failback verification failed"
        exit 1
    fi
}

main() {
    log "Starting failback to Primary site"
    sync_data
    stop_dr_services
    start_primary
    verify
}
main "$@"
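
The ping in verify() only proves the VIP answers ICMP; an application-level smoke test gives more confidence before declaring failback complete. A hedged sketch, assuming the application serves HTTP on the VIP and exposes a hypothetical /health endpoint:

Terminal window
# Application-level smoke test after failback (endpoint is illustrative)
if curl -fsS --max-time 5 "http://192.168.1.100/health" > /dev/null; then
    echo "Smoke test passed: application is serving traffic"
else
    echo "Smoke test FAILED: investigate before closing the incident"
    exit 1
fi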

DR TESTING TIERS
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 1: TABLETOP                                     |  |
|  |                                                      |  |
|  | Frequency: Monthly                                   |  |
|  | Duration: 2-4 hours                                  |  |
|  |                                                      |  |
|  | - Walk through DR procedures                         |  |
|  | - Discuss scenarios                                  |  |
|  | - Identify gaps                                      |  |
|  | - No actual failover                                 |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 2: COMPONENT TEST                               |  |
|  |                                                      |  |
|  | Frequency: Quarterly                                 |  |
|  | Duration: 1-2 days                                   |  |
|  |                                                      |  |
|  | - Test individual components                         |  |
|  | - Verify backup restoration                          |  |
|  | - Test failover of single system                     |  |
|  | - Validate monitoring                                |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | TIER 3: FULL FAILOVER                                |  |
|  |                                                      |  |
|  | Frequency: Annually                                  |  |
|  | Duration: 1-2 days                                   |  |
|  |                                                      |  |
|  | - Complete failover to DR site                       |  |
|  | - Run production workload                            |  |
|  | - Test all systems                                   |  |
|  | - Test failback                                      |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+
DR TEST CHECKLIST
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | PRE-TEST                                             |  |
|  +------------------------------------------------------+  |
|  | □ Schedule maintenance window                        |  |
|  | □ Notify all stakeholders                            |  |
|  | □ Verify backups are current                         |  |
|  | □ Document baseline performance                      |  |
|  | □ Prepare test scripts                               |  |
|  | □ Verify DR site is ready                            |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | DURING TEST                                          |  |
|  +------------------------------------------------------+  |
|  | □ Execute failover procedure                         |  |
|  | □ Verify all services start                          |  |
|  | □ Test application functionality                     |  |
|  | □ Test user access                                   |  |
|  | □ Verify data integrity                              |  |
|  | □ Document any issues                                |  |
|  | □ Time each step                                     |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | POST-TEST                                            |  |
|  +------------------------------------------------------+  |
|  | □ Execute failback procedure                         |  |
|  | □ Verify primary is restored                         |  |
|  | □ Compare performance to baseline                    |  |
|  | □ Document lessons learned                           |  |
|  | □ Update DR plan if needed                           |  |
|  | □ Send summary to stakeholders                       |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+
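The "Time each step" item is easy to automate. A minimal sketch of a wrapper that records how long each runbook step takes, assuming the failover functions from the script above are sourced into the same shell:

Terminal window
# Wrap each runbook step and log its duration for the test report
timed_step() {
    local name="$1"; shift
    local start
    start=$(date +%s)
    "$@"    # run the actual step
    echo "$name took $(( $(date +%s) - start ))s" >> /root/dr-test-timings.log
}

# Example usage:
# timed_step "stop_replication" stop_replication
# timed_step "promote_server"   promote_server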

DISASTER RECOVERY BEST PRACTICES
+------------------------------------------------------------+
|                                                            |
|  +------------------------------------------------------+  |
|  | PLANNING                                             |  |
|  +------------------------------------------------------+  |
|  | □ Define RTO and RPO for each system                 |  |
|  | □ Document all dependencies                          |  |
|  | □ Maintain up-to-date system inventory               |  |
|  | □ Regular DR plan reviews (quarterly)                |  |
|  | □ Document contact information                       |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | INFRASTRUCTURE                                       |  |
|  +------------------------------------------------------+  |
|  | □ Geographic separation of DR site                   |  |
|  | □ Redundant network connectivity                     |  |
|  | □ Regular testing of DR site                         |  |
|  | □ Monitor replication status                         |  |
|  | □ Keep DR site powered and cooled                    |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | PROCEDURES                                           |  |
|  +------------------------------------------------------+  |
|  | □ Document clear runbooks                            |  |
|  | □ Automate where possible                            |  |
|  | □ Regular testing (monthly tabletop)                 |  |
|  | □ Annual full failover test                          |  |
|  | □ Clear escalation procedures                        |  |
|  +------------------------------------------------------+  |
|                                                            |
|  +------------------------------------------------------+  |
|  | COMMUNICATION                                        |  |
|  +------------------------------------------------------+  |
|  | □ Stakeholder notification procedures                |  |
|  | □ Status update templates                            |  |
|  | □ Post-incident communication                        |  |
|  | □ Public relations plan (if needed)                  |  |
|  +------------------------------------------------------+  |
|                                                            |
+------------------------------------------------------------+

Disaster recovery planning is essential for business continuity:

Disaster Recovery in DevOps/SRE
+------------------------------------------------------------------+
| |
| RTO/RPO Tiers: |
| +----------------------------------------------------------+ |
| | Tier 1 -> RTO < 4h, RPO < 15min (Mission Critical) | |
| | Tier 2 -> RTO < 24h, RPO < 1h (Business Critical) | |
| | Tier 3 -> RTO < 72h, RPO < 24h (Standard) | |
| +----------------------------------------------------------+ |
| |
| Multi-Region Strategies: |
| +----------------------------------------------------------+ |
| | Active-Active -> Both regions serve traffic | |
| | Active-Passive -> Standby until failover | |
| | Pilot Light -> Minimal version of core services | |
| +----------------------------------------------------------+ |
| |
| SRE Integration: |
| +----------------------------------------------------------+ |
| | Error budgets -> Don't over-invest in DR | |
| | Chaos engineering -> Test DR regularly | |
| | Post-Mortem -> Learn from failures | |
| +----------------------------------------------------------+ |
| |
+------------------------------------------------------------------+

Practical Impact:

  • Meet business continuity requirements
  • Comply with regulations
  • Minimize financial impact of disasters

Terminal window
# WRONG: DR plan never tested
# Assumes everything will work
# First test during actual disaster = failure

# CORRECT: Regular DR tests
# Monthly tabletop exercises
# Quarterly failover tests
# Annual full DR drill
Terminal window
# WRONG: All systems in one region
# Regional outage = complete outage

# CORRECT: Multi-region architecture
# Primary/secondary regions
# Cross-region replication
Terminal window
# WRONG: Treating all systems equally
# Over-investing in non-critical systems
# Under-investing in critical systems

# CORRECT: Tier-based DR
# Critical systems first
# Cost-effective solutions

Review Questions

  1. What is the difference between RTO and RPO?
  2. Explain different DR strategies.
  3. What is a disaster recovery runbook?
  4. How often should you test DR plans?
  5. What is chaos engineering?


End of Chapter 46: Disaster Recovery Planning