
Comprehensive Disaster Recovery Strategies for Linux Systems

Disaster recovery (DR) is the process of restoring IT infrastructure and operations after a catastrophic event. Unlike routine backup restores, which recover individual files or systems, DR addresses site-wide failures, and its goal is to minimize both business downtime and data loss.

DISASTER RECOVERY CONCEPTS
+------------------------------------------------------------------+
| |
| ┌─────────────────────────────────────────────────────────┐ │
| │ DISASTER TYPES │ │
| │ │ │
| │ ┌───────────────┐ ┌───────────────┐ │ │
| │ │ NATURAL │ │ HUMAN │ │ │
| │ ├───────────────┤ ├───────────────┤ │ │
| │ │ - Earthquake │ │ - Cyberattack │ │ │
| │ │ - Flood │ │ - Terrorism │ │ │
| │ │ - Fire │ │ - Sabotage │ │ │
| │ │ - Lightning │ │ - Human error │ │ │
| │ │ - Storm │ │ - War │ │ │
| │ └───────────────┘ └───────────────┘ │ │
| │ │ │
| │ ┌───────────────┐ ┌───────────────┐ │ │
| │ │ TECHNICAL │ │ FACILITY │ │ │
| │ ├───────────────┤ ├───────────────┤ │ │
| │ │ - Hardware │ │ - Power outage │ │ │
| │ │ failure │ │ - HVAC failure │ │ │
| │ │ - Ransomware │ │ - Building │ │ │
| │ │ - Software │ │ damage │ │ │
| │ │ corruption │ │ - Access │ │ │
| │ │ │ │ denial │ │ │
| │ └───────────────┘ └───────────────┘ │ │
| │ │ │
| └─────────────────────────────────────────────────────────┘ │
| |
+------------------------------------------------------------------+
DISASTER RECOVERY METRICS
+------------------------------------------------------------------+
| |
| RTO - RECOVERY TIME OBJECTIVE │
| ───────────────────────────────────── │
| │
| ┌──────────────────────────────────────────────────────────┐ │
| │ │ │
| │ Disaster ─────────────────────→ System Online │ │
│ │ │ │ │ │
│ │ │ │ │ │
│ │ └──────────── RTO ─────────────┘ │ │
| │ │ │
│ │ Maximum acceptable downtime │ │
| │ │ │
| └──────────────────────────────────────────────────────────┘ │
| |
| RPO - RECOVERY POINT OBJECTIVE │
| ───────────────────────────────────── │
| │
| ┌──────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Last Backup ─────────→ Disaster ──→ Recovery │ │
│ │ │ │ │ │
│ │ └────── RPO ─────────┘ │ │
│ │ │ │
│ │ Maximum acceptable data loss (time) │ │
│ │ │ │
| └──────────────────────────────────────────────────────────┘ │
| |
| TIER CLASSIFICATION │
| ──────────────────── │
| │
| Tier 0: No DR - RTO: Days, RPO: Days │
| Tier 1: Cold Site - RTO: 24h, RPO: Days │
| Tier 2: Warm Site - RTO: 4-24h, RPO: Hours │
| Tier 3: Hot Site - RTO: Minutes, RPO: Minutes |
| Tier 4: Active-Active - RTO: Zero, RPO: Zero │
| |
+------------------------------------------------------------------+
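RPO is easiest to reason about when it can actually be measured. A minimal sketch (the backup path and threshold are assumptions, not part of the original chapter) that checks whether the newest backup still falls inside the RPO window:

```shell
#!/bin/bash
# Hypothetical helper: is the given backup file still within the RPO window?
# Arguments: backup file path, RPO in seconds. Returns 0 if within RPO.
check_rpo() {
  local backup_file="$1" rpo_seconds="$2"
  local mtime now age
  mtime=$(stat -c %Y "$backup_file")   # last-modified time, epoch seconds (GNU stat)
  now=$(date +%s)
  age=$(( now - mtime ))
  echo "Backup age: ${age}s (RPO: ${rpo_seconds}s)"
  [ "$age" -le "$rpo_seconds" ]
}
```

Wiring this into monitoring turns RPO from a paper target into an alert when replication or backups silently stall.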

DR SITE ARCHITECTURES
+------------------------------------------------------------------+
| |
| COLD SITE |
| ───────── │
| |
| Primary Site Cold Site │
| ┌─────────┐ ┌─────────┐ │
| │ Active │ │ Empty │ │
│ │ Systems │ │ Building │ │
| └────┬────┘ └────┬────┘ │
| │ │ |
| │ Disaster │ │
| └───────────×──────────┘ |
| │ │
| ▼ │
| ┌─────────────────────┐ │
│ │ Deploy hardware │ │
│ │ Restore from backup│ │
│ │ Bring online │ │
│ └─────────────────────┘ |
| RTO: 24-72 hours |
| |
| WARM SITE │
| ───────── │
| |
| Primary Site Warm Site │
| ┌─────────┐ ┌─────────┐ │
| │ Active │ │ Partial │ │
| │ Systems │◄───────→│ Systems │ │
│ │ │ Replica │ │ │
| └────┬────┘ └────┬────┘ │
| │ │ |
| │ Disaster │ │
| └───────────×──────────┘ |
| │ │
| ▼ │
| ┌─────────────────────┐ │
│ │ Promote to primary │ │
| │ Restore recent data │ │
| └─────────────────────┘ │
| RTO: 1-4 hours |
| |
| HOT SITE │
| ─────── │
| │
| Primary Site Hot Site │
| ┌─────────┐ ┌─────────┐ │
│ │ Active │══Sync═══→│ Mirrored│ │
│ │ Systems │ │ Systems │ │
│ └────┬────┘ └────┬────┘ │
| │ │ |
| │ Disaster │ |
| └───────────×──────────┘ |
| │ │
| ▼ │
| ┌─────────────────────┐ │
│ │ Failover automatic │ │
| │ or one-click │ |
| └─────────────────────┘ |
| RTO: Minutes │
| |
+------------------------------------------------------------------+
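Cold and warm sites drift away from the primary between tests. A hedged sketch of a drift check, assuming each site exports a sorted package manifest (for example `dpkg-query -W -f '${Package}\n' | sort`); the function name is illustrative:

```shell
#!/bin/bash
# Hypothetical drift check: compare sorted package manifests from the
# primary and DR sites; report packages missing on the DR side.
drift_report() {
  local primary_list="$1" dr_list="$2"
  local missing
  # comm -23 prints lines only in the first (sorted) file
  missing=$(comm -23 "$primary_list" "$dr_list")
  if [ -n "$missing" ]; then
    echo "Missing on DR site:"
    echo "$missing"
    return 1
  fi
  echo "No drift detected"
}
```

The same pattern works for config file checksums or enabled systemd units; the point is that drift is checked mechanically, not remembered.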
# =============================================================================
# BLOCK-LEVEL REPLICATION (DRBD)
# =============================================================================
# Install DRBD
apt-get install drbd-utils
# Configure DRBD (/etc/drbd.d/r0.res)
resource r0 {
  on server1 {
    device    /dev/drbd0;
    disk      /dev/sda1;
    address   192.168.1.10:7788;
    meta-disk internal;
  }
  on server2 {
    device    /dev/drbd0;
    disk      /dev/sda1;
    address   192.168.1.20:7788;
    meta-disk internal;
  }
}
# Initialize DRBD
drbdadm create-md r0
drbdadm up r0
# Primary (on primary site)
drbdadm primary --force r0
# Verify status
cat /proc/drbd
# =============================================================================
# DATABASE REPLICATION
# =============================================================================
# MySQL/MariaDB Master-Slave Replication
# On Master (/etc/mysql/my.cnf):
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog-do-db = myapp
# On Slave (/etc/mysql/my.cnf):
[mysqld]
server-id = 2
relay-log = relay-bin
read-only = 1
# Commands on Master:
# (On MySQL 8+, run CREATE USER first, then GRANT without IDENTIFIED BY;
#  the combined form below works on MariaDB and MySQL 5.7.)
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
# Commands on Slave:
CHANGE MASTER TO
  MASTER_HOST='master_ip',
  MASTER_USER='repl',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;  -- use the file and position reported by SHOW MASTER STATUS on the master
START SLAVE;
SHOW SLAVE STATUS\G
# =============================================================================
# FILE-LEVEL REPLICATION (rsync + cron)
# =============================================================================
# Replication script
#!/bin/bash
# rsync-based file replication
SOURCE="/data"
DEST="backup@dr-site:/backup/data"
# Near-real-time sync: re-run rsync whenever inotify reports a change
inotifywait -mrq -e create,modify,delete,move "$SOURCE" | while read -r _event; do
  rsync -avz --delete -e ssh "$SOURCE/" "$DEST/"
done
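rsync reports what it transferred, but after a replication cycle it is worth verifying the replica independently. A sketch (function names are illustrative, and GNU find/sort/xargs are assumed) that hashes both trees deterministically and compares the results:

```shell
#!/bin/bash
# Illustrative integrity check: build one deterministic hash over all file
# contents (and relative paths) under a directory, then compare the two trees.
tree_hash() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) \
    | sha256sum | cut -d' ' -f1
}
verify_replica() {
  if [ "$(tree_hash "$1")" = "$(tree_hash "$2")" ]; then
    echo "Replica matches source"
  else
    echo "Replica differs from source"
    return 1
  fi
}
```

Because sha256sum's output includes the relative path, a renamed or missing file changes the tree hash just as a content change does.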

DISASTER RECOVERY PLAN STRUCTURE
+------------------------------------------------------------------+
| |
| ┌─────────────────────────────────────────────────────────┐ │
| │ DR PLAN SECTIONS │ │
| │ │ │
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
| │ │ 1. RISK │ │ 2. TEAM │ │ 3. DOCUMENT│ │ │
| │ │ ASSESSMENT│ │ ROLES │ │ INVENTORY │ │ │
| │ │ │ │ │ │ │ │ │
| │ │ Identify │ │ DR manager │ │ Systems │ │ │
| │ │ threats │ │ Technical │ │ Applications│ │ │
| │ │ │ │ support │ │ Data │ │ │
| │ │ Evaluate │ │ Communica- │ │ Dependen- │ │ │
| │ │ likelihood │ │ tions │ │ cies │ │ │
│ │ │ │ │ │ │ │ │ │
| │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
| │ │ │
| │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
| │ │ 4. BACKUP │ │ 5. FAILOVER│ │ 6. RETURN │ │ │
| │ │ STRATEGY │ │ PROCEDURES │ │ PROCEDURES │ │ │
| │ │ │ │ │ │ │ │ │
│ │ │ Replication│ │ Step-by-step│ │ Step-by-step│ │ │
│ │ │ Schedule │ │ failover │ │ failback │ │ │
| │ │ Retention │ │ Runbooks │ │ Validation │ │ │
| │ │ Testing │ │ Scripts │ │ Timeline │ │ │
| │ │ │ │ │ │ │ │ │
| │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
| │ │ │
| │ ┌─────────────────────────────────────────────────┐ │ │
| │ │ 7. TESTING & MAINTENANCE │ │ │
| │ ├─────────────────────────────────────────────────┤ │ │
| │ │ - Monthly tabletop exercises │ │ │
| │ │ - Quarterly partial failover tests │ │ │
| │ │ - Annual full failover tests │ │ │
| │ │ - Document lessons learned │ │ │
| │ │ - Update plan based on changes │ │ │
| │ └─────────────────────────────────────────────────┘ │ │
| │ │ │
| └─────────────────────────────────────────────────────────┘ │
| |
+------------------------------------------------------------------+
# =============================================================================
# DOCUMENTATION TEMPLATE
# =============================================================================
# Create DR plan document: /root/dr-plan.md
# Document structure:
# 1. Executive Summary
# 2. Scope and Objectives
# 3. Risk Assessment
# 4. Contact Information
# 5. System Inventory
# 6. Recovery Procedures
# 7. Runbooks
# 8. Testing Schedule
# Example system inventory format:
cat > /root/dr-inventory.csv <<'EOF'
System,Application,Owner,RTO,RPO,Critical,Dependencies
web01,nginx,admin@company.com,1h,15min,Yes,DB01
db01,mysql,admin@company.com,2h,1h,Yes,-
app01,nodejs,admin@company.com,1h,15min,Yes,DB01
EOF
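Once the inventory exists in this CSV form, scripts can consume it during an incident, for example to list only the critical systems and their recovery targets. A small sketch against the format above (the function name is illustrative):

```shell
#!/bin/bash
# Sketch: print critical systems from the DR inventory CSV.
# Columns: System,Application,Owner,RTO,RPO,Critical,Dependencies
critical_systems() {
  awk -F, 'NR > 1 && $6 == "Yes" {
    printf "%s: RTO=%s RPO=%s deps=%s\n", $1, $4, $5, $7
  }' "$1"
}
```

Example: `critical_systems /root/dr-inventory.csv` prints one line per critical system, which is enough to drive a recovery-order discussion in a tabletop exercise.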

# =============================================================================
# KEEPALIVED SETUP FOR HA
# =============================================================================
# Install keepalived
apt-get install keepalived
# /etc/keepalived/keepalived.conf on PRIMARY
global_defs {
  router_id LVS_DEVEL
  notification_email {
    admin@example.com
  }
  notification_email_from keepalived@example.com
  smtp_server localhost
  smtp_connect_timeout 30
}
# Define the check script before the vrrp_instance that tracks it
vrrp_script check_nginx {
  script "/usr/local/bin/check_nginx.sh"
  interval 2
  weight 2
}
vrrp_instance VI_1 {
  state MASTER
  interface eth0
  virtual_router_id 51
  priority 100
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass secret123
  }
  virtual_ipaddress {
    192.168.1.100 dev eth0
  }
  track_script {
    check_nginx
  }
}
# /etc/keepalived/keepalived.conf on BACKUP
vrrp_instance VI_1 {
  state BACKUP
  interface eth0
  virtual_router_id 51
  priority 90
  # ... rest same as master
}
# Nginx check script
cat > /usr/local/bin/check_nginx.sh <<'EOF'
#!/bin/bash
# Exit non-zero when nginx is not running so keepalived lowers priority
if ! pgrep nginx > /dev/null; then
  exit 1
fi
exit 0
EOF
chmod +x /usr/local/bin/check_nginx.sh
# Start keepalived
systemctl enable keepalived
systemctl start keepalived
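With keepalived running, the quickest sanity check is which node actually holds the VIP. A small sketch; the helper reads `ip -o addr show` output from stdin, so it can also be fed saved output (the function name is illustrative):

```shell
#!/bin/bash
# Illustrative check: does the address listing on stdin contain the VIP?
# Usage on a node:  ip -o addr show | holds_vip 192.168.1.100
holds_vip() {
  grep -qF " $1/"   # fixed-string match on "<space>VIP/" avoids partial matches
}
```

Run on both nodes, exactly one should report holding the VIP; if both (or neither) do, VRRP communication is broken.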
#!/bin/bash
#===============================================================================
# Manual Failover Script
# Run this on DR site when primary fails
#===============================================================================
set -euo pipefail
# Configuration
PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_USER="appuser"
APP_DIR="/opt/application"
DB_NAME="myapp"
DB_USER="dbuser"
log() {
  echo "[$(date)] $1"
}
# Step 1: Verify primary is unreachable
check_primary_down() {
  log "Checking if primary is down..."
  if ping -c 3 -W 2 "$PRIMARY_IP" &>/dev/null; then
    log "WARNING: Primary is still reachable!"
    read -r -p "Continue anyway? (yes/no): " confirm
    if [ "$confirm" != "yes" ]; then
      exit 1
    fi
  fi
  log "Primary confirmed down"
}
# Step 2: Stop replication
stop_replication() {
  log "Stopping replication..."
  # Database commands to stop replication
  mysql -u root -e "STOP SLAVE;"
  mysql -u root -e "RESET SLAVE ALL;"
}
# Step 3: Promote DR server
promote_server() {
  log "Promoting DR server..."
  # Make database writable
  mysql -u root -e "RESET MASTER;"
  # Start application services
  systemctl start nginx
  systemctl start myapp
  # Configure VIP on DR
  ip addr add "$VIP/24" dev eth0
}
# Step 4: Update DNS
update_dns() {
  log "Updating DNS..."
  # Update DNS records
  nsupdate -k /etc/bind/ddns.key <<EOF
server ns1.example.com
update delete app.example.com A
update add app.example.com 300 A $DR_IP
send
EOF
}
# Main
main() {
  log "Starting manual failover to DR site"
  check_primary_down
  stop_replication
  promote_server
  update_dns
  log "Failover completed successfully"
}
main "$@"
#!/bin/bash
#===============================================================================
# Failback Script
# Run this to return to primary site after repairs
#===============================================================================
set -euo pipefail
PRIMARY_IP="192.168.1.10"
DR_IP="192.168.1.20"
VIP="192.168.1.100"
APP_DIR="/opt/application"
log() {
  echo "[$(date)] $1"
}
# Step 1: Sync data from DR to Primary
sync_data() {
  log "Syncing data from DR to Primary..."
  # NOTE: quiesce writes on the DR site before the final sync to avoid losing data
  # Sync database
  mysqldump -u root myapp | mysql -h "$PRIMARY_IP" myapp
  # Sync files
  rsync -avz -e ssh "$DR_IP:$APP_DIR/" "$APP_DIR/"
}
# Step 2: Stop services on DR
stop_dr_services() {
  log "Stopping DR services..."
  ssh "$DR_IP" "systemctl stop nginx"
  ssh "$DR_IP" "systemctl stop myapp"
  ssh "$DR_IP" "ip addr del $VIP/24 dev eth0"
}
# Step 3: Start services on Primary
start_primary() {
  log "Starting Primary services..."
  systemctl start nginx
  systemctl start myapp
  ip addr add "$VIP/24" dev eth0
}
# Step 4: Verify
verify() {
  log "Verifying Primary..."
  if ping -c 3 "$VIP" &>/dev/null; then
    log "Failback completed successfully"
  else
    log "ERROR: Failback verification failed"
    exit 1
  fi
}
main() {
  log "Starting failback to Primary site"
  sync_data
  stop_dr_services
  start_primary
  verify
}
main "$@"

DR TESTING TIERS
+------------------------------------------------------------------+
| |
| ┌─────────────────────────────────────────────────────────┐ │
| │ TIER 1: TABLETOP │ │
| │ │ │
| │ Frequency: Monthly │ │
| │ Duration: 2-4 hours │ │
| │ │ │
| │ - Walk through DR procedures │ │
| │ - Discuss scenarios │ │
| │ - Identify gaps │ │
| │ - No actual failover │ │
| │ │ │
| └─────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────┐ │
| │ TIER 2: COMPONENT TEST │ │
| │ │ │
| │ Frequency: Quarterly │ │
| │ Duration: 1-2 days │ │
| │ │ │
| │ - Test individual components │ │
| │ - Verify backup restoration │ │
| │ - Test failover of single system │ │
| │ - Validate monitoring │ │
| │ │ │
| └─────────────────────────────────────────────────────────┘ │
| │
| ┌─────────────────────────────────────────────────────────┐ │
| │ TIER 3: FULL FAILOVER │ │
| │ │ │
| │ Frequency: Annually │ │
| │ Duration: 1-2 days │ │
| │ │ │
| │ - Complete failover to DR site │ │
| │ - Run production workload │ │
| │ - Test all systems │ │
| │ - Test failback │ │
| │ │ │
| └─────────────────────────────────────────────────────────┘ │
| |
+------------------------------------------------------------------+
DR TEST CHECKLIST
+------------------------------------------------------------------+
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ PRE-TEST │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Schedule maintenance window │ │
| │ □ Notify all stakeholders │ │
| │ □ Verify backups are current │ │
| │ □ Document baseline performance │ │
| │ □ Prepare test scripts │ │
| │ □ Verify DR site is ready │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ DURING TEST │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Execute failover procedure │ │
| │ □ Verify all services start │ │
| │ □ Test application functionality │ │
| │ □ Test user access │ │
| │ □ Verify data integrity │ │
| │ □ Document any issues │ │
| │ □ Time each step │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ POST-TEST │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Execute failback procedure │ │
| │ □ Verify primary is restored │ │
| │ □ Compare performance to baseline │ │
| │ □ Document lessons learned │ │
| │ □ Update DR plan if needed │ │
| │ □ Send summary to stakeholders │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
+------------------------------------------------------------------+

DISASTER RECOVERY BEST PRACTICES
+------------------------------------------------------------------+
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ PLANNING │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Define RTO and RPO for each system │ │
| │ □ Document all dependencies │ │
| │ □ Maintain up-to-date system inventory │ │
| │ □ Regular DR plan reviews (quarterly) │ │
| │ □ Document contact information │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ INFRASTRUCTURE │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Geographic separation of DR site │ │
| │ □ Redundant network connectivity │ │
| │ □ Regular testing of DR site │ │
| │ □ Monitor replication status │ │
| │ □ Keep DR site powered and cooled │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ PROCEDURES │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Document clear runbooks │ │
| │ □ Automate where possible │ │
| │ □ Regular testing (monthly tabletop) │ │
| │ □ Annual full failover test │ │
| │ □ Clear escalation procedures │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
| ┌─────────────────────────────────────────────────────────────┐ │
| │ COMMUNICATION │ │
| ├─────────────────────────────────────────────────────────────┤ │
| │ □ Stakeholder notification procedures │ │
| │ □ Status update templates │ │
| │ □ Post-incident communication │ │
| │ □ Public relations plan (if needed) │ │
| └─────────────────────────────────────────────────────────────┘ │
| |
+------------------------------------------------------------------+

End of Chapter 46: Disaster Recovery Planning