
Backup and Disaster Recovery

Backup and disaster recovery are critical for any system administrator. This chapter covers various backup strategies, tools, disaster recovery planning, and testing procedures essential for production environments.


Backup and disaster recovery are fundamental to operational reliability and business continuity. As a DevOps engineer or SRE, you’ll be responsible for:

  • Data Protection: Preventing data loss from hardware failures, human error, or cyberattacks
  • Recovery Time Objectives (RTO): Minimizing downtime during disasters
  • Recovery Point Objectives (RPO): Minimizing data loss between backups
  • Compliance: Meeting regulatory requirements for data retention
  • On-Call Responsibilities: Responding to backup failures and recovery events

Poor backup practices lead to data loss, extended outages, and potential compliance violations. Understanding backup strategies is essential for any Linux system administrator.


Backup Types
+----------------------------------------------------------------+
|                        +--------------+                        |
|                        | Backup Types |                        |
|                        +--------------+                        |
|                               |                                |
|          +--------------------+-------------------+            |
|          |                    |                   |            |
|          v                    v                   v            |
|     +---------+        +-----------+       +------------+      |
|     |  Full   |        |Incremental|       |Differential|      |
|     +---------+        +-----------+       +------------+      |
|          |                    |                   |            |
|          v                    v                   v            |
|     +---------+        +-----------+       +------------+      |
|     |All data,|        |Changes    |       |Changes     |      |
|     |complete |        |since last |       |since last  |      |
|     |copy     |        |backup     |       |full backup |      |
|     +---------+        +-----------+       +------------+      |
+----------------------------------------------------------------+
# Full Backup
# - Complete copy of all data
# - Largest size, longest time
# - Easiest to restore
# Incremental Backup
# - Only changes since last backup
# - Smallest size, fastest
# - Slowest to restore (need all backups)
# Differential Backup
# - Changes since last full backup
# - Medium size
# - Faster restore than incremental
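The full-vs-incremental tradeoff can be seen directly with GNU tar's `--listed-incremental` snapshot file. A minimal sketch in temporary directories (all paths here are illustrative):

```shell
# Sketch: full vs. incremental backup with GNU tar's snapshot file.
set -eu
SRC=$(mktemp -d); DEST=$(mktemp -d)
echo "one" > "$SRC/a.txt"

# Level-0 (full) backup: tar records file state in the snapshot file
tar -czf "$DEST/full.tar.gz" --listed-incremental="$DEST/state.snar" -C "$SRC" .

# New data arrives after the full backup
echo "two" > "$SRC/b.txt"

# Level-1 (incremental): copy the snapshot so the level-0 state is preserved,
# then back up only what changed since the full backup
cp "$DEST/state.snar" "$DEST/state.1.snar"
tar -czf "$DEST/incr.tar.gz" --listed-incremental="$DEST/state.1.snar" -C "$SRC" .

# The incremental archive contains only the new file (plus directory metadata)
tar -tzf "$DEST/incr.tar.gz"
```

Restoring means extracting the full archive first, then each incremental in order, which is exactly why incremental chains are slower to restore.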
The 3-2-1 Rule
# 3 copies of data
# 2 different storage types
# 1 offsite copy

rsync
# Local backup
rsync -av /source/ /destination/
rsync -avz /source/ user@server:/path/ # Over network
# Options explained
-a # Archive mode (preserves permissions, timestamps, etc.)
-v # Verbose
-z # Compress during transfer
-h # Human readable
-P # Show progress, resume partial
--delete # Delete files not in source
--exclude='*.log' # Exclude patterns
--dry-run # Test without making changes
# Backup home directory
rsync -avPh --delete /home/user/ /backup/user/
# Backup with exclude
rsync -av --exclude='node_modules/' --exclude='.git/' /app/ /backup/app/
# Remote backup over SSH
rsync -avz -e ssh /local/ user@remote:/backup/
# Preserve hard links
rsync -avH /source/ /destination/
# Bandwidth limit (KB/s)
rsync -avz --bwlimit=1000 /source/ /destination/
backup_rsync.sh
#!/bin/bash
SOURCE="/home"
DEST="/backup"
LOGFILE="/var/log/backup.log"
EMAIL="admin@example.com"
log() {
    echo "[$(date)] $*" | tee -a "$LOGFILE"
}

log "Starting backup from $SOURCE to $DEST"

# Create destination if it does not exist
mkdir -p "$DEST"

# Run rsync
if rsync -avH --delete \
    --exclude='.cache' \
    --exclude='.local/share/Trash' \
    "$SOURCE/" "$DEST/"; then
    log "Backup completed successfully"
    # Report backup size
    du -sh "$DEST"
else
    log "Backup FAILED"
    echo "Backup failed" | mail -s "Backup Failed" "$EMAIL"
    exit 1
fi
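One failure mode the script above does not guard against is overlapping runs, e.g. cron firing again while a long rsync is still going. A minimal sketch using flock(1), with a temporary lock file standing in for something like /var/lock/backup.lock:

```shell
# Skip the run if another instance already holds the lock.
# The lock path here is a temporary stand-in for a fixed path
# such as /var/lock/backup.lock.
set -eu
LOCKFILE=$(mktemp)
(
    flock -n 9 || { echo "Another backup is already running"; exit 1; }
    echo "backup would run here"   # the rsync invocation goes here
) 9>"$LOCKFILE"
```

With `-n`, flock fails immediately instead of queueing, so a slow backup never piles up concurrent copies of itself.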

tar
# Create compressed archive
tar -czvf backup.tar.gz /path/to/dir
tar -cjvf backup.tar.bz2 /path/to/dir   # bzip2
tar -cJvf backup.tar.xz /path/to/dir    # xz
# Create archive with date
tar -czvf "backup_$(date +%Y%m%d).tar.gz" /path/to/dir
# Exclude patterns
tar -czvf backup.tar.gz \
--exclude='*.log' \
--exclude='node_modules' \
/path/to/dir
# Backup and split into parts
tar -czvf - /path | split -b 1000M - backup_part_
# Backup with preserve attributes
tar -cpzvf backup.tar.gz /path
# c = create, p = preserve, z = gzip, v = verbose
# Extract
tar -xzvf backup.tar.gz
tar -xjvf backup.tar.bz2
# Extract to specific directory
tar -xzvf backup.tar.gz -C /destination/
# List contents
tar -tzvf backup.tar.gz
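Split archives are only as good as their reassembly. A self-contained round-trip check under temporary directories (the part size is illustrative):

```shell
# Create an archive, split it, then reassemble and list to verify integrity.
set -eu
SRC=$(mktemp -d); OUT=$(mktemp -d)
# Incompressible test data so the archive actually spans several parts
dd if=/dev/urandom of="$SRC/big.bin" bs=1024 count=256 2>/dev/null

# Stream the archive straight into split (20 KB parts, illustrative size)
tar -czf - -C "$SRC" . | split -b 20K - "$OUT/backup_part_"

# Reassemble in shell-glob order and list the contents
cat "$OUT"/backup_part_* | tar -tzf -
```

split's default alphabetical suffixes (`aa`, `ab`, ...) sort correctly under a plain glob, which is what makes the `cat` reassembly safe.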

restic
# Install restic
sudo pacman -S restic
# Initialize repository
restic init --repo /backup/repo
# Or with password file
restic init --repo /backup/repo --password-file /etc/restic_pw
# Backup
restic backup /home
restic backup /home --tag "daily"
# List backups
restic snapshots
# Restore
restic restore latest --target /restore
# Check integrity
restic check
# Apply retention policy (add --prune to actually reclaim space)
restic forget --prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
# Repository backends (the -r URL combines with any restic command)
# S3 backend
restic -r s3:https://s3.amazonaws.com/bucket/repo init
# Backblaze B2
restic -r b2:bucketname:reponame init
# SFTP
restic -r sftp:user@host:/path/to/repo init
# REST server
restic -r rest:https://backup.example.com:8000/repo init

duplicity
# Install duplicity
sudo pacman -S duplicity
# Backup to local file
duplicity /home file:///backup
# Backup to remote (SFTP)
duplicity /home sftp://user@host//backup
# Backup to AWS S3
duplicity /home s3://bucketname/prefix
# Force a full backup (duplicity otherwise runs incrementals after the first full)
duplicity full /home sftp://user@host//backup
# Incremental backup
duplicity /home sftp://user@host//backup
# Restore
duplicity sftp://user@host//backup /restore
# List files in backup
duplicity list-current-files sftp://user@host//backup
# Verify backup
duplicity verify sftp://user@host//backup /home

PostgreSQL Backups
# pg_dump - single database
pg_dump -U postgres mydb > backup.sql
pg_dump -U postgres mydb | gzip > backup.sql.gz
# pg_dumpall - all databases
pg_dumpall -U postgres > all_databases.sql
# Custom format (for pg_restore)
pg_dump -U postgres -Fc mydb > backup.dump
# Parallel backup (large databases)
pg_dump -U postgres -Fd -j 4 mydb -f backup_dir
# Restore
psql -U postgres mydb < backup.sql
pg_restore -U postgres -d mydb backup.dump
# Point-in-time recovery (PITR)
# Requires WAL archiving enabled
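PITR only works if WAL archiving was enabled before the failure. A minimal postgresql.conf sketch; the /backup/wal directory is an example, and the copy-style archive_command follows the pattern shown in the PostgreSQL documentation:

```
# postgresql.conf — minimal WAL archiving for PITR (example paths)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'
# Recovery later uses: restore_command = 'cp /backup/wal/%f %p'
```

The `test ! -f` guard refuses to overwrite an already-archived segment, so a misconfigured restart cannot silently corrupt the WAL archive.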
MySQL Backups
# mysqldump - single database
mysqldump -u root -p mydb > backup.sql
mysqldump -u root -p mydb | gzip > backup.sql.gz
# All databases
mysqldump -u root -p --all-databases > all_databases.sql
# With options for production
mysqldump -u root -p \
--single-transaction \
--quick \
--lock-tables=false \
mydb > backup.sql
# Restore
mysql -u root -p mydb < backup.sql
# Binary log backup (for PITR)
mysqlbinlog mysql-bin.000001 > backup.sql

btrfs Snapshots

# Create a read-only snapshot (assumes / is a btrfs subvolume)
sudo btrfs subvolume snapshot -r / /.snapshots/$(date +%Y%m%d)
# List snapshots
btrfs subvolume list /
# Roll back by snapshotting the saved state back into place
# (the damaged subvolume must be renamed/deleted first, typically from a live system)
sudo btrfs subvolume snapshot /.snapshots/backup /restored_root
# Delete snapshot
sudo btrfs subvolume delete /.snapshots/backup
Timeshift
# Install timeshift
sudo pacman -S timeshift
# Create snapshot
sudo timeshift --create --comments "Before update"
# List snapshots
sudo timeshift --list
# Restore
sudo timeshift --restore
# Delete snapshot
sudo timeshift --delete --snapshot '2024-01-01_00-00'

Disaster Recovery Components
+------------------------------------------------------------------+
|    Objectives            Backups             Procedures          |
|   +-----------+        +---------+        +---------------+      |
|   |    RTO    |        |  Local  |        | Documentation |      |
|   | Recovery  |        | Backup  |        +---------------+      |
|   |   Time    |        +---------+        |    Testing    |      |
|   | Objective |        | Offsite |        +---------------+      |
|   +-----------+        | Backup  |        |  Automation   |      |
|   |    RPO    |        +---------+        +---------------+      |
|   | Recovery  |        |  Cloud  |                               |
|   |   Point   |        | Backup  |                               |
|   | Objective |        +---------+                               |
|   +-----------+                                                  |
+------------------------------------------------------------------+
# RTO - Recovery Time Objective
# Maximum acceptable downtime
# Example: 4 hours
# RPO - Recovery Point Objective
# Maximum acceptable data loss (time)
# Example: 24 hours (daily backups = max 24h loss)
# Disaster Recovery Plan
## Critical Systems
1. Web Application
2. Database
3. File Storage
## RTO/RPO
- Web Application: RTO 1 hour, RPO 1 hour
- Database: RTO 4 hours, RPO 24 hours
- File Storage: RTO 8 hours, RPO 24 hours
## Backup Locations
- Local: /backup/local
- Offsite: rsync to backup-server
- Cloud: AWS S3
## Recovery Procedures
### Web Application
1. Provision new server
2. Deploy from version control
3. Restore configuration
4. Update DNS
### Database
1. Provision new database server
2. Restore from latest backup
3. Verify data integrity
4. Update connection strings
### Testing Schedule
- Monthly: Full restore test
- Weekly: Backup verification

verify_backup.sh
#!/bin/bash
BACKUP_DIR="/backup"
EMAIL="admin@example.com"
MAX_AGE=25 # hours
# Check that a recent backup exists
last_backup=$(find "$BACKUP_DIR" -type f -mmin -$((MAX_AGE * 60)) | head -1)
if [ -z "$last_backup" ]; then
    echo "WARNING: No backup found in the last $MAX_AGE hours" | \
        mail -s "Backup Alert" "$EMAIL"
    exit 1
fi

# Verify the archive can be read (assumes gzip-compressed tar)
if tar -tzf "$last_backup" > /dev/null 2>&1; then
    echo "Backup verification successful"
else
    echo "WARNING: Backup is corrupted" | \
        mail -s "Backup Corrupted" "$EMAIL"
    exit 1
fi

# Check backup size (should be non-trivial)
size=$(stat -c%s "$last_backup" 2>/dev/null || stat -f%z "$last_backup")
if [ "$size" -lt 1000 ]; then
    echo "WARNING: Backup suspiciously small ($size bytes)" | \
        mail -s "Backup Size Alert" "$EMAIL"
fi
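The script above checks that the archive is readable; checksum manifests go further and verify file contents byte-for-byte after a restore. A small sketch with sha256sum (paths are temporary and illustrative):

```shell
# Record checksums at backup time, verify them at restore time.
set -eu
DATA=$(mktemp -d)
printf 'payload\n' > "$DATA/file1"
printf 'more\n'    > "$DATA/file2"

# At backup time: write a manifest alongside the data
( cd "$DATA" && sha256sum file1 file2 > MANIFEST.sha256 )

# At restore time: every file must match its recorded checksum
( cd "$DATA" && sha256sum -c MANIFEST.sha256 )
```

`sha256sum -c` exits non-zero on any mismatch, so the verification step slots directly into a cron script or CI job.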

# Install rsnapshot
sudo pacman -S rsnapshot

/etc/rsnapshot.conf
# Fields below MUST be separated by tabs, not spaces!
snapshot_root	/backup/
retain	daily	7
retain	weekly	4
retain	monthly	3
backup	/home/	localhost/
backup	/etc/	localhost/
backup	/var/lib/postgresql/	localhost/
# Exclude via an external script
backup_script	/usr/local/bin/backup_excludes.sh	excluded/

# Run backups (verify the config first with: rsnapshot configtest)
rsnapshot daily
rsnapshot weekly
rsnapshot monthly
/etc/cron.d/backup
# Daily at 2 AM (files in /etc/cron.d require a user field)
0 2 * * * root /usr/local/bin/backup.sh
# Weekly on Sunday at 3 AM
0 3 * * 0 root /usr/local/bin/backup_weekly.sh
# Monthly on the 1st at 4 AM
0 4 1 * * root /usr/local/bin/backup_monthly.sh

Recovery Steps
# 1. Assess the situation
# - What failed?
# - How much data lost?
# - What's the impact?
# 2. Start recovery
# - Provision new hardware/VM
# - Install OS
# - Restore configuration
# 3. Restore data
# - Mount backup storage
# - Extract backups
# - Verify integrity
# 4. Verify services
# - Start services
# - Check logs
# - Test functionality
# 5. Update documentation
# - Document what happened
# - Update DR plan if needed
# Boot from rescue media
# Check filesystem
fsck -y /dev/sda1
# Mount root
mount /dev/sda1 /mnt
mount -o bind /proc /mnt/proc
mount -o bind /dev /mnt/dev
mount -o bind /sys /mnt/sys
# Chroot
chroot /mnt /bin/bash
# Reinstall bootloader
grub-install /dev/sda
grub-mkconfig -o /boot/grub/grub.cfg
# Exit chroot
exit
umount -R /mnt
reboot

WRONG:

# Just set up backup cron and forget
0 2 * * * /usr/local/bin/backup.sh

CORRECT:

# Test restore monthly
0 2 * * * /usr/local/bin/backup.sh
# Add verification
0 3 * * 0 /usr/local/bin/test-restore.sh

Why: Backups are useless if they can’t be restored. Test regularly.


WRONG:

# Backup to same disk
rsync -av /data /backup/local/

CORRECT:

# 3-2-1 rule: 3 copies, 2 media types, 1 offsite
rsync -av /data /backup/local/
rsync -av /data backup-server:/remote/backup/
restic -r s3:https://s3.amazonaws.com/backup-bucket/backups backup /

Why: Local backups don’t protect against site failures, theft, or ransomware.


WRONG:

# Plain text backup
tar -cvf backup.tar /data

CORRECT:

# Encrypted backup with GPG
tar -cvf - /data | gpg -c -o backup.tar.gpg
# OR use restic (encryption by default)
restic -r s3:https://s3.amazonaws.com/backup-bucket/backups backup /

Why: Lost backup media could expose sensitive data.


WRONG:

# Copying database files directly
cp -r /var/lib/mysql /backup/

CORRECT:

# Use mysqldump for consistent backups
mysqldump -u root -p --single-transaction --all-databases > backup.sql
# OR use pg_dumpall for PostgreSQL
pg_dumpall -U postgres > backup.sql
# OR use an LVM snapshot for a consistent filesystem-level backup
lvcreate -s -n snap -L 10G /dev/vg00/lv_mysql

Why: Copying files while DB is running can result in corrupted/inconsistent backups.


WRONG:

# Overwriting the same destination on every run
cp -r /data /backup/daily/

CORRECT:

# Use rotation (daily/weekly/monthly)
#!/bin/bash
DAY=$(date +%w)   # 0 = Sunday
[ "$DAY" -eq 0 ] && FREQ=weekly || FREQ=daily
rsync -av /data "/backup/$FREQ/"
# Keep last 7 daily, 4 weekly, 12 monthly

Why: Without rotation, you can’t recover from earlier data corruption or meet retention requirements.
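A minimal sketch of the date-stamped-plus-prune approach, using temporary directories and an illustrative 7-day cutoff:

```shell
# Date-stamped archives plus find-based pruning.
set -eu
SRC=$(mktemp -d); BDIR=$(mktemp -d)
echo "data" > "$SRC/f.txt"

# Each run produces a uniquely named archive instead of overwriting
STAMP=$(date +%Y%m%d_%H%M%S)
tar -czf "$BDIR/backup_${STAMP}.tar.gz" -C "$SRC" .

# Prune archives older than 7 days (none here yet, so nothing is deleted)
find "$BDIR" -name 'backup_*.tar.gz' -mtime +7 -delete

ls "$BDIR"
```

The timestamp in the filename gives you point-in-time restore choices, and the `find -mtime +7 -delete` line enforces retention without any bookkeeping.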


In this chapter, you learned:

  • ✅ Backup strategies (full, incremental, differential)
  • ✅ 3-2-1 backup rule
  • ✅ rsync for efficient backups
  • ✅ tar for archive backups
  • ✅ Restic for modern encrypted backups
  • ✅ duplicity for encrypted remote backups
  • ✅ Database backups (PostgreSQL, MySQL)
  • ✅ System snapshots (btrfs, timeshift)
  • ✅ Disaster recovery planning
  • ✅ RTO/RPO concepts
  • ✅ Backup monitoring and verification
  • ✅ Recovery procedures

  1. Q: Explain the 3-2-1 backup rule.

    • A: The 3-2-1 rule means: 3 copies of data, on 2 different types of media, with 1 copy offsite. This ensures data protection against hardware failure, media loss, and site disasters.
  2. Q: What’s the difference between RTO and RPO?

    • A: RTO (Recovery Time Objective) is the maximum acceptable downtime - how long you can be down. RPO (Recovery Point Objective) is the maximum acceptable data loss - how much data you can afford to lose, determined by backup frequency.
  3. Q: Compare incremental vs differential backups.

    • A: Incremental backs up only changes since last backup (any level). Differential backs up changes since last full backup. Incremental is faster/smaller but harder to restore (need chain). Differential is simpler to restore but grows over time.
  4. Q: Your server crashed and you need to restore. Walk me through the process.

    • A: 1) Assess damage and determine what needs restoring, 2) Check backup availability and integrity, 3) Restore OS first if needed, 4) Restore data, 5) Verify integrity with checksums, 6) Bring services online gradually, 7) Document incident and update backups if needed.
  5. Q: How would you design a backup strategy for a 100GB MySQL database with 24-hour RPO?

    • A: Full daily dumps (overnight), binary logs enabled for point-in-time recovery, replicate to secondary site, test restores weekly, consider cloud storage for offsite copies.



Last Updated: February 2026