# Chapter 62: Backup and Disaster Recovery

## Overview

Backup and disaster recovery are critical for any system administrator. This chapter covers backup strategies and tools, disaster recovery planning, and the testing procedures essential for production environments.
## Why This Matters in DevOps/SRE

Backup and disaster recovery are fundamental to operational reliability and business continuity. As a DevOps engineer or SRE, you'll be responsible for:
- Data Protection: Preventing data loss from hardware failures, human error, or cyberattacks
- Recovery Time Objectives (RTO): Minimizing downtime during disasters
- Recovery Point Objectives (RPO): Minimizing data loss between backups
- Compliance: Meeting regulatory requirements for data retention
- On-Call Responsibilities: Responding to backup failures and recovery events
Poor backup practices lead to data loss, extended outages, and potential compliance violations. Understanding backup strategies is essential for any Linux system administrator.
## 62.1 Backup Strategies

### Types of Backups

```
                           Backup Types
+------------------------------------------------------------------+
|                                                                  |
|                    +------------------------+                    |
|                    |      Backup Types      |                    |
|                    +------------------------+                    |
|                                |                                 |
|          +---------------------+---------------------+           |
|          |                     |                     |           |
|          v                     v                     v           |
|     +---------+          +-----------+        +------------+     |
|     |  Full   |          |Incremental|        |Differential|     |
|     +---------+          +-----------+        +------------+     |
|          |                     |                     |           |
|          v                     v                     v           |
|     +---------+          +-----------+        +-----------+      |
|     |All Data |          |  Changes  |        |  Changes  |      |
|     |  Copy   |          |since last |        |since last |      |
|     +---------+          |  backup   |        |full backup|      |
|                          +-----------+        +-----------+      |
|                                                                  |
+------------------------------------------------------------------+
```
### Backup Types Explained

```bash
# Full Backup
# - Complete copy of all data
# - Largest size, longest backup time
# - Easiest to restore

# Incremental Backup
# - Only changes since the last backup (of any type)
# - Smallest size, fastest to create
# - Slowest to restore (requires the full backup plus every incremental)

# Differential Backup
# - Changes since the last full backup
# - Medium size
# - Faster to restore than incrementals (full backup + latest differential)
```

### The 3-2-1 Backup Rule

```bash
# 3 copies of your data
# 2 different storage types
# 1 offsite copy
```
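The core of an incremental backup is selecting only files modified since the previous run. A minimal sketch of that idea, using a marker file and `find -newer` (all paths here are temporary and illustrative):

```shell
#!/bin/bash
# Sketch: select files changed since the last backup marker -- the idea
# behind an incremental backup. Paths are temporary and illustrative.
set -eu
src=$(mktemp -d)
marker="$src/.last_backup"

touch "$marker"                       # pretend a full backup just finished
sleep 1                               # ensure a strictly newer mtime
echo "new data" > "$src/changed.txt"  # modified after the marker

# These are the files an incremental backup would copy
find "$src" -type f -newer "$marker" ! -name '.last_backup'
```

A real incremental tool (rsync, restic, `tar --listed-incremental`) tracks this state for you, but the selection principle is the same.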
## 62.2 rsync for Backups

### Basic rsync

```bash
# Local backup
rsync -av /source/ /destination/

# Over the network
rsync -avz /source/ user@server:/path/

# Options explained
-a                  # Archive mode (preserves permissions, timestamps, etc.)
-v                  # Verbose
-z                  # Compress during transfer
-h                  # Human-readable sizes
-P                  # Show progress, resume partial transfers
--delete            # Delete destination files no longer present in the source
--exclude='*.log'   # Exclude matching paths
--dry-run           # Show what would change without doing it
```
### rsync Examples

```bash
# Backup home directory
rsync -avPh --delete /home/user/ /backup/user/

# Backup with excludes
rsync -av --exclude='node_modules/' --exclude='.git/' /app/ /backup/app/

# Remote backup over SSH
rsync -avz -e ssh /local/ user@remote:/backup/

# Preserve hard links
rsync -avH /source/ /destination/

# Bandwidth limit (KB/s)
rsync -avz --bwlimit=1000 /source/ /destination/
```
### rsync Script

```bash
#!/bin/bash
SOURCE="/home"
DEST="/backup"
LOGFILE="/var/log/backup.log"
EMAIL="admin@example.com"

log() {
    echo "[$(date)] $*" | tee -a "$LOGFILE"
}

log "Starting backup from $SOURCE to $DEST"

# Create the destination if it does not exist
mkdir -p "$DEST"

# Run rsync
if rsync -avH --delete \
    --exclude='.cache' \
    --exclude='.local/share/Trash' \
    "$SOURCE/" "$DEST/"; then
    log "Backup completed successfully"
    # Report the backup size
    du -sh "$DEST"
else
    log "Backup FAILED"
    echo "Backup failed" | mail -s "Backup Failed" "$EMAIL"
fi
```
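One common hardening for cron-driven scripts like the one above is a lock to prevent overlapping runs (a slow backup can otherwise collide with the next scheduled one). A minimal sketch using `flock(1)`; the lock path here is illustrative, a real script would use a fixed location such as `/var/lock/backup.lock`:

```shell
#!/bin/bash
# Sketch: guard a backup script against overlapping runs with flock(1).
# The lock path is illustrative; a real script uses a fixed location.
set -eu
LOCK=$(mktemp)

exec 9> "$LOCK"
if flock -n 9; then
    echo "lock acquired, running backup"
    # ... the rsync invocation from the script above would go here ...
else
    echo "another backup is already running, exiting"
    exit 1
fi
```

The lock is released automatically when the script exits, so there is no stale-lock cleanup to write.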
## 62.3 tar for Backups

### Creating Backups with tar

```bash
# Create compressed archives
tar -czvf backup.tar.gz /path/to/dir    # gzip
tar -cjvf backup.tar.bz2 /path/to/dir   # bzip2
tar -cJvf backup.tar.xz /path/to/dir    # xz

# Create an archive named with the date
tar -czvf "backup_$(date +%Y%m%d).tar.gz" /path/to/dir

# Exclude patterns
tar -czvf backup.tar.gz \
    --exclude='*.log' \
    --exclude='node_modules' \
    /path/to/dir
```
### tar Examples

```bash
# Backup and split into parts
tar -czvf - /path | split -b 1000M - backup_part_

# Backup preserving permissions
tar -cpzvf backup.tar.gz /path
# c = create, p = preserve permissions, z = gzip, v = verbose, f = file

# Extract
tar -xzvf backup.tar.gz
tar -xjvf backup.tar.bz2

# Extract to a specific directory
tar -xzvf backup.tar.gz -C /destination/

# List contents
tar -tzvf backup.tar.gz
```
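If you split an archive into parts, verify that the parts actually reassemble into a byte-identical archive before relying on them. A self-contained sketch with tiny, illustrative sizes:

```shell
#!/bin/bash
# Sketch: prove that split archive parts reassemble byte-for-byte.
# Sizes are tiny for illustration; real backups would use larger parts.
set -eu
work=$(mktemp -d)
cd "$work"
mkdir data
head -c 100000 /dev/urandom > data/blob.bin

# Archive, keeping one whole copy while also splitting into 32 KB parts
tar -czf - data | tee backup.tar.gz | split -b 32k - backup_part_

# Reassemble (split names sort lexically, so the glob restores the order)
cat backup_part_* > reassembled.tar.gz
cmp backup.tar.gz reassembled.tar.gz && echo "parts reassemble cleanly"
```

The `tee` trick keeps a reference copy only for this demonstration; in practice you would compare checksums recorded at backup time instead.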
## 62.4 Restic - Modern Backup Tool

### Installing and Basic Usage

```bash
# Install restic
sudo pacman -S restic

# Initialize a repository
restic init --repo /backup/repo
# Or with a password file
restic init --repo /backup/repo --password-file /etc/restic_pw

# Point subsequent commands at the repository
export RESTIC_REPOSITORY=/backup/repo

# Backup
restic backup /home
restic backup /home --tag "daily"

# List snapshots
restic snapshots

# Restore
restic restore latest --target /restore

# Check integrity
restic check

# Remove old snapshots according to a retention policy
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6
```
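To build intuition for what a retention policy like `--keep-daily 7` keeps, here is a sketch that simulates it over made-up, dated snapshot names (restic itself groups snapshots per calendar day; the names below are purely illustrative):

```shell
#!/bin/bash
# Sketch: simulate what a keep-daily-7 policy retains over ten made-up
# daily snapshots. Names are illustrative, not real restic output.
set -eu
snapshots="2024-01-01 2024-01-02 2024-01-03 2024-01-04 2024-01-05
2024-01-06 2024-01-07 2024-01-08 2024-01-09 2024-01-10"

# Keep the 7 most recent; everything older would be forgotten
kept=$(printf '%s\n' $snapshots | sort -r | head -n 7)
echo "kept:"
echo "$kept"
```

Running `restic forget` with `--dry-run` first shows exactly which snapshots a policy would remove before anything is deleted.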
### Restic Remote Backends

```bash
# S3 backend
restic -r s3:https://s3.amazonaws.com/bucket/repo init

# Backblaze B2
restic -r b2:bucketname:reponame init

# SFTP
restic -r sftp:user@host:/path/to/repo init

# REST server
restic -r rest:https://backup.example.com:8000/repo init
```
## 62.5 duplicity - Encrypted Backups

```bash
# Install duplicity
sudo pacman -S duplicity

# Backup to a local directory
duplicity /home file:///backup

# Backup to remote (SFTP)
duplicity /home sftp://user@host//backup

# Backup to AWS S3
duplicity /home s3://bucketname/prefix

# Force a full backup (after the first full, incremental is the default)
duplicity full /home sftp://user@host//backup

# Incremental backup (the default)
duplicity /home sftp://user@host//backup

# Restore
duplicity sftp://user@host//backup /restore

# List files in the backup
duplicity list-current-files sftp://user@host//backup

# Verify the backup
duplicity verify sftp://user@host//backup /home
```
## 62.6 Database Backups

### PostgreSQL Backup

```bash
# pg_dump - single database
pg_dump -U postgres mydb > backup.sql
pg_dump -U postgres mydb | gzip > backup.sql.gz

# pg_dumpall - all databases
pg_dumpall -U postgres > all_databases.sql

# Custom format (for pg_restore)
pg_dump -U postgres -Fc mydb > backup.dump

# Parallel backup for large databases (requires directory format)
pg_dump -U postgres -Fd -j 4 mydb -f backup_dir

# Restore
psql -U postgres mydb < backup.sql
pg_restore -U postgres -d mydb backup.dump

# Point-in-time recovery (PITR)
# Requires WAL archiving to be enabled
```
### MySQL/MariaDB Backup

```bash
# mysqldump - single database
mysqldump -u root -p mydb > backup.sql
mysqldump -u root -p mydb | gzip > backup.sql.gz

# All databases
mysqldump -u root -p --all-databases > all_databases.sql

# With options for production (InnoDB)
mysqldump -u root -p \
    --single-transaction \
    --quick \
    --lock-tables=false \
    mydb > backup.sql

# Restore
mysql -u root -p mydb < backup.sql

# Binary log backup (for PITR)
mysqlbinlog mysql-bin.000001 > backup.sql
```
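When dumps are compressed, check the compressed stream's integrity before counting on it. A self-contained sketch; the stand-in file replaces a real dump so the example runs anywhere, but in practice the input would come from `mysqldump` or `pg_dump`:

```shell
#!/bin/bash
# Sketch: verify a compressed dump before trusting it. The stand-in file
# replaces a real mysqldump/pg_dump output for self-containment.
set -eu
work=$(mktemp -d)
printf 'CREATE TABLE t (id INT);\n' > "$work/dump.sql"  # stand-in dump

gzip -c "$work/dump.sql" > "$work/dump.sql.gz"

# gzip -t checks the compressed stream without extracting it
gzip -t "$work/dump.sql.gz" && echo "dump archive is intact"
```

This catches truncated uploads and disk-level corruption cheaply; a full test restore (covered below) is still the only complete verification.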
## 62.7 System Backup

### Btrfs Snapshots

```bash
# Create a btrfs snapshot of the root subvolume
sudo btrfs subvolume snapshot / /.snapshots/$(date +%Y%m%d)

# List snapshots
sudo btrfs subvolume list /

# Restore by snapshotting the saved state to a new subvolume
# (switching the default subvolume or fstab entry completes the rollback)
sudo btrfs subvolume snapshot /.snapshots/backup /restored

# Delete a snapshot
sudo btrfs subvolume delete /.snapshots/backup
```
### Timeshift (for Arch)

```bash
# Install timeshift
sudo pacman -S timeshift

# Create a snapshot
sudo timeshift --create --comments "Before update"

# List snapshots
sudo timeshift --list

# Restore
sudo timeshift --restore

# Delete a snapshot
sudo timeshift --delete --snapshot '2024-01-01_00-00'
```
## 62.8 Disaster Recovery Planning

### DR Strategy Components

```
                  Disaster Recovery Components
+------------------------------------------------------------------+
|                                                                  |
|    Recovery               Backups              Procedures        |
|   +---------+            +---------+          +----------+       |
|   |   RTO   |            |  Local  |          |   Doc    |       |
|   |Recovery |            |  Backup |          |          |       |
|   |  Time   |            +---------+          +----------+       |
|   |Objective|            | Offsite |          | Testing  |       |
|   +---------+            |  Backup |          |          |       |
|   |   RPO   |            +---------+          +----------+       |
|   |Recovery |            |  Cloud  |          |Automation|       |
|   |  Point  |            |  Backup |          |          |       |
|   |Objective|            +---------+          +----------+       |
|   +---------+                                                    |
|                                                                  |
+------------------------------------------------------------------+
```
### RTO and RPO

```bash
# RTO - Recovery Time Objective
# Maximum acceptable downtime
# Example: 4 hours

# RPO - Recovery Point Objective
# Maximum acceptable data loss, measured in time
# Example: 24 hours (daily backups = at most ~24h of lost data)
```
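The RPO arithmetic is worth making explicit: the worst case is a failure just before the next backup completes, so the exposure is the backup interval plus the backup's own duration. A small worked example (the figures are illustrative, matching the daily-backup example above):

```shell
#!/bin/bash
# Sketch: worst-case data loss for a given backup schedule.
# Figures are illustrative.
set -eu
backup_interval_hours=24   # daily backups
backup_duration_hours=1    # how long the backup itself takes

# Worst case: failure hits just before the next backup completes
echo "Worst-case data loss: $((backup_interval_hours + backup_duration_hours))h"
```

This is why a stated "24-hour RPO" usually requires backups somewhat more often than every 24 hours.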
### DR Document Template

```markdown
# Disaster Recovery Plan

## Critical Systems
1. Web Application
2. Database
3. File Storage

## RTO/RPO
- Web Application: RTO 1 hour, RPO 1 hour
- Database: RTO 4 hours, RPO 24 hours
- File Storage: RTO 8 hours, RPO 24 hours

## Backup Locations
- Local: /backup/local
- Offsite: rsync to backup-server
- Cloud: AWS S3

## Recovery Procedures

### Web Application
1. Provision new server
2. Deploy from version control
3. Restore configuration
4. Update DNS

### Database
1. Provision new database server
2. Restore from latest backup
3. Verify data integrity
4. Update connection strings

## Testing Schedule
- Monthly: Full restore test
- Weekly: Backup verification
```
## 62.9 Backup Monitoring

### Backup Verification Script

```bash
#!/bin/bash
BACKUP_DIR="/backup"
EMAIL="admin@example.com"
MAX_AGE=25  # hours

# Check that a recent backup exists
last_backup=$(find "$BACKUP_DIR" -type f -mmin -$((MAX_AGE * 60)) | head -1)

if [ -z "$last_backup" ]; then
    echo "WARNING: No backup found in the last $MAX_AGE hours" | \
        mail -s "Backup Alert" "$EMAIL"
    exit 1
fi

# Verify the backup can be read
if tar -tzf "$last_backup" > /dev/null 2>&1; then
    echo "Backup verification successful"
else
    echo "WARNING: Backup is corrupted" | \
        mail -s "Backup Corrupted" "$EMAIL"
    exit 1
fi

# Check the backup size (should be non-trivial)
size=$(stat -f%z "$last_backup" 2>/dev/null || stat -c%s "$last_backup")
if [ "$size" -lt 1000 ]; then
    echo "WARNING: Backup is suspiciously small" | \
        mail -s "Backup Size Alert" "$EMAIL"
fi
```
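The script above checks age, readability, and size; it cannot detect silent bit-level corruption inside individual files. A complementary sketch: record a checksum manifest at backup time and re-verify it later (paths are temporary and illustrative):

```shell
#!/bin/bash
# Sketch: checksum manifest recorded at backup time, re-verified later.
# Catches silent corruption that age/size checks miss. Paths are temporary.
set -eu
backup=$(mktemp -d)
echo "important data" > "$backup/file1"
echo "more data" > "$backup/file2"

# At backup time: record checksums of every file
( cd "$backup" && sha256sum file1 file2 > MANIFEST )

# At verification time: re-check each file against the manifest
( cd "$backup" && sha256sum -c MANIFEST )
```

Storing the manifest alongside (and ideally also apart from) the backup makes later verification and restore testing much cheaper.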
## 62.10 Complete Backup Solution

### rsnapshot Configuration

```bash
# Install rsnapshot
sudo pacman -S rsnapshot
```

Main configuration file, `/etc/rsnapshot.conf` (fields must be separated by tabs, not spaces!):

```
snapshot_root   /backup/

interval        daily   7
interval        weekly  4
interval        monthly 3

backup  /home/                  localhost/
backup  /etc/                   localhost/
backup  /var/lib/postgresql/    localhost/

# Exclude via an external script
backup_script   /usr/local/bin/backup_excludes.sh       excluded/
```

```bash
# Run backups (typically from cron)
rsnapshot daily
rsnapshot weekly
rsnapshot monthly
```
### Cron Jobs for Backups

```bash
# Daily at 2 AM
0 2 * * * /usr/local/bin/backup.sh

# Weekly on Sunday at 3 AM
0 3 * * 0 /usr/local/bin/backup_weekly.sh

# Monthly on the 1st at 4 AM
0 4 1 * * /usr/local/bin/backup_monthly.sh
```
## 62.11 Restoring from Disaster

### Recovery Procedures

```bash
# 1. Assess the situation
#    - What failed?
#    - How much data was lost?
#    - What is the impact?

# 2. Start recovery
#    - Provision new hardware/VM
#    - Install the OS
#    - Restore configuration

# 3. Restore data
#    - Mount the backup storage
#    - Extract backups
#    - Verify integrity

# 4. Verify services
#    - Start services
#    - Check logs
#    - Test functionality

# 5. Update documentation
#    - Document what happened
#    - Update the DR plan if needed
```
### Emergency Boot Recovery

```bash
# Boot from rescue media, then check the filesystem
fsck -y /dev/sda1

# Mount the root filesystem and bind the system directories
mount /dev/sda1 /mnt
mount -o bind /proc /mnt/proc
mount -o bind /dev /mnt/dev
mount -o bind /sys /mnt/sys

# Chroot into the installed system
chroot /mnt /bin/bash

# Reinstall the bootloader
grub-install /dev/sda
grub-mkconfig -o /boot/grub/grub.cfg

# Exit the chroot and reboot
exit
umount -R /mnt
reboot
```
## Common Mistakes & Anti-Patterns

### 1. Not Testing Backups

WRONG:

```bash
# Just set up a backup cron job and forget about it
0 2 * * * /usr/local/bin/backup.sh
```

CORRECT:

```bash
# Back up daily
0 2 * * * /usr/local/bin/backup.sh
# And verify with a weekly test restore
0 3 * * 0 /usr/local/bin/test-restore.sh
```

Why: Backups are useless if they can't be restored. Test regularly.
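A test-restore script can be very small. A minimal, self-contained sketch of the idea: extract the archive into a scratch directory and check a known file (the canary file and all paths here are illustrative):

```shell
#!/bin/bash
# Sketch: a minimal automated test-restore. Creates its own archive so
# the example is self-contained; the canary file is illustrative.
set -eu
work=$(mktemp -d)
mkdir "$work/data"
echo "canary" > "$work/data/canary.txt"
tar -czf "$work/backup.tar.gz" -C "$work" data

# The test restore: extract somewhere disposable and verify the canary
restore=$(mktemp -d)
tar -xzf "$work/backup.tar.gz" -C "$restore"
grep -q canary "$restore/data/canary.txt" && echo "restore test passed"
```

A production version would point at the latest real archive and alert on failure instead of printing.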
### 2. Keeping All Backups Locally

WRONG:

```bash
# Backup to the same machine
rsync -av /data /backup/local/
```

CORRECT:

```bash
# 3-2-1 rule: 3 copies, 2 media types, 1 offsite
rsync -av /data /backup/local/
rsync -av /data backup-server:/remote/backup/
restic -r s3:s3.amazonaws.com/backup-bucket/backups backup /data
```

Why: Local backups don't protect against site failures, theft, or ransomware.
### 3. Not Encrypting Sensitive Backups

WRONG:

```bash
# Plain, unencrypted backup
tar -cvf backup.tar /data
```

CORRECT:

```bash
# Encrypted backup with GPG
tar -cvf - /data | gpg -c -o backup.tar.gpg
# OR use restic (encrypted by default)
restic -r s3:s3.amazonaws.com/backup-bucket/backups backup /data
```

Why: Lost backup media could expose sensitive data.
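Whichever tool you use, verify the encrypt/decrypt round trip before an emergency forces you to. A sketch of that check; `openssl` stands in for GPG here only so the example runs non-interactively, and the inline passphrase is for illustration — a real setup would read it from a protected file or use public-key encryption:

```shell
#!/bin/bash
# Sketch: encrypt/decrypt round-trip check. openssl is a stand-in for
# GPG so the demo is non-interactive; the passphrase is illustrative.
set -eu
work=$(mktemp -d)
echo "secret payload" > "$work/data.txt"

# Encrypt with a passphrase (never hardcode real passphrases)
openssl enc -aes-256-cbc -pbkdf2 -pass pass:demo-pass \
    -in "$work/data.txt" -out "$work/data.txt.enc"

# Decrypt and confirm the round trip
openssl enc -d -aes-256-cbc -pbkdf2 -pass pass:demo-pass \
    -in "$work/data.txt.enc"
```

The principle is the same for GPG or restic: if you cannot decrypt today, the encrypted backup is already lost.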
### 4. Ignoring Database Consistency

WRONG:

```bash
# Copying database files directly while the server is running
cp -r /var/lib/mysql /backup/
```

CORRECT:

```bash
# Use mysqldump for a consistent backup
mysqldump -u root -p --all-databases > backup.sql
# OR pg_dumpall for PostgreSQL
pg_dumpall -U postgres > backup.sql
# OR an LVM snapshot for a consistent filesystem-level backup
lvcreate -s -n snap -L 10G /dev/vg00/lv_mysql
```

Why: Copying files while the database is running can produce corrupted or inconsistent backups.
### 5. No Backup Rotation

WRONG:

```bash
# Overwriting the same backup every time
cp -r /data /backup/daily/
```

CORRECT:

```bash
#!/bin/bash
# Rotate between daily and weekly targets (date +%u: 1=Monday .. 7=Sunday)
DAY=$(date +%u)
[ "$DAY" -eq 7 ] && FREQ=weekly || FREQ=daily
rsync -av /data "/backup/$FREQ/"
# Keep the last 7 daily, 4 weekly, 12 monthly backups
```

Why: Without rotation, you can't recover from earlier data corruption or meet retention requirements.
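The "keep the last N" part of rotation can be sketched with dated directories and simple pruning (names and the keep count are illustrative; tools like rsnapshot or restic's `forget` do this for you):

```shell
#!/bin/bash
# Sketch: dated backup directories pruned to the newest KEEP copies.
# Directory names and the keep count are illustrative.
set -eu
root=$(mktemp -d)
KEEP=7

# Simulate ten days of dated backups
for i in $(seq -w 1 10); do
    mkdir "$root/backup-2024-01-$i"
done

# Prune: sort oldest-first, delete everything except the newest $KEEP
ls -1 "$root" | sort | head -n -"$KEEP" | while read -r d; do
    rm -rf "${root:?}/$d"
done

ls -1 "$root" | wc -l   # 7 directories remain
```

Date-sortable names (`YYYY-MM-DD`) are what make this one-liner pruning possible.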
## Summary

In this chapter, you learned:
- ✅ Backup strategies (full, incremental, differential)
- ✅ 3-2-1 backup rule
- ✅ rsync for efficient backups
- ✅ tar for archive backups
- ✅ Restic for modern encrypted backups
- ✅ duplicity for encrypted remote backups
- ✅ Database backups (PostgreSQL, MySQL)
- ✅ System snapshots (btrfs, timeshift)
- ✅ Disaster recovery planning
- ✅ RTO/RPO concepts
- ✅ Backup monitoring and verification
- ✅ Recovery procedures
## Interview Questions

### Conceptual Questions

1. Q: Explain the 3-2-1 backup rule.

   A: Keep 3 copies of your data, on 2 different types of media, with 1 copy offsite. This protects against hardware failure, media loss, and site disasters.

2. Q: What's the difference between RTO and RPO?

   A: RTO (Recovery Time Objective) is the maximum acceptable downtime: how long you can afford to be down. RPO (Recovery Point Objective) is the maximum acceptable data loss: how much data you can afford to lose, which is determined by backup frequency.

3. Q: Compare incremental vs differential backups.

   A: An incremental backup captures only changes since the last backup of any type; a differential backup captures changes since the last full backup. Incrementals are faster and smaller but harder to restore (you need the full backup plus the whole chain); differentials are simpler to restore but grow over time.
### Scenario-Based Questions

1. Q: Your server crashed and you need to restore. Walk me through the process.

   A: 1) Assess the damage and determine what needs restoring, 2) check backup availability and integrity, 3) restore the OS first if needed, 4) restore data, 5) verify integrity with checksums, 6) bring services online gradually, 7) document the incident and update the backup process if needed.

2. Q: How would you design a backup strategy for a 100GB MySQL database with a 24-hour RPO?

   A: Full daily dumps (overnight), binary logging enabled for point-in-time recovery, replication to a secondary site, weekly test restores, and cloud storage for offsite copies.
## Next Chapter

Chapter 16: SSH Security and Hardening

Last Updated: February 2026