
Practical Exercises

This chapter contains hands-on exercises for practicing bash scripting skills needed for DevOps, SRE, and SysAdmin positions. Each exercise includes requirements, hints, and complete solutions with production-ready code.


Exercise 1: System Monitoring Script

Create a comprehensive system monitoring script that:

  1. Monitors CPU, Memory, Disk usage
  2. Checks critical services status
  3. Monitors network connectivity
  4. Generates alerts when thresholds exceeded
  5. Outputs formatted report
  6. Can run continuously or once
  7. Supports JSON output for integration with monitoring systems
  8. Tracks historical data
Hints:

  • Use top, free, df for metrics
  • Use systemctl for service checks
  • Use ping or curl for network
  • Use while loop for continuous mode
  • Use colors for terminal output
  • Use #!/usr/bin/env bash for portability
  • Use set -euo pipefail for safety
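Before reading the solution, it helps to see what the last hint actually buys. A quick sketch, run in throwaway child shells so the deliberate failures do not kill your session:

```shell
# Each child shell exercises one flag; the parent reports what happened.
bash -c 'set -u; echo "$UNSET_VAR"' 2>/dev/null || echo "-u: aborted on unset variable"
bash -c 'set -e; false; echo unreachable' || echo "-e: aborted on failed command"
bash -c 'set -o pipefail; false | cat' || echo "pipefail: pipeline failure propagated"
```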
Solution:

#!/usr/bin/env bash
#
# system-monitor.sh - Comprehensive system monitoring for production environments
#
# Usage: system-monitor.sh [OPTIONS]
# -c, --continuous Run continuously
# -i, --interval N Interval in seconds (default: 60)
# -j, --json Output in JSON format
# -h, --help Show this help
#
# Author: DevOps Engineer
# Version: 1.0.0
#
set -euo pipefail
# ============================================================================
# Configuration - Modify these values for your environment
# ============================================================================
# Thresholds (percent)
readonly THRESHOLD_CPU=80
readonly THRESHOLD_MEM=85
readonly THRESHOLD_DISK=90
readonly THRESHOLD_LOAD=2.0
# Directories
readonly STATE_DIR="/var/lib/system-monitor"
readonly LOG_DIR="/var/log"
# Services to monitor (add your critical services here)
readonly SERVICES=(
"nginx"
"docker"
"sshd"
"postgresql"
"redis"
)
# Network hosts to check
readonly NETWORK_HOSTS=(
"8.8.8.8"
"1.1.1.1"
)
# Email for alerts (set empty to disable)
readonly ALERT_EMAIL="ops@example.com"
# ============================================================================
# Colors and Formatting
# ============================================================================
# Terminal colors
readonly COLOR_RED='\033[0;31m'
readonly COLOR_GREEN='\033[0;32m'
readonly COLOR_YELLOW='\033[1;33m'
readonly COLOR_BLUE='\033[0;34m'
readonly COLOR_CYAN='\033[0;36m'
readonly COLOR_BOLD='\033[1m'
readonly COLOR_NC='\033[0m' # No Color
# ============================================================================
# Global Variables
# ============================================================================
OUTPUT_JSON=false
CONTINUOUS_MODE=false
INTERVAL=60
# ============================================================================
# Utility Functions
# ============================================================================
# Print colored output
print_color() {
local color="$1"
local message="$2"
echo -e "${color}${message}${COLOR_NC}"
}
# Log messages
log() {
local level="$1"
shift
local message="[$(date '+%Y-%m-%d %H:%M:%S')] [$level] $*"
echo "$message" | tee -a "${LOG_DIR}/system-monitor.log"
}
# Send alert email
send_alert() {
local subject="$1"
local body="$2"
if [[ -n "$ALERT_EMAIL" ]] && command -v mail &>/dev/null; then
echo "$body" | mail -s "[ALERT] $subject" "$ALERT_EMAIL"
fi
}
# Check if running as root
check_root() {
if [[ $EUID -ne 0 ]]; then
log "WARN" "Not running as root. Some checks may fail."
fi
}
# Initialize directories
init_directories() {
mkdir -p "$STATE_DIR" "$LOG_DIR"
chmod 755 "$STATE_DIR"
}
# ============================================================================
# Metric Collection Functions
# ============================================================================
# Get CPU usage percentage
get_cpu_usage() {
# Method 1: Using top (most compatible)
local cpu_idle
cpu_idle=$(top -bn2 -d 0.5 | grep "Cpu(s)" | tail -1 | awk '{print $8}' | cut -d'%' -f1)
if [[ -n "$cpu_idle" ]]; then
echo "100 - $cpu_idle" | bc -l
return
fi
# Method 2: Using /proc/stat (Linux specific)
if [[ -f /proc/stat ]]; then
local cpu_line
cpu_line=$(head -1 /proc/stat)
local cpu_label user nice system idle iowait irq softirq
# First field of /proc/stat is the literal label "cpu"; skip it,
# and swallow any trailing fields (steal, guest) with _
read -r cpu_label user nice system idle iowait irq softirq _ <<< "$cpu_line"
local total idle_total
total=$((user + nice + system + idle + iowait + irq + softirq))
idle_total=$((idle + iowait))
echo "scale=2; ($total - $idle_total) * 100 / $total" | bc
return
fi
echo "0"
}
# Get memory usage percentage
get_memory_usage() {
local mem_total mem_used mem_available
if command -v free &>/dev/null; then
mem_total=$(free -m | awk 'NR==2{print $2}')
mem_available=$(free -m | awk 'NR==2{print $7}')
mem_used=$((mem_total - mem_available))
echo "scale=2; $mem_used * 100 / $mem_total" | bc
else
echo "0"
fi
}
# Get detailed memory info
get_memory_info() {
free -h | awk 'NR==1{next} {print $1 " " $3 " used, " $4 " free of " $2 " total"}'
}
# Get disk usage for a path
get_disk_usage() {
local path="${1:-/}"
df -h "$path" | tail -1 | awk '{print $5}' | tr -d '%'
}
# Get disk info for all mounted filesystems
get_disk_info() {
df -h | grep -E '^/dev/' | awk '{printf "%s: %s used of %s (%s)\n", $1, $3, $2, $5}'
}
# Get load average
get_load_average() {
uptime | awk -F'load average:' '{print $2}' | awk '{print $1}' | sed 's/,//'
}
# Get number of CPU cores
get_cpu_cores() {
nproc 2>/dev/null || grep -c ^processor /proc/cpuinfo
}
# ============================================================================
# Service Checking Functions
# ============================================================================
# Check if a service is running
check_service() {
local service="$1"
if ! command -v systemctl &>/dev/null; then
# Fallback for systems without systemd
if pgrep -x "$service" &>/dev/null; then
echo "running"
return 0
else
echo "stopped"
return 1
fi
fi
if systemctl is-active --quiet "$service" 2>/dev/null; then
echo "running"
return 0
else
local failed_reason
failed_reason=$(systemctl status "$service" 2>&1 | grep -i "failed" | head -1)
echo "stopped: ${failed_reason:-unknown reason}"
return 1
fi
}
# Check all configured services
check_all_services() {
local failed=0
local services_json="["
# Human-readable status goes to stderr so that only the JSON reaches
# stdout when this function is called via command substitution
{
echo ""
print_color "$COLOR_BOLD" "=== Service Status ==="
echo ""
} >&2
for service in "${SERVICES[@]}"; do
local status
# check_service returns non-zero for stopped services; don't let
# that abort the script under set -e
status=$(check_service "$service" 2>&1) || true
if [[ "$status" == "running" ]]; then
print_color "$COLOR_GREEN" " ✓ $service: running" >&2
services_json+="{\"name\":\"$service\",\"status\":\"running\"},"
else
print_color "$COLOR_RED" " ✗ $service: $status" >&2
services_json+="{\"name\":\"$service\",\"status\":\"$status\"},"
failed=$((failed + 1))
fi
done
services_json="${services_json%,}]"
if [[ $failed -gt 0 ]]; then
send_alert "Service Down" "$failed service(s) are not running"
fi
echo "$services_json"
}
# ============================================================================
# Network Checking Functions
# ============================================================================
# Check network connectivity
check_network() {
local failed=0
local results_json="["
# Human-readable output to stderr; JSON to stdout (see check_all_services)
{
echo ""
print_color "$COLOR_BOLD" "=== Network Connectivity ==="
echo ""
} >&2
for host in "${NETWORK_HOSTS[@]}"; do
if ping -c 1 -W 2 "$host" &>/dev/null; then
print_color "$COLOR_GREEN" " ✓ $host: reachable" >&2
results_json+="{\"host\":\"$host\",\"status\":\"ok\"},"
else
print_color "$COLOR_RED" " ✗ $host: unreachable" >&2
results_json+="{\"host\":\"$host\",\"status\":\"unreachable\"},"
failed=$((failed + 1))
fi
done
# Check DNS resolution
if host google.com &>/dev/null; then
print_color "$COLOR_GREEN" " ✓ DNS: working" >&2
results_json+="{\"host\":\"dns\",\"status\":\"ok\"},"
else
print_color "$COLOR_RED" " ✗ DNS: not working" >&2
results_json+="{\"host\":\"dns\",\"status\":\"failed\"},"
failed=$((failed + 1))
fi
results_json="${results_json%,}]"
if [[ $failed -gt 0 ]]; then
send_alert "Network Issue" "$failed network check(s) failed"
fi
echo "$results_json"
}
# ============================================================================
# Process Information Functions
# ============================================================================
# Get top processes by CPU
get_top_cpu_processes() {
echo ""
print_color "$COLOR_BOLD" "=== Top 5 Processes by CPU ==="
echo ""
ps aux --sort=-%cpu | head -6 | tail -5 | \
awk '{printf " %-20s %5s%% CPU\n", $11, $3}'
}
# Get top processes by memory
get_top_mem_processes() {
echo ""
print_color "$COLOR_BOLD" "=== Top 5 Processes by Memory ==="
echo ""
ps aux --sort=-%mem | head -6 | tail -5 | \
awk '{printf " %-20s %5s%% MEM\n", $11, $4}'
}
# Get process count
get_process_count() {
local total running sleeping stopped zombie
total=$(ps aux | wc -l)
# grep -c prints 0 but exits 1 on zero matches, which would trip set -e
running=$(ps -eo state= | grep -c R || true)
sleeping=$(ps -eo state= | grep -c S || true)
stopped=$(ps -eo state= | grep -c T || true)
zombie=$(ps -eo state= | grep -c Z || true)
echo "Total: $total | Running: $running | Sleeping: $sleeping | Stopped: $stopped | Zombie: $zombie"
}
# ============================================================================
# System Information Functions
# ============================================================================
# Get system uptime
get_uptime() {
uptime -p 2>/dev/null || uptime | awk '{print $3,$4}' | tr -d ','
}
# Get logged in users
get_logged_users() {
who | awk '{print $1, $2, $3}' | sort -u
}
# ============================================================================
# Alert Functions
# ============================================================================
# Check thresholds and alert
check_thresholds() {
local cpu_usage="$1"
local mem_usage="$2"
local disk_usage="$3"
local load_avg="$4"
local cores="$5"
local alerts=()
# CPU check
if (( $(echo "$cpu_usage > $THRESHOLD_CPU" | bc -l) )); then
alerts+=("CPU usage is ${cpu_usage}% (threshold: ${THRESHOLD_CPU}%)")
fi
# Memory check
if (( $(echo "$mem_usage > $THRESHOLD_MEM" | bc -l) )); then
alerts+=("Memory usage is ${mem_usage}% (threshold: ${THRESHOLD_MEM}%)")
fi
# Disk check
if [[ "$disk_usage" -gt "$THRESHOLD_DISK" ]]; then
alerts+=("Disk usage is ${disk_usage}% (threshold: ${THRESHOLD_DISK}%)")
fi
# Load check (load > cores * threshold)
local max_load
max_load=$(echo "$cores * $THRESHOLD_LOAD" | bc)
if (( $(echo "$load_avg > $max_load" | bc -l) )); then
alerts+=("Load average is ${load_avg} (threshold: ${max_load} for ${cores} cores)")
fi
# Send alert if any threshold exceeded
if [[ ${#alerts[@]} -gt 0 ]]; then
local alert_message
alert_message=$(printf "%s\n" "${alerts[@]}")
send_alert "System Threshold Exceeded" "$alert_message"
fi
# Return alerts as JSON array
local alerts_json="["
# Guard the loop: expanding an empty array trips set -u on bash < 4.4
if [[ ${#alerts[@]} -gt 0 ]]; then
for alert in "${alerts[@]}"; do
alerts_json+="\"$alert\","
done
fi
alerts_json="${alerts_json%,}]"
echo "$alerts_json"
}
# ============================================================================
# Output Functions
# ============================================================================
# Generate JSON output
generate_json_output() {
local cpu_usage="$1"
local mem_usage="$2"
local disk_usage="$3"
local load_avg="$4"
local cores="$5"
local uptime="$6"
local alerts="$7"
local services="$8"
local network="$9"
cat <<EOF
{
"timestamp": "$(date -Iseconds)",
"hostname": "$(hostname)",
"metrics": {
"cpu": {
"usage_percent": $cpu_usage,
"cores": $cores
},
"memory": {
"usage_percent": $mem_usage
},
"disk": {
"root_usage_percent": $disk_usage
},
"load_average": $load_avg,
"uptime": "$uptime"
},
"alerts": $alerts,
"services": $services,
"network": $network
}
EOF
}
# Print human-readable output
print_human_output() {
local cpu_usage="$1"
local mem_usage="$2"
local disk_usage="$3"
local load_avg="$4"
local cores="$5"
local uptime="$6"
echo ""
print_color "$COLOR_BOLD" "╔══════════════════════════════════════════════════════════╗"
print_color "$COLOR_BOLD" "║ System Monitor - $(date '+%Y-%m-%d %H:%M:%S') ║"
print_color "$COLOR_BOLD" "╚══════════════════════════════════════════════════════════╝"
echo ""
print_color "$COLOR_BOLD" "=== System Information ==="
echo " Hostname: $(hostname)"
echo " Uptime: $uptime"
echo " CPU Cores: $cores"
echo ""
print_color "$COLOR_BOLD" "=== Resource Usage ==="
# CPU
local cpu_color="$COLOR_GREEN"
if (( $(echo "$cpu_usage > $THRESHOLD_CPU" | bc -l) )); then
cpu_color="$COLOR_RED"
elif (( $(echo "$cpu_usage > $((THRESHOLD_CPU / 2))" | bc -l) )); then
cpu_color="$COLOR_YELLOW"
fi
printf " CPU Usage: ${cpu_color}%6.1f%%${COLOR_NC} (threshold: %d%%)\n" "$cpu_usage" "$THRESHOLD_CPU"
# Memory
local mem_color="$COLOR_GREEN"
if (( $(echo "$mem_usage > $THRESHOLD_MEM" | bc -l) )); then
mem_color="$COLOR_RED"
elif (( $(echo "$mem_usage > $((THRESHOLD_MEM / 2))" | bc -l) )); then
mem_color="$COLOR_YELLOW"
fi
printf " Memory Usage: ${mem_color}%6.1f%%${COLOR_NC} (threshold: %d%%)\n" "$mem_usage" "$THRESHOLD_MEM"
# Disk
local disk_color="$COLOR_GREEN"
if [[ "$disk_usage" -gt "$THRESHOLD_DISK" ]]; then
disk_color="$COLOR_RED"
elif [[ "$disk_usage" -gt $((THRESHOLD_DISK / 2)) ]]; then
disk_color="$COLOR_YELLOW"
fi
printf " Disk Usage: ${disk_color}%6s${COLOR_NC} (threshold: %d%%)\n" "${disk_usage}%" "$THRESHOLD_DISK"
# Load
local max_load
max_load=$(echo "$cores * $THRESHOLD_LOAD" | bc)
local load_color="$COLOR_GREEN"
if (( $(echo "$load_avg > $max_load" | bc -l) )); then
load_color="$COLOR_RED"
fi
printf " Load Average: ${load_color}%6s${COLOR_NC} (threshold: %.1f for %d cores)\n" "$load_avg" "$max_load" "$cores"
echo ""
echo " Memory Details:"
get_memory_info | while read -r line; do
echo " $line"
done
echo ""
echo " Disk Details:"
get_disk_info | while read -r line; do
echo " $line"
done
get_top_cpu_processes
get_top_mem_processes
echo ""
print_color "$COLOR_BOLD" "=== Process Information ==="
echo " $(get_process_count)"
echo ""
print_color "$COLOR_BOLD" "=== Logged In Users ==="
get_logged_users | while read -r user; do
echo " $user"
done
echo ""
}
# ============================================================================
# Main Functions
# ============================================================================
# Run all checks
run_checks() {
# Collect metrics
local cpu_usage
cpu_usage=$(get_cpu_usage)
local mem_usage
mem_usage=$(get_memory_usage)
local disk_usage
disk_usage=$(get_disk_usage)
local load_avg
load_avg=$(get_load_average)
local cores
cores=$(get_cpu_cores)
local uptime
uptime=$(get_uptime)
# Run checks
local services_json
services_json=$(check_all_services)
local network_json
network_json=$(check_network)
# Check thresholds and get alerts
local alerts_json
alerts_json=$(check_thresholds "$cpu_usage" "$mem_usage" "$disk_usage" "$load_avg" "$cores")
# Output
if [[ "$OUTPUT_JSON" == "true" ]]; then
generate_json_output "$cpu_usage" "$mem_usage" "$disk_usage" \
"$load_avg" "$cores" "$uptime" "$alerts_json" "$services_json" "$network_json"
else
print_human_output "$cpu_usage" "$mem_usage" "$disk_usage" \
"$load_avg" "$cores" "$uptime"
fi
}
# Parse command line arguments
parse_arguments() {
while [[ $# -gt 0 ]]; do
case $1 in
-c|--continuous)
CONTINUOUS_MODE=true
shift
;;
-i|--interval)
INTERVAL="$2"
shift 2
;;
-j|--json)
OUTPUT_JSON=true
shift
;;
-h|--help)
echo "Usage: $0 [OPTIONS]"
echo ""
echo "Options:"
echo " -c, --continuous Run continuously"
echo " -i, --interval N Interval in seconds (default: 60)"
echo " -j, --json Output in JSON format"
echo " -h, --help Show this help"
exit 0
;;
*)
echo "Unknown option: $1"
exit 1
;;
esac
done
}
# ============================================================================
# Main Entry Point
# ============================================================================
main() {
parse_arguments "$@"
check_root
init_directories
log "INFO" "Starting system monitor"
if [[ "$CONTINUOUS_MODE" == "true" ]]; then
log "INFO" "Running in continuous mode (interval: ${INTERVAL}s)"
while true; do
run_checks
sleep "$INTERVAL"
done
else
run_checks
fi
log "INFO" "System monitor completed"
}
main "$@"
# Make executable
chmod +x system-monitor.sh
# Run once (human-readable output)
./system-monitor.sh
# Run once (JSON output)
./system-monitor.sh --json
# Run continuously (every 60 seconds)
./system-monitor.sh --continuous
# Run continuously with custom interval
./system-monitor.sh --continuous --interval 30
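One caveat about the solution above: it leans on `bc` for floating-point threshold comparisons, and `bc` is absent on many minimal images. A sketch of an `awk`-based stand-in (the function name `cpu_over` is illustrative, not part of the script):

```shell
# Float threshold compare without bc: awk exits 0 when usage exceeds
# the threshold, so the function slots directly into an if statement.
cpu_over() {
  awk -v u="$1" -v t="$2" 'BEGIN { exit !(u > t) }'
}

if cpu_over 87.5 80; then
  echo "ALERT: CPU over threshold"
fi
if ! cpu_over 42.0 80; then
  echo "OK: CPU under threshold"
fi
```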

Exercise 2: Backup Automation

Create a backup script that:

  1. Backs up specified directories
  2. Creates timestamped archives
  3. Implements incremental backups using snapshots
  4. Verifies backup integrity with checksums
  5. Cleans up old backups based on retention policy
  6. Sends notifications on completion/failure
  7. Supports both local and remote backup (via SSH/rsync)
  8. Provides detailed logging
Hints:

  • Use tar for archiving
  • Use rsync for incremental
  • Use md5sum or sha256sum for verification
  • Use ssh with key-based authentication for remote backup
  • Use find with -mtime for cleanup
  • Use logger for system logging
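The incremental-backup hint maps to GNU tar's `--listed-incremental` snapshot mechanism, which the solution below builds on. A minimal scratch-directory demo of how the snapshot file drives a level-0 versus a level-1 archive (all paths are throwaway):

```shell
# Scratch demo of tar snapshot-based incrementals.
work=$(mktemp -d)
mkdir "$work/src" && echo one > "$work/src/a.txt"

# Level 0: a fresh snapshot file records everything archived.
tar --listed-incremental="$work/state.snar" -czf "$work/full.tar.gz" -C "$work" src

# Level 1: only members changed since the snapshot are re-archived.
echo two > "$work/src/b.txt"
tar --listed-incremental="$work/state.snar" -czf "$work/incr.tar.gz" -C "$work" src

tar -tzf "$work/incr.tar.gz"   # src/b.txt is included; unchanged a.txt is not
```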
Solution:

#!/usr/bin/env bash
#
# backup-manager.sh - Enterprise backup automation solution
#
# Features:
# - Full and incremental backups
# - Local and remote backup support
# - Integrity verification
# - Automatic cleanup
# - Email notifications
# - Comprehensive logging
#
# Usage:
# backup-manager.sh backup [source...] Create backup
# backup-manager.sh restore <backup> <path> Restore backup
# backup-manager.sh list List available backups
# backup-manager.sh verify <backup> Verify backup integrity
# backup-manager.sh cleanup Remove old backups
#
# Author: DevOps Engineer
# Version: 2.0.0
#
set -euo pipefail
# ============================================================================
# Configuration
# ============================================================================
# Directories
readonly BACKUP_ROOT="/backup"
readonly STATE_DIR="${BACKUP_ROOT}/.state"
readonly LOG_DIR="/var/log"
# Backup settings
readonly COMPRESSION="gzip" # gzip, bzip2, xz, or none
readonly BACKUP_USER="backup"
readonly RETENTION_DAYS=30
readonly INCREMENTAL_RETENTION_DAYS=7
# Remote backup settings (optional)
readonly REMOTE_HOST=""
readonly REMOTE_USER="backup"
readonly REMOTE_PATH="/backup"
# Email settings
readonly ALERT_EMAIL="ops@example.com"
readonly BACKUP_EMAIL="backup-reports@example.com"
# Default sources if not provided
DEFAULT_SOURCES=(
"/etc"
"/home"
"/var/www"
"/opt"
)
# ============================================================================
# Global Variables
# ============================================================================
DRY_RUN=false
VERBOSE=false
PARALLEL_JOBS=4
# ============================================================================
# Utility Functions
# ============================================================================
log() {
# Accepts an optional level word followed by the message
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "${LOG_DIR}/backup.log"
}
log_verbose() {
if [[ "$VERBOSE" == "true" ]]; then
log "DEBUG" "$*"
fi
}
error() {
log "ERROR" "$*"
send_notification "ERROR" "$*"
}
notify() {
local subject="$1"
local body="$2"
if [[ -n "$ALERT_EMAIL" ]] && command -v mailx &>/dev/null; then
echo "$body" | mailx -s "[BACKUP] $subject" "$ALERT_EMAIL"
fi
}
send_notification() {
local status="$1"
local message="$2"
notify "$status" "$message"
}
# Check if command exists
require_command() {
local cmd="$1"
if ! command -v "$cmd" &>/dev/null; then
error "Required command not found: $cmd"
exit 1
fi
}
# ============================================================================
# Setup Functions
# ============================================================================
setup_directories() {
log "Setting up backup directories..."
mkdir -p "$BACKUP_ROOT"
mkdir -p "$STATE_DIR"
mkdir -p "$LOG_DIR"
# Set permissions
chmod 700 "$BACKUP_ROOT"
chmod 700 "$STATE_DIR"
log "Backup directories created at $BACKUP_ROOT"
}
# ============================================================================
# Backup Functions
# ============================================================================
# Get compression options
get_compression_args() {
case "$COMPRESSION" in
gzip)
echo "-z"
;;
bzip2)
echo "-j"
;;
xz)
echo "-J"
;;
none)
echo ""
;;
*)
echo "-z"
;;
esac
}
# Get compression extension
get_compression_ext() {
case "$COMPRESSION" in
gzip) echo ".gz" ;;
bzip2) echo ".bz2" ;;
xz) echo ".xz" ;;
none) echo "" ;;
*) echo ".gz" ;;
esac
}
# Get file extension
get_backup_ext() {
local ext
ext=$(get_compression_ext)
echo ".tar${ext}"
}
# Create incremental backup
create_incremental_backup() {
local source="$1"
local timestamp="$2"
local backup_type="$3"
local source_name
source_name=$(basename "$source")
local backup_dir="${BACKUP_ROOT}/${source_name}"
local snapshot_file="${STATE_DIR}/${source_name}.snapshot"
mkdir -p "$backup_dir"
local backup_file="${backup_dir}/${source_name}.${timestamp}"
log "Creating $backup_type backup of $source"
local tar_args
tar_args=$(get_compression_args)
local archive
if [[ "$backup_type" == "full" ]]; then
# Full backup - start a fresh tar snapshot so later incrementals
# are taken relative to this level-0 archive
archive="${backup_file}.full.tar.gz"
rm -f "$snapshot_file"
tar --listed-incremental="$snapshot_file" $tar_args -cf "$archive" -C "$(dirname "$source")" "$(basename "$source")" 2>/dev/null || {
error "Failed to create full backup of $source"
return 1
}
else
# Incremental backup
if [[ ! -f "$snapshot_file" ]]; then
log "No snapshot found, creating full backup instead"
create_incremental_backup "$source" "$timestamp" "full"
return $?
fi
archive="${backup_file}.incr.tar.gz"
tar --listed-incremental="$snapshot_file" $tar_args -cf "$archive" -C "$(dirname "$source")" "$(basename "$source")" 2>/dev/null || {
error "Failed to create incremental backup of $source"
return 1
}
fi
# Calculate checksum
local checksum_file="${archive}.sha256"
sha256sum "$archive" > "$checksum_file"
# Verify backup
if tar -tzf "$archive" &>/dev/null; then
local size
size=$(du -h "$archive" | cut -f1)
log "Backup created successfully: $archive ($size)"
return 0
else
error "Backup verification failed"
rm -f "$archive" "$checksum_file"
return 1
fi
}
# Create full backup (simpler version)
create_full_backup() {
local source="$1"
local timestamp="$2"
if [[ ! -d "$source" ]]; then
error "Source directory does not exist: $source"
return 1
fi
local source_name
source_name=$(basename "$source")
local backup_dir="${BACKUP_ROOT}/${source_name}"
local backup_file="${backup_dir}/${source_name}.${timestamp}"
mkdir -p "$backup_dir"
log "Creating full backup of $source"
# Get compression args
local tar_args
tar_args=$(get_compression_args)
local ext
ext=$(get_compression_ext)
# Create archive
tar $tar_args -cf "${backup_file}${ext}" -C "$(dirname "$source")" "$(basename "$source")" || {
error "Failed to create backup archive"
return 1
}
# Calculate checksum
local checksum_file="${backup_file}${ext}.sha256"
sha256sum "${backup_file}${ext}" > "$checksum_file"
# Create manifest
local manifest="${backup_file}${ext}.manifest"
{
echo "Backup: ${backup_file}${ext}"
echo "Source: $source"
echo "Date: $(date)"
echo "Hostname: $(hostname)"
echo ""
echo "Files:"
tar $tar_args -tf "${backup_file}${ext}" | wc -l
} > "$manifest"
# Verify backup
if tar $tar_args -tf "${backup_file}${ext}" &>/dev/null; then
local size
size=$(du -h "${backup_file}${ext}" | cut -f1)
log "Backup created successfully: ${backup_file}${ext} ($size)"
return 0
else
error "Backup verification failed"
rm -f "${backup_file}${ext}" "$checksum_file" "$manifest"
return 1
fi
}
# Backup to remote server
create_remote_backup() {
local source="$1"
if [[ -z "$REMOTE_HOST" ]]; then
error "Remote host not configured"
return 1
fi
require_command "rsync"
log "Creating remote backup of $source to ${REMOTE_HOST}:${REMOTE_PATH}"
local source_name
source_name=$(basename "$source")
rsync -avz --progress \
-e "ssh -o StrictHostKeyChecking=yes" \
--delete \
"$source/" "${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_PATH}/${source_name}/" || {
error "Remote backup failed"
return 1
}
log "Remote backup completed"
return 0
}
# ============================================================================
# Restore Functions
# ============================================================================
# List available backups
list_backups() {
echo "=== Available Backups ==="
echo ""
if [[ ! -d "$BACKUP_ROOT" ]]; then
echo "No backups found"
return
fi
echo "Backup Directory: $BACKUP_ROOT"
echo ""
for backup_dir in "$BACKUP_ROOT"/*; do
if [[ -d "$backup_dir" ]]; then
local name
name=$(basename "$backup_dir")
echo "Source: $name"
local count size
count=$(find "$backup_dir" -name "*.tar*" | wc -l)
size=$(du -sh "$backup_dir" | cut -f1)
echo " Files: $count"
echo " Total Size: $size"
echo " Backups:"
find "$backup_dir" -name "*.tar*" -type f -printf " %f (%TY-%Tm-%Td %TH:%TM)\n" | sort -r
echo ""
fi
done
}
# Verify backup integrity
verify_backup() {
local backup_file="$1"
log "Verifying backup: $backup_file"
if [[ ! -f "$backup_file" ]]; then
error "Backup file not found: $backup_file"
return 1
fi
# Check checksum
local checksum_file="${backup_file}.sha256"
if [[ -f "$checksum_file" ]]; then
if sha256sum -c "$checksum_file" &>/dev/null; then
log "Checksum verified: OK"
else
error "Checksum verification FAILED"
return 1
fi
else
log "Warning: No checksum file found"
fi
# Check archive integrity
local ext
ext="${backup_file##*.}"
case "$ext" in
gz)
if gzip -t "$backup_file" 2>/dev/null; then
log "Archive integrity: OK"
else
error "Archive integrity check FAILED"
return 1
fi
;;
bz2)
if bzip2 -t "$backup_file" 2>/dev/null; then
log "Archive integrity: OK"
else
error "Archive integrity check FAILED"
return 1
fi
;;
xz)
if xz -t "$backup_file" 2>/dev/null; then
log "Archive integrity: OK"
else
error "Archive integrity check FAILED"
return 1
fi
;;
esac
log "Backup verification completed successfully"
return 0
}
# Restore backup
restore_backup() {
local backup_file="$1"
local restore_path="${2:-/tmp/restore}"
log "Restoring backup: $backup_file"
log "Restore path: $restore_path"
if [[ ! -f "$backup_file" ]]; then
error "Backup file not found: $backup_file"
return 1
fi
# Verify first
if ! verify_backup "$backup_file"; then
error "Cannot restore - backup verification failed"
return 1
fi
# Create restore directory
mkdir -p "$restore_path"
# Extract
local ext
ext="${backup_file##*.}"
case "$ext" in
gz)
tar -xzf "$backup_file" -C "$restore_path"
;;
bz2)
tar -xjf "$backup_file" -C "$restore_path"
;;
xz)
tar -xJf "$backup_file" -C "$restore_path"
;;
*)
tar -xf "$backup_file" -C "$restore_path"
;;
esac
log "Restore completed to $restore_path"
return 0
}
# ============================================================================
# Cleanup Functions
# ============================================================================
# Cleanup old backups
cleanup_old_backups() {
log "Starting cleanup of backups older than $RETENTION_DAYS days"
local deleted_count=0
local deleted_size=0
while IFS= read -r -d '' file; do
local size
size=$(du -b "$file" | cut -f1)
rm -f "$file"
# Also remove related files (checksum, manifest)
rm -f "${file}.sha256" 2>/dev/null
rm -f "${file}.manifest" 2>/dev/null
# Plain assignment instead of ((var++)), which exits non-zero (and
# trips set -e) when the pre-increment value is 0
deleted_count=$((deleted_count + 1))
deleted_size=$((deleted_size + size))
log "Deleted: $file"
done < <(find "$BACKUP_ROOT" -name "*.tar*" -type f -mtime +"$RETENTION_DAYS" -print0)
local deleted_size_human
deleted_size_human=$((deleted_size / 1024 / 1024))
log "Cleanup completed: $deleted_count files deleted (${deleted_size_human}MB)"
send_notification "Cleanup Complete" "Deleted $deleted_count old backup(s)"
}
# ============================================================================
# Main Functions
# ============================================================================
# Perform backup
do_backup() {
local sources=("$@")
if [[ ${#sources[@]} -eq 0 ]]; then
sources=("${DEFAULT_SOURCES[@]}")
fi
local timestamp
timestamp=$(date +%Y%m%d_%H%M%S)
log "Starting backup process"
log "Sources: ${sources[*]}"
setup_directories
local failed=0
local success=0
for source in "${sources[@]}"; do
if [[ -d "$source" ]]; then
if create_full_backup "$source" "$timestamp"; then
success=$((success + 1))
else
failed=$((failed + 1))
fi
else
log "Warning: Source not found or not a directory: $source"
failed=$((failed + 1))
fi
done
# Remote backup (if configured)
if [[ -n "$REMOTE_HOST" ]]; then
for source in "${sources[@]}"; do
if create_remote_backup "$source"; then
log "Remote backup success: $source"
else
log "Remote backup failed: $source"
fi
done
fi
# Cleanup old backups
cleanup_old_backups
if [[ $failed -eq 0 ]]; then
log "Backup completed successfully"
send_notification "Backup Success" "All $success backup(s) completed successfully"
else
error "Backup completed with $failed failure(s)"
fi
}
# Main entry point
main() {
local command="${1:-backup}"
shift || true
case "$command" in
backup)
do_backup "$@"
;;
restore)
if [[ $# -lt 2 ]]; then
echo "Usage: $0 restore <backup_file> <restore_path>"
exit 1
fi
restore_backup "$1" "$2"
;;
list)
list_backups
;;
verify)
if [[ $# -lt 1 ]]; then
echo "Usage: $0 verify <backup_file>"
exit 1
fi
verify_backup "$1"
;;
cleanup)
cleanup_old_backups
;;
*)
echo "Usage: $0 {backup|restore|list|verify|cleanup}"
echo ""
echo "Commands:"
echo " backup [sources...] Create backup (default sources if none specified)"
echo " restore <file> <path> Restore backup"
echo " list List available backups"
echo " verify <file> Verify backup integrity"
echo " cleanup Remove old backups"
exit 1
;;
esac
}
main "$@"
# Make executable
chmod +x backup-manager.sh
# Create backup
sudo ./backup-manager.sh backup /etc /home
# List backups
./backup-manager.sh list
# Verify a backup
./backup-manager.sh verify /backup/etc/etc.20240222_020000.tar.gz
# Restore backup
sudo ./backup-manager.sh restore /backup/etc/etc.20240222_020000.tar.gz /tmp/restore
# Cleanup old backups
sudo ./backup-manager.sh cleanup
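The cleanup step depends on `find -mtime +N`, which matches files modified strictly more than N whole 24-hour periods ago. That off-by-one is easy to get wrong, so a scratch-directory sanity check is worth a minute (file names are illustrative):

```shell
# -mtime +30 matches files whose mtime is more than 30*24h in the past.
work=$(mktemp -d)
touch -d '40 days ago' "$work/old.tar.gz"   # should be swept by retention
touch "$work/new.tar.gz"                    # should survive
find "$work" -name '*.tar.gz' -mtime +30    # prints only old.tar.gz
```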

Exercise 3: Log Analysis

Create a log analysis script that:

  1. Parses multiple log files (syslog, nginx, application logs)
  2. Extracts error patterns (ERROR, WARNING, FATAL, CRITICAL)
  3. Generates statistics (error counts, most common errors)
  4. Creates time-based analysis (hourly, daily distribution)
  5. Filters by severity level
  6. Outputs formatted report (text, JSON, HTML)
  7. Supports real-time monitoring with tail -f
  8. Can search with regex patterns
  9. Tracks patterns over time
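Matching "any of these severities" means joining a pattern array into a single extended-regex alternation such as `ERROR|FATAL|CRITICAL`; matching against the space-joined `"${ARRAY[*]}"` directly would look for one literal string. A sketch of the join (array contents abbreviated):

```shell
# Join a bash array into one extended-regex alternation. The subshell
# keeps the temporary IFS change from leaking into the rest of the script.
patterns=(ERROR FATAL CRITICAL)
regex=$(IFS='|'; echo "${patterns[*]}")
echo "$regex"   # ERROR|FATAL|CRITICAL
printf '%s\n' "ok: started" "ERROR: disk full" "FATAL: oom" | grep -cE "$regex"   # 2
```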
Solution:

#!/usr/bin/env bash
#
# log-analyzer.sh - Production log analysis tool
#
# Features:
# - Multi-format log parsing
# - Pattern detection and statistics
# - Time-based analysis
# - Real-time monitoring
# - Multiple output formats
# - Regex search support
#
# Usage:
# log-analyzer.sh analyze <log_file> Analyze log file
# log-analyzer.sh monitor <log_file> Real-time monitoring
# log-analyzer.sh search <pattern> <log_file> Search pattern
# log-analyzer.sh stats <log_file> Show statistics
# log-analyzer.sh errors <log_file> Show error summary
#
# Author: DevOps Engineer
# Version: 2.0.0
#
set -euo pipefail
# ============================================================================
# Configuration
# ============================================================================
# Log formats
readonly LOG_FORMAT_SYSLOG="%b %d %H:%M:%S %h %s"
readonly LOG_FORMAT_NGINX='%h %l %u %t "%r" %>s %b'
readonly LOG_FORMAT_APACHE='"%r" %>s %b'
# Patterns to detect
readonly ERROR_PATTERNS=(
"ERROR"
"FATAL"
"CRITICAL"
"CRASH"
"EXCEPTION"
"FAILED"
)
readonly WARNING_PATTERNS=(
"WARNING"
"WARN"
"ALERT"
)
# Output settings
OUTPUT_FORMAT="text"
REPORT_FILE=""
# Colors
readonly COLOR_RED='\033[0;31m'
readonly COLOR_GREEN='\033[0;32m'
readonly COLOR_YELLOW='\033[1;33m'
readonly COLOR_BLUE='\033[0;34m'
readonly COLOR_CYAN='\033[0;36m'
readonly COLOR_BOLD='\033[1m'
readonly COLOR_NC='\033[0m'
# ============================================================================
# Utility Functions
# ============================================================================
print_color() {
local color="$1"
shift
echo -e "${color}$*${COLOR_NC}"
}
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"
}
# ============================================================================
# Analysis Functions
# ============================================================================
# Count patterns in file
count_pattern() {
local file="$1"
local pattern="$2"
local count
# grep -c prints the count but exits 1 on zero matches; don't let that
# (or a missing file) abort the script under set -e
count=$(grep -c "$pattern" "$file" 2>/dev/null) || true
echo "${count:-0}"
}
# Count all error patterns (plain assignment instead of ((total+=count)),
# which returns nonzero when the result is 0 and would trip set -e)
count_errors() {
local file="$1"
local total=0
for pattern in "${ERROR_PATTERNS[@]}"; do
local count
count=$(count_pattern "$file" "$pattern")
total=$((total + count))
done
echo "$total"
}
# Count all warning patterns
count_warnings() {
local file="$1"
local total=0
for pattern in "${WARNING_PATTERNS[@]}"; do
local count
count=$(count_pattern "$file" "$pattern")
total=$((total + count))
done
echo "$total"
}
# Summarize error levels by frequency (the pattern array is joined with |
# to form a valid alternation for grep -E; the sed keeps only the level)
get_unique_errors() {
local file="$1"
local limit="${2:-50}"
grep -E "$(IFS='|'; echo "${ERROR_PATTERNS[*]}")" "$file" 2>/dev/null | \
sed -E 's/.*(ERROR|FATAL|CRITICAL).*/\1/' | \
sort | uniq -c | sort -rn | head -n "$limit"
}
# Get time distribution
get_time_distribution() {
local file="$1"
# Try different time formats
# Format: [22/Feb/2024:10:30:45
if grep -qE '\[[0-9]{2}/[A-Za-z]{3}/[0-9]{4}:' "$file"; then
grep -oE '\[[0-9]{2}/[A-Za-z]{3}/[0-9]{4}:[0-9]{2}' "$file" | \
cut -d':' -f2 | sort | uniq -c | sort -rn
return
fi
# Format: 2024-02-22 10:30:45
if grep -qE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:' "$file"; then
grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:' "$file" | \
cut -d' ' -f2 | cut -d':' -f1 | sort | uniq -c | sort -rn
return
fi
# Format: Feb 22 10:30:45 (syslog) - take the hour from the time field
awk '{split($3, t, ":"); print t[1]}' "$file" | sort | uniq -c | sort -rn
}
# Get top errors by frequency (patterns joined with | to form a valid
# alternation for grep -E)
get_top_errors() {
local file="$1"
local num="${2:-10}"
grep -E "$(IFS='|'; echo "${ERROR_PATTERNS[*]}")" "$file" 2>/dev/null | \
sed -E 's/.*(ERROR|FATAL|CRITICAL|EXCEPTION|FAILED)[,:]? ?(.*)/\2/' | \
sed 's/^ *//;s/ *$//' | \
grep -v '^$' | \
sort | uniq -c | sort -rn | head -n "$num"
}
# Get HTTP status code distribution
get_status_distribution() {
local file="$1"
grep -oE ' [0-9]{3} ' "$file" | \
sort | uniq -c | sort -rn | \
awk '{
status=$2
count=$1
if(status ~ /^2/) { category="2xx Success" }
else if(status ~ /^3/) { category="3xx Redirect" }
else if(status ~ /^4/) { category="4xx Client Error" }
else if(status ~ /^5/) { category="5xx Server Error" }
else { category="Other" }
printf " %s (%s): %d\n", status, category, count
}'
}
# Get IP addresses (for access logs)
get_top_ips() {
local file="$1"
local num="${2:-10}"
grep -oE '^[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' "$file" | \
sort | uniq -c | sort -rn | head -n "$num"
}
# Get most requested URLs
get_top_urls() {
local file="$1"
local num="${2:-10}"
grep -oE '"(GET|POST|PUT|DELETE|HEAD|OPTIONS|PATCH) [^"]+' "$file" | \
sed 's/"//g' | sort | uniq -c | sort -rn | head -n "$num"
}
# ============================================================================
# Analysis Output Functions
# ============================================================================
# Analyze log file
analyze_log() {
local file="$1"
if [[ ! -f "$file" ]]; then
log "ERROR: File not found: $file"
return 1
fi
local total_lines
total_lines=$(wc -l < "$file")
local file_size
file_size=$(du -h "$file" | cut -f1)
echo ""
print_color "$COLOR_BOLD" "╔══════════════════════════════════════════════════════════╗"
print_color "$COLOR_BOLD" "║ Log Analysis Report ║"
print_color "$COLOR_BOLD" "╚══════════════════════════════════════════════════════════╝"
echo ""
echo "File: $file"
echo "Size: $file_size"
echo "Total Lines: $total_lines"
echo "Last Modified: $(stat -c '%y' "$file" 2>/dev/null || stat -f '%Sm' "$file")"
echo ""
# Error counts
print_color "$COLOR_BOLD" "=== Error Summary ==="
local error_count warning_count
error_count=$(count_errors "$file")
warning_count=$(count_warnings "$file")
local error_pct=0 warning_pct=0
if [[ "$total_lines" -gt 0 ]]; then
error_pct=$((error_count * 100 / total_lines))
warning_pct=$((warning_count * 100 / total_lines))
fi
print_color "$COLOR_RED" " Errors: $error_count ($error_pct%)"
print_color "$COLOR_YELLOW" " Warnings: $warning_count ($warning_pct%)"
echo ""
# Per-pattern breakdown
print_color "$COLOR_BOLD" "=== Error Breakdown ==="
for pattern in "${ERROR_PATTERNS[@]}"; do
local count
count=$(count_pattern "$file" "$pattern")
if [[ "$count" -gt 0 ]]; then
print_color "$COLOR_RED" " $pattern: $count"
fi
done
echo ""
# Top errors
print_color "$COLOR_BOLD" "=== Top Error Messages ==="
get_top_errors "$file" 10 | while read -r count message; do
if [[ -n "$message" ]]; then
echo " [$count] $message"
fi
done
echo ""
# Time distribution
print_color "$COLOR_BOLD" "=== Time Distribution (Hour) ==="
get_time_distribution "$file" | head -10 | while read -r count hour _; do
printf " %s:00 - %s occurrences\n" "$hour" "$count"
done
echo ""
# Check if it's an access log
if grep -qE 'HTTP/[12]\.[01]' "$file" 2>/dev/null; then
print_color "$COLOR_BOLD" "=== HTTP Status Codes ==="
get_status_distribution "$file"
echo ""
print_color "$COLOR_BOLD" "=== Top IPs ==="
get_top_ips "$file" 10 | while read -r count ip; do
printf " %-15s - %d requests\n" "$ip" "$count"
done
echo ""
print_color "$COLOR_BOLD" "=== Top URLs ==="
get_top_urls "$file" 10 | while read -r count url; do
printf " [%d] %s\n" "$count" "$url"
done
echo ""
fi
# Recent errors
print_color "$COLOR_BOLD" "=== Recent Errors (Last 10) ==="
grep -E "$(IFS='|'; echo "${ERROR_PATTERNS[*]}")" "$file" 2>/dev/null | tail -10 | \
sed 's/^/ /'
echo ""
}
# Real-time monitoring
monitor_log() {
local file="$1"
if [[ ! -f "$file" ]]; then
log "ERROR: File not found: $file"
return 1
fi
log "Starting real-time monitoring of $file (Ctrl+C to stop)"
echo ""
tail -f "$file" | while IFS= read -r line; do
if [[ "$line" =~ ERROR|FATAL|CRITICAL ]]; then
print_color "$COLOR_RED" "[ERROR] $line"
elif [[ "$line" =~ WARNING|WARN ]]; then
print_color "$COLOR_YELLOW" "[WARN] $line"
else
echo "$line"
fi
done
}
# Search pattern
search_pattern() {
local pattern="$1"
local file="$2"
if [[ ! -f "$file" ]]; then
log "ERROR: File not found: $file"
return 1
fi
log "Searching for: $pattern"
local count
count=$(grep -c "$pattern" "$file" 2>/dev/null) || count=0
echo "Found $count matches"
echo ""
grep -n "$pattern" "$file" 2>/dev/null | head -50 | \
while read -r line; do
if [[ "$line" =~ ERROR|FATAL|CRITICAL ]]; then
print_color "$COLOR_RED" "$line"
else
echo "$line"
fi
done
}
# ============================================================================
# Main Function
# ============================================================================
main() {
local command="${1:-analyze}"
shift || true
case "$command" in
analyze)
if [[ $# -lt 1 ]]; then
echo "Usage: $0 analyze <log_file>"
exit 1
fi
analyze_log "$1"
;;
monitor)
if [[ $# -lt 1 ]]; then
echo "Usage: $0 monitor <log_file>"
exit 1
fi
monitor_log "$1"
;;
search)
if [[ $# -lt 2 ]]; then
echo "Usage: $0 search <pattern> <log_file>"
exit 1
fi
search_pattern "$1" "$2"
;;
stats)
if [[ $# -lt 1 ]]; then
echo "Usage: $0 stats <log_file>"
exit 1
fi
analyze_log "$1" # stats is currently an alias for analyze
;;
errors)
if [[ $# -lt 1 ]]; then
echo "Usage: $0 errors <log_file>"
exit 1
fi
grep -E "$(IFS='|'; echo "${ERROR_PATTERNS[*]}")" "$1" 2>/dev/null || echo "No errors found"
;;
*)
echo "Usage: $0 {analyze|monitor|search|stats|errors} [options]"
echo ""
echo "Commands:"
echo " analyze <file> Analyze log file and show statistics"
echo " monitor <file> Real-time monitoring"
echo " search <pat> <f> Search for pattern in log"
echo " stats <file> Show detailed statistics"
echo " errors <file> Show all error lines"
exit 1
;;
esac
}
main "$@"
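One bash detail worth calling out when filtering with pattern arrays like `ERROR_PATTERNS`: expanding an array with `"${arr[*]}"` joins its elements with the first character of `IFS` (a space by default), which does not produce a usable `grep -E` alternation. Joining with `|` inside a command substitution keeps the `IFS` change confined to the subshell. A minimal sketch:

```shell
#!/usr/bin/env bash
# Join an array into a grep -E alternation pattern.
patterns=("ERROR" "FATAL" "CRITICAL")

# IFS is changed only inside the command substitution's subshell,
# so the rest of the script is unaffected.
regex=$(IFS='|'; echo "${patterns[*]}")
echo "$regex"   # ERROR|FATAL|CRITICAL

# Count matching lines in some sample input.
printf '%s\n' "INFO started" "FATAL disk full" "ERROR timeout" | grep -cE "$regex"   # 2
```

By contrast, `grep -E "${patterns[*]}"` would search for the literal string `ERROR FATAL CRITICAL` and match nothing useful.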
# Make executable
chmod +x log-analyzer.sh
# Analyze system log
sudo ./log-analyzer.sh analyze /var/log/syslog
# Analyze nginx access log
./log-analyzer.sh analyze /var/log/nginx/access.log
# Monitor in real-time
sudo ./log-analyzer.sh monitor /var/log/syslog
# Search for specific pattern
./log-analyzer.sh search "connection refused" /var/log/nginx/error.log
# Show all errors
sudo ./log-analyzer.sh errors /var/log/syslog
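To try the commands above without touching production logs, you can generate a small sample file and sanity-check the same error and warning patterns the script matches. The file path comes from `mktemp` and is purely illustrative:

```shell
#!/usr/bin/env bash
# Build a throwaway sample log to exercise the analyzer's patterns.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2024-02-22 10:30:45 ERROR Database connection failed
2024-02-22 10:31:02 WARNING Cache miss rate high
2024-02-22 11:05:13 FATAL Out of memory
2024-02-22 11:06:40 INFO Checkpoint complete
EOF

# Same pattern families the script's ERROR_PATTERNS/WARNING_PATTERNS cover.
grep -cE 'ERROR|FATAL|CRITICAL' "$sample"   # 2
grep -cE 'WARNING|WARN'         "$sample"   # 1

rm -f "$sample"
```

Once the counts look right, run `./log-analyzer.sh analyze` against the sample file before pointing it at real logs.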

In this chapter, you completed practical exercises covering:

  • ✅ System Monitoring Dashboard - Complete production monitoring
  • ✅ Backup Automation Script - Enterprise backup solution
  • ✅ Log Analyzer - Multi-format log analysis

These exercises provide real-world DevOps experience with production-ready code.


You have completed all 46 chapters of the bash-scripting-guide! This comprehensive guide covered everything needed for DevOps, SRE, and SysAdmin positions:

  • Basic to advanced bash concepts
  • Text processing (sed, awk, regex)
  • Process management
  • Error handling
  • Security best practices
  • Performance optimization
  • Real-world automation scripts
  • Interview preparation
  • Practical exercises

Previous Chapter: Interview Questions

End of Guide