Chapter 53: Disk I/O Performance
Overview
Disk I/O performance is critical for system responsiveness and application throughput. Understanding how Linux handles disk operations, from the physical device to the filesystem, is essential for troubleshooting performance issues and optimizing storage subsystems.
This chapter covers disk I/O performance monitoring, tuning, troubleshooting, and optimization techniques for both traditional spinning disks (HDD) and solid-state drives (SSD/NVMe) in production environments.
53.1 Linux I/O Stack Architecture
Understanding the I/O Path
The request path, top to bottom:

```
Application
  │
  ▼
System calls: open(), read(), write(), lseek(), fsync(), mmap()
  │
  ▼
Virtual File System (VFS): unified interface for filesystems;
  dentry cache, inode cache, page cache
  │
  ▼
Specific filesystem (ext4, xfs, btrfs, zfs, ntfs):
  metadata operations, data journaling/buffering, allocation algorithms
  │
  ▼
Block layer: I/O scheduler (mq-deadline, bfq, none);
  request merging, ordering, prioritization; multi-queue support (blk-mq)
  │
  ▼
Device drivers (SCSI, SATA, NVMe, IDE, MMC, virtio-blk):
  command queuing, error recovery
  │
  ▼
Physical device:
  HDD (7,200 RPM, ~100 IOPS) | SATA SSD (~100K IOPS, ~500 MB/s) | NVMe (PCIe Gen4, ~1M+ IOPS, ~7 GB/s)
```

I/O Request Flow
```
1. APPLICATION       read(fd, buffer, 4096)
2. VFS               page-cache hit → return cached data;
                     miss → generate a bio request
3. FILESYSTEM LAYER  translate file offset to block number,
                     handle journaling (if applicable), update metadata
4. BLOCK LAYER       I/O scheduler: merge adjacent requests,
                     sort by sector number, apply QoS policies, batch for throughput
5. DEVICE DRIVER     build device command, queue in hardware
                     (NCQ for SATA, submission queues for NVMe), submit to controller
6. PHYSICAL DEVICE   mechanical seek (HDD) / flash read (SSD), data transfer
7. COMPLETION        interrupt notifies the host, page cache is updated,
                     waiting process is woken
```

53.2 Disk Information and Identification
Identifying Block Devices
Section titled “Identifying Block Devices”# List all block devices with detailslsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL# ROTA=1 for spinning disks, ROTA=0 for SSD/NVMe
# Detailed block device informationlsblk -f
# Block device attributesblkid
# Detailed device informationls -la /sys/block/ls -la /sys/block/sda/
# Get device size in bytesblockdev --getsize64 /dev/sda
# Get device sector countblockdev --getsz /dev/sda
# Check device model and vendorcat /sys/block/sda/device/modelcat /sys/block/sda/device/vendorcat /sys/block/sda/device/revDisk Usage Analysis
```shell
# Filesystem usage (human-readable)
df -h

# Inode usage (important with many small files)
df -i

# Directory sizes (sorted)
du -sh /var/* 2>/dev/null | sort -hr
du -h --max-depth=1 / | sort -hr

# Largest files in a directory
find /var -type f -exec du -h {} + 2>/dev/null | sort -rh | head -20

# Disk usage by file type
du -sh /var/log/*.log
du -sh /var/*/tmp

# Real-time disk usage monitoring
watch -n 1 'df -h /'

# UUIDs and labels
blkid
ls -la /dev/disk/by-uuid/
ls -la /dev/disk/by-label/
```

Partition Information
```shell
# MBR partition table
fdisk -l /dev/sda

# GPT partition table (modern)
gdisk -l /dev/sda

# Parted (advanced)
parted -l

# Show partition UUIDs
blkid | grep /dev/sda

# Partition details from /proc
cat /proc/partitions

# Partition alignment check
parted /dev/sda align-check opt 1
```

53.3 I/O Monitoring Tools
iostat - I/O Statistics
```shell
# Extended statistics (most useful)
iostat -x 1

# Per-device statistics
iostat -x sda 1

# CPU and device statistics
iostat -c -x 1

# Show kilobytes (better for large I/O)
iostat -xk 1

# Report with timestamps
iostat -t -x 1

# Custom interval with count
iostat -x 5 3   # 3 reports, 5-second intervals
```

Key iostat Metrics Explained:
| Metric | Description | Healthy Value | Concern Value |
|---|---|---|---|
| r/s | Read requests completed/sec | Varies by workload | >1000 sustained |
| w/s | Write requests completed/sec | Varies by workload | >1000 sustained |
| rkB/s | Kilobytes read/sec | Depends on workload | N/A |
| wkB/s | Kilobytes written/sec | Depends on workload | N/A |
| await | Average I/O wait time (ms) | <10ms | >50ms |
| %util | Device utilization | <70% | >90% sustained |
| svctm | Average service time (ms); deprecated in recent sysstat | <10ms | >20ms |
| aqu-sz | Average queue length | <2 | >10 |
| rrqm/s | Read requests merged/sec | Moderate | Very high |
| wrqm/s | Write requests merged/sec | Moderate | Very high |
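The `%util` threshold above is easy to check programmatically. The sketch below filters an `iostat -x` style report for saturated devices with awk; the sample text is fabricated for illustration and stands in for real `iostat -dx 1 1` output.

```shell
# Flag devices whose %util (last column of iostat -x output) exceeds 90
sample='Device            r/s     w/s     await  %util
sda             120.00  300.00    45.10  95.40
sdb              10.00    5.00     1.20  12.00'

busy=$(echo "$sample" | awk 'NR>1 && $NF+0 > 90 {print $1}')
echo "$busy"   # → sda
```

The same awk filter works unchanged when piped from a live `iostat -dx 1 1` run.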
iotop - Per-Process I/O
```shell
# Interactive I/O monitoring (requires root)
iotop

# Show only processes doing active I/O
iotop -o

# Show accumulated I/O instead of rates
iotop -a

# Batch mode (useful for logging)
iotop -b -o -n 3   # 3 iterations

# Aggregate by process (not thread)
iotop -P

# Active processes only, aggregated
iotop -o -P

# Keyboard shortcuts in iotop:
# o - show only active I/O
# p - toggle processes/threads
# a - show accumulated I/O
# q - quit
```

pidstat - Process I/O Statistics
```shell
# I/O by process (1-second intervals)
pidstat -d 1

# I/O for a specific PID
pidstat -d -p 1234 1

# Include the full command line
pidstat -d -l 1

# Horizontal output, one line per task (easy to parse)
pidstat -d -h 1
```

vmstat - Virtual Memory Statistics
```shell
# All system statistics
vmstat 1

# Watch the bi (blocks in) and bo (blocks out) columns
vmstat 1

# Slab cache info
vmstat -m

# Display in MB units
vmstat -S m 1
```

Key vmstat columns for I/O:
| Column | Description | Concern Value |
|---|---|---|
| si | Swap in (from disk) | >0 sustained |
| so | Swap out (to disk) | >0 sustained |
| bi | Blocks in (disk read) | High |
| bo | Blocks out (disk write) | High |
| wa | I/O wait CPU | >20% |
| cs | Context switches | Very high |
| us | User CPU time | >80% |
sar - System Activity Reporter
```shell
# Disk I/O statistics
sar -d 1

# I/O and transfer rates
sar -b 1

# All activities
sar -A 1

# Historical data from logs
sar -d -f /var/log/sa/sa01

# Report format
sar -d 1 1 | head -20
```

Advanced Monitoring Scripts
```shell
#!/bin/bash
# monitor_io.sh - Continuous I/O monitoring

while true; do
    clear
    echo "=== Disk I/O Statistics ==="
    date

    echo -e "\n--- Device Utilization ---"
    iostat -x 1 1 | tail -n +7

    echo -e "\n--- Top I/O Processes ---"
    if command -v iotop &> /dev/null; then
        iotop -b -o -n 1 | head -10
    else
        ps aux | sort -k6 -r | head -10
    fi

    echo -e "\n--- I/O Wait ---"
    vmstat 1 2 | tail -1

    sleep 5
done
```

53.4 I/O Scheduler Deep Dive
Available Schedulers
| Scheduler | Best For |
|---|---|
| mq-deadline | Databases, VM storage, mixed workloads; low latency, prevents starvation |
| bfq | Desktop, interactive, multimedia; fair queuing, low latency |
| none | SSD, NVMe, direct I/O; no seek penalty, pure throughput |
| cfq (legacy) | Older systems, spinning disks; deprecated in favor of mq-deadline |

NOTE: Modern kernels (4.x+) use blk-mq (block multi-queue), which provides better performance for modern devices.

Scheduler Configuration
```shell
# View current scheduler (per device)
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Example output (active scheduler in brackets; the list varies by kernel):
# [none] mq-deadline bfq

# Temporarily change scheduler (runtime)
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler

# For SSD/NVMe - use 'none'
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

# Verify the change
cat /sys/block/sda/queue/scheduler
```

Persistent Scheduler Configuration
Method 1: udev Rules
```
# For SSD/NVMe - no scheduler needed
# (match on queue/rotational to tell SATA SSDs and HDDs apart)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="nvme[0-9]n1", ATTR{queue/scheduler}="none"

# For HDD - use mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```

```shell
# Reload udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger
```

Method 2: Kernel Boot Parameters
```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=mq-deadline"
# Note: elevator= is ignored by blk-mq kernels (5.0+); prefer udev rules there

# Apply
sudo update-grub
```

Method 3: systemd service
```
[Unit]
Description=Set I/O Scheduler
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo mq-deadline > /sys/block/sda/queue/scheduler'
ExecStart=/bin/sh -c 'echo none > /sys/block/nvme0n1/queue/scheduler'

[Install]
WantedBy=multi-user.target
```

Scheduler-Specific Tuning
```shell
# mq-deadline tuning (defaults are usually fine)
ls /sys/block/sda/queue/iosched/
# read_expire (ms), write_expire (ms), fifo_batch, front_merges

# Example: lower latency targets
echo 100 > /sys/block/sda/queue/iosched/write_expire
echo 50 > /sys/block/sda/queue/iosched/read_expire

# bfq tuning
ls /sys/block/sda/queue/iosched/
# slice_idle, low_latency, max_budget, etc.

# Disable front merges (rarely helps; test before using)
echo 0 > /sys/block/sda/queue/iosched/front_merges
```

53.5 I/O Tuning Parameters
Virtual Memory and Dirty Pages
```shell
# /etc/sysctl.conf - disk I/O related parameters

# Dirty page ratios
vm.dirty_ratio = 15              # % of RAM at which writing processes block on writeback
vm.dirty_background_ratio = 5    # % of RAM at which background writeback starts

# More aggressive for database servers
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000     # when dirty pages are "old" enough to flush
vm.dirty_writeback_centisecs = 500   # writeback daemon wakeup interval

# Page cache tuning
vm.vfs_cache_pressure = 100      # higher = more aggressive dentry/inode reclaim

# Memory management
vm.min_free_kbytes = 65536       # reserve for emergency allocations
```
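To see how close the system currently is to these thresholds, compare the live dirty-page counters against the configured ratios. This is a read-only check and needs no root:

```shell
# Current amount of dirty (not yet written back) page-cache data
grep -E '^(Dirty|Writeback):' /proc/meminfo

# The thresholds currently in effect
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
```

A steadily growing `Dirty:` figure that approaches `dirty_ratio` percent of RAM means writers are about to start blocking.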
```shell
# Apply changes
sudo sysctl -p
```

Block Device Parameters
```shell
# Read-ahead (default usually 128KB/256KB)
blockdev --getra /dev/sda
# Returns 512-byte sectors, so 256 = 128KB

# Set read-ahead
blockdev --setra 4096 /dev/sda   # 2MB, for sequential reads
```
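Since `--getra`/`--setra` work in 512-byte sectors, converting to bytes is a quick sanity check. The arithmetic below mirrors the 4096-sector example:

```shell
# blockdev read-ahead values are in 512-byte sectors
ra_sectors=4096
ra_kb=$(( ra_sectors * 512 / 1024 ))
echo "${ra_kb} KB"   # → 2048 KB (2MB)
```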
```shell
# Permanent setting via udev
# /etc/udev/rules.d/60-readahead.rules
ACTION=="add|change", KERNEL=="sda", ATTR{bdi/read_ahead_kb}="4096"

# Queue depth
cat /sys/block/sda/queue/nr_requests
# Default 128; can be increased for high-IOPS devices

# Request size
cat /sys/block/sda/queue/max_sectors_kb
# Max request size in KB

# Enable/disable features
echo 0 > /sys/block/sda/queue/add_random   # stop contributing to the entropy pool
echo 0 > /sys/block/sda/queue/iostats      # disable I/O stats collection
```

Filesystem Mount Options
```shell
# /etc/fstab examples

# Optimized ext4 for SSD
/dev/sda1  /         ext4  defaults,noatime,nodiratime,errors=remount-ro  0 1

# Optimized for data
/dev/sdb1  /data     ext4  defaults,noatime,nodiratime,data=writeback     0 2

# XFS for large files
/dev/sdc1  /var/log  xfs   defaults,noatime,nodiratime,logbufs=8          0 2

# Common options:
# noatime        - don't update access times (major win; implies nodiratime)
# nodiratime     - don't update directory access times
# relatime       - only update atime if older than mtime/ctime
# data=journal   - full journaling (slower but safer)
# data=writeback - faster, less safe
# barrier=0      - disable write barriers (dangerous on power loss)
# nobarrier      - same as barrier=0
# commit=60      - sync every 60 seconds (default 5)
```

Kernel Parameters Reference
```shell
# All I/O related sysctls
sysctl -a | grep -E "vm\.(dirty|writeback|vfs_cache)"

# Important parameters
vm.dirty_ratio                 # max dirty % before synchronous writeback
vm.dirty_background_ratio      # start background writeback
vm.dirty_expire_centisecs      # when dirty data is "old"
vm.dirty_writeback_centisecs   # writeback daemon interval
vm.dirtytime_expire_seconds    # lazytime dirty-inode expiry
vm.vfs_cache_pressure          # dentry/inode cache reclaim tendency
vm.min_free_kbytes             # reserved memory
vm.overcommit_memory           # memory overcommit policy
vm.overcommit_ratio            # overcommit percentage
```

53.6 SSD and NVMe Optimization
SSD-Specific Tuning
SSD optimization checklist:

- Use the 'none' I/O scheduler
- Enable TRIM support
- Disable access-time updates (noatime)
- Align partitions to 1MB (2048 × 512-byte sectors)
- Enable the discard mount option (or use fstrim.timer)
- Relax journaling if safe (data=writeback)
- Use ext4 or xfs (mature, well-tested)
- Disable swap if RAM is sufficient
- Increase read-ahead for sequential workloads

Enabling TRIM
```shell
# Check if device supports TRIM
hdparm -I /dev/sda | grep -i trim

# Or for NVMe (Deallocate/DSM support is reported in the ONCS field)
nvme id-ctrl /dev/nvme0 | grep -i oncs

# Manual TRIM
sudo fstrim -v /

# Enable weekly TRIM (recommended)
sudo systemctl enable fstrim.timer
sudo systemctl start fstrim.timer

# Check timer status
systemctl status fstrim.timer

# discard mount option (TRIM on delete)
# /etc/fstab
/dev/sda1  /  ext4  defaults,discard  0 1
# Note: discard can hurt performance on some SSDs; fstrim.timer is usually better
```

Partition Alignment
```shell
# Check alignment (should report "aligned")
parted /dev/sda
(parted) align-check optimal 1

# Create an aligned partition
parted /dev/sda
(parted) mklabel gpt
(parted) mkpart primary ext4 1MiB -1
# Starting at 1MiB ensures alignment
```
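The 1MiB rule can be checked by hand: a partition is aligned when its start sector times the 512-byte sector size is an exact multiple of 1MiB. A quick sketch:

```shell
# A start sector is 1-MiB aligned when start * 512 is divisible by 1048576
start=2048
if [ $(( start * 512 % 1048576 )) -eq 0 ]; then
    result=aligned
else
    result=misaligned
fi
echo "$result"   # → aligned (2048 * 512 bytes = exactly 1 MiB)
```

Read the start sector for a real partition from `/sys/block/sda/sda1/start` and substitute it for `start`.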
```shell
# Or use sfdisk
sfdisk /dev/sda << 'EOF'
label: gpt
device: /dev/sda
unit: sectors

/dev/sda1 : start=2048, size=524288, type=0FC63DAF-8483-4772-8E79-3D69D8477DF4
EOF
# start=2048 × 512-byte sectors = 1MiB alignment
```

NVMe Optimization
```shell
# NVMe-specific settings
# Most are already optimized by default

# Check NVMe info
nvme list
nvme id-ctrl /dev/nvme0

# NVMe power state
nvme get-feature /dev/nvme0 -f 2   # feature 2 = power management

# Set I/O queue depth (if needed)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# Confirm the device is reported as non-rotational
cat /sys/block/nvme0n1/queue/rotational
# Should be 0

# PCIe settings
lspci -vv | grep -A10 NVMe
```

53.7 Troubleshooting I/O Issues
Diagnosing High I/O Wait
```shell
#!/bin/bash
echo "=== High I/O Wait Diagnosis ==="
echo ""

echo "--- vmstat output (check 'wa' column) ---"
vmstat 1 5

echo -e "\n--- iostat (check '%util' and 'await') ---"
iostat -x 1 5

echo -e "\n--- Top I/O consuming processes ---"
if command -v iotop &> /dev/null; then
    iotop -b -n 3 -o | head -20
else
    ps aux | sort -k6 -r | head -10
fi

echo -e "\n--- I/O by filesystem ---"
df -h
mount | column -t

echo -e "\n--- Check for I/O errors ---"
dmesg | grep -i "i/o error" | tail -20
dmesg | grep -i "sd[a-z]" | tail -20

echo -e "\n--- Check for filesystem issues ---"
cat /proc/mounts
```

Common I/O Issues and Solutions
Section titled “Common I/O Issues and Solutions”| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| High I/O wait | wa > 20% in vmstat | iostat, iotop | Add I/O capacity, optimize workloads |
| High await | await > 50ms | iostat -x | Check disk health, reduce queue depth |
| Disk bottleneck | %util ~100% | iostat | Upgrade storage, add cache |
| Fragmentation | Slow sequential reads | fsck, iostat | Defragment, balance |
| Full filesystem | ENOSPC errors | df -i, df -h | Add space, clean up |
| Inode exhaustion | ENOSPC but space available | df -i | Clean small files |
| LVM issues | Slow I/O | lvs, pvs | Check thin pool, cache |
| NFS issues | High latency | nfsstat, iostat | Tune NFS options |
Finding I/O Heavy Processes
```shell
# Method 1: iotop (requires root)
sudo iotop -o -P

# Method 2: pidstat
pidstat -d 1 | grep -v "^$"

# Method 3: from /proc
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/io ]; then
        echo "PID $pid: $(grep -E 'read_bytes|write_bytes' /proc/$pid/io 2>/dev/null)"
    fi
done | sort -k2 -r | head -10

# Method 4: sar
sar -d 1 | tail -20
```

Disk Health Monitoring
```shell
# SMART status
sudo smartctl -a /dev/sda
sudo smartctl -H /dev/sda

# Short SMART test
sudo smartctl -t short /dev/sda

# Long SMART test
sudo smartctl -t long /dev/sda

# Check SMART results
sudo smartctl -l selftest /dev/sda

# NVMe health
sudo nvme smart-log /dev/nvme0
sudo nvme error-log /dev/nvme0
```

Checking Filesystem Health
```shell
# Check filesystem (unmount first!)
sudo umount /dev/sda1
sudo fsck -f /dev/sda1

# Read-only check (-n makes no changes)
sudo fsck -n /dev/sda1

# xfs_repair
sudo xfs_repair -n /dev/sda1   # check only
sudo xfs_repair /dev/sda1

# btrfs check
sudo btrfs check --readonly /dev/sda1
```

53.8 Performance Benchmarking
fio - Flexible I/O Tester
```shell
# Install fio
sudo apt-get install fio   # Debian/Ubuntu
sudo yum install fio       # RHEL/CentOS

# Sequential read test
fio --name=seqread --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=read --size=1G --runtime=60 --time_based --group_reporting

# Sequential write test
fio --name=seqwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=write --size=1G --runtime=60 --time_based --group_reporting

# Random read test (4K)
fio --name=randread --ioengine=libaio --direct=1 --bs=4k --iodepth=64 \
    --numjobs=4 --rw=randread --size=1G --runtime=60 --time_based --group_reporting

# Random write test (4K)
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=64 \
    --numjobs=4 --rw=randwrite --size=1G --runtime=60 --time_based --group_reporting

# Mixed workload (70% read, 30% write)
fio --name=mixed --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=randrw --rwmixread=70 --size=1G --runtime=60 --time_based --group_reporting

# Quick benchmark
fio --name=quick --ioengine=sync --direct=0 --bs=4k --numjobs=1 \
    --rw=randread --size=100M --runtime=30 --time_based
```

hdparm - Simple Benchmark
```shell
# Sequential read speed
sudo hdparm -t /dev/sda
sudo hdparm -tT /dev/sda   # buffered and cached reads

# Multiple runs
for i in 1 2 3; do sudo hdparm -t /dev/sda; done
```

Bonnie++ - Filesystem Benchmark
```shell
# Install
sudo apt-get install bonnie++

# Run benchmark
bonnie++ -u root -d /tmp -s 2048 -n 100

# HTML report
bonnie++ -u root -d /tmp -s 4096 -f -b -n 10 | bon_csv2html > report.html
```

dd - Simple Test
```shell
# Sequential write test
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct

# Sequential read test
dd if=/tmp/testfile of=/dev/null bs=1M count=1024 iflag=direct

# With progress reporting
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct status=progress

# Note: dd is not a rigorous benchmark; use fio for proper results
```

53.9 Production Configuration Examples
Database Server (PostgreSQL/MySQL)
```shell
# /etc/sysctl.conf additions for database servers
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 5000
vm.dirty_writeback_centisecs = 500
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 95

# Apply
sudo sysctl -p

# /etc/fstab for the database partition
# (barrier=0 trades crash safety for speed; only use with battery-backed cache)
/dev/sdb1  /var/lib/postgresql  ext4  defaults,noatime,nodiratime,data=writeback,barrier=0  0 2

# I/O scheduler (none for SSD)
echo none > /sys/block/sda/queue/scheduler

# Queue settings
echo 256 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
```

Web Server / Cache Server
```shell
# /etc/sysctl.conf for a web server
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.vfs_cache_pressure = 50

# Allow more open files
fs.file-max = 65536

# Apply
sudo sysctl -p

# /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535

# Mount options for the webroot
/dev/sda1  /var/www  ext4  defaults,noatime,nodiratime,data=writeback  0 2
```

High-Performance File Server
Section titled “High-Performance File Server”vm.dirty_ratio = 60vm.dirty_background_ratio = 20vm.dirty_expire_centisecs = 10000vm.dirty_writeback_centisecs = 1000
# Applysudo sysctl -p
# Mount options# Use XFS for large files/dev/sdc1 /exports/shares xfs defaults,noatime,nodiratime,logbufs=8,logdev=/dev/sdd1 0 2
# I/O schedulerecho mq-deadline > /sys/block/sda/queue/scheduler53.10 Interview Questions
Q1: How do you diagnose high I/O wait on a Linux system?

Answer:
- Use `vmstat 1` to check the `wa` (I/O wait) column; values above 20% indicate an I/O bottleneck
- Use `iostat -x 1` to identify which devices are saturated; look for `%util` near 100% and high `await` times
- Use `iotop` to find the processes generating the most I/O
- Check whether it is a capacity issue with `df -h` and `df -i`
- Verify disk health with `smartctl -a /dev/sda`
- Check for I/O errors with `dmesg | grep -i error`
Q2: What is the difference between mq-deadline, bfq, and none I/O schedulers?
Answer:
- mq-deadline: Best for most workloads including databases and VMs. Implements deadline-based scheduling to prevent request starvation, merges adjacent requests, sorts by sector for disk efficiency
- bfq (Budget Fair Queuing): Best for desktop/interactive workloads. Provides fair queuing to prevent a single process from monopolizing I/O, better latency for interactive applications
- none: Best for SSD/NVMe devices that have no seek penalty. Bypasses scheduler overhead entirely, lets the device’s internal queue handle optimization
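A quick way to compare schedulers across all devices at once is to walk the sysfs queue entries (the bracketed name is the active scheduler). Device paths vary by system, so the loop below tolerates missing or unreadable files:

```shell
# Collect the scheduler line for every block-device queue
out=$(for q in /sys/block/*/queue/scheduler; do
    [ -r "$q" ] || continue
    printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
done)
echo "$out"
```

On a mixed machine you would expect HDDs to show `[mq-deadline]` and NVMe devices `[none]` after applying the udev rules from section 53.4.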
Q3: How do you optimize an SSD for better performance?
Section titled “Q3: How do you optimize an SSD for better performance?”Answer:
- Use
noneI/O scheduler (no seek penalty) - Enable TRIM via
fstrim.timeror mount optiondiscard - Use
noatime,nodiratimemount options to avoid unnecessary writes - Ensure partitions are aligned to 1MB boundaries
- Disable or reduce swap if sufficient RAM available
- Use modern filesystems like ext4 or xfs with appropriate options
- Increase read-ahead for sequential workloads:
blockdev --setra 4096 /dev/sda
Q4: What is the difference between synchronous and asynchronous I/O?
Answer:
- Synchronous I/O: the process blocks until the I/O completes. Simpler programming model; data has at least reached the kernel before the call returns (and the disk, with O_SYNC). Default for most applications; can limit throughput.
- Asynchronous I/O: the process continues while I/O proceeds in the background. Higher throughput for concurrent operations, but a more complex programming model. Examples: `io_submit()`, `aio_read()`, `io_uring` (modern)

Applications can use O_SYNC for synchronous writes, or O_DIRECT for direct I/O that bypasses the page cache.
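The cost of synchronous writes is easy to see with dd: a default run returns once the data sits in the page cache, while `oflag=sync` forces each block to storage before the call returns. A small sketch writing a scratch file in the current directory (the path is illustrative):

```shell
# Buffered write: dd returns once the data is in the page cache
dd if=/dev/zero of=./aio_demo bs=4k count=256 2>&1 | tail -n 1

# Synchronous write: each 4k block is flushed before the next (O_SYNC)
dd if=/dev/zero of=./aio_demo bs=4k count=256 oflag=sync 2>&1 | tail -n 1

rm ./aio_demo
```

On most disks the second run reports dramatically lower throughput, which is why databases batch and fsync rather than open everything O_SYNC.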
Q5: How does the Linux page cache work and how does it affect I/O performance?
Answer:
- The page cache stores recently accessed disk data in RAM
- Reads are served from cache if available (very fast)
- Writes go to page cache first (write-back), then flushed to disk
- This dramatically improves performance for repeated reads
- The `dirty_ratio` and `dirty_background_ratio` sysctls control when data is written
- `vm.vfs_cache_pressure` controls how aggressively the dentry/inode caches are reclaimed
- The cache can be bypassed with the `O_DIRECT` flag for applications needing explicit control
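The cache effect can be observed directly: read the same file once with `O_DIRECT` (bypassing the cache, full stack traversal) and once normally (served from the page cache). File name and size are illustrative; `iflag=direct` errors out on filesystems without O_DIRECT support (e.g. tmpfs):

```shell
# Create a 64 MB test file and flush it to stable storage
dd if=/dev/zero of=./pagecache_demo bs=1M count=64 conv=fsync 2>/dev/null

# Direct read: O_DIRECT bypasses the page cache (full stack traversal)
dd if=./pagecache_demo of=/dev/null bs=1M iflag=direct 2>&1 | tail -n 1

# Cached read: usually far faster, served straight from the page cache
dd if=./pagecache_demo of=/dev/null bs=1M 2>&1 | tail -n 1

rm ./pagecache_demo
```

The throughput gap between the two reads is the page cache at work.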
Q6: What is the relationship between ionice and I/O scheduling?
Answer:
ionice sets the I/O scheduling class and priority for processes:
- Real-time (class 1): Highest priority, can starve other processes
- Best-effort (class 2): Default, priority 0-7
- Idle (class 3): Only runs when disk is idle
Examples:
```shell
# Run rsync at idle priority
ionice -c 3 -p $(pgrep rsync)

# Run database at real-time
ionice -c 1 -n 0 -p $(pgrep mysql)

# Run with best-effort low priority
ionice -c 2 -n 7 -p $(pgrep tar)
```

Q7: How do you identify and resolve I/O bottlenecks in a production environment?
Answer:
- Identify: use `iostat -x` to find devices with high `%util` and `await`
- Analyze: use `iotop` to find culprit processes
- Check: verify disk health, capacity, and fragmentation
- Optimize:
- Add cache (L2ARC, bcache, dm-cache)
- Use faster storage (SSD, NVMe)
- Tune I/O scheduler
- Adjust dirty page parameters
- Use appropriate filesystem options
- Consider load distribution (RAID, distributed storage)
- Monitor: Set up alerting for I/O metrics
Q8: Explain the difference between write-through and write-back caching.
Answer:
- Write-through: Data is written to cache AND to storage simultaneously. Slower writes but never loses data on power failure. Used when data integrity is critical.
- Write-back: Data is written to cache first, then asynchronously flushed to storage. Much faster writes but risk of data loss on crash. This is the default in Linux.
In Linux, ext4 enables write barriers by default (`barrier=1`), preserving journal write ordering so write-back caching stays crash-safe; `barrier=0`/`nobarrier` removes that protection for speed. Applications needing write-through semantics can open files with `O_SYNC` or call `fsync()` explicitly.
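The cost difference between the two policies shows up even with dd: `conv=fsync` makes dd behave like a write-through client by flushing everything before exit, while the default run returns as soon as the page cache absorbs the data. The scratch file path is illustrative:

```shell
# Write-back: dd exits once the data sits in the page cache
dd if=/dev/zero of=./wb_demo bs=4k count=1024 2>&1 | tail -n 1

# Write-through-like: conv=fsync flushes to storage before dd exits
dd if=/dev/zero of=./wb_demo bs=4k count=1024 conv=fsync 2>&1 | tail -n 1

rm ./wb_demo
```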
Quick Reference
Essential Commands
```shell
# Monitoring
iostat -x 1      # I/O stats
iotop            # per-process I/O
vmstat 1         # system stats
pidstat -d 1     # process I/O

# Configuration
cat /sys/block/*/queue/scheduler              # view scheduler
echo none > /sys/block/sda/queue/scheduler    # set scheduler
blockdev --setra 4096 /dev/sda                # set read-ahead

# Filesystem
mount -o noatime,nodiratime /dev/sda1 /mnt

# Health
smartctl -a /dev/sda
fstrim -v /

# Benchmark
fio --name=test --rw=randread --size=1G --runtime=30
```

Key Metrics
Section titled “Key Metrics”| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| %util | <50% | 50-80% | >80% |
| await | <10ms | 10-50ms | >50ms |
| wa (vmstat) | <10% | 10-30% | >30% |
| aqu-sz | <2 | 2-10 | >10 |
sysctl Parameters
```shell
vm.dirty_ratio                # 15-40%
vm.dirty_background_ratio     # 5-10%
vm.dirty_expire_centisecs     # 3000-5000
vm.dirty_writeback_centisecs  # 500-1000
vm.vfs_cache_pressure         # 50-100
vm.swappiness                 # 10-60
```

Summary
In this chapter, you learned:
- ✅ Linux I/O stack architecture and request flow
- ✅ Disk information and identification commands
- ✅ I/O monitoring with iostat, iotop, pidstat, vmstat
- ✅ I/O scheduler comparison and configuration
- ✅ Virtual memory and filesystem tuning parameters
- ✅ SSD and NVMe-specific optimizations
- ✅ Troubleshooting high I/O wait and bottlenecks
- ✅ Performance benchmarking with fio
- ✅ Production configuration examples
- ✅ Interview questions and answers
Next Chapter
Chapter 54: Network Performance
Last Updated: February 2026