Chapter 53: Disk I/O Performance
Overview
Disk I/O performance is critical for system responsiveness and application throughput. Understanding how Linux handles disk operations, from the physical device to the filesystem, is essential for troubleshooting performance issues and optimizing storage subsystems.
This chapter covers disk I/O performance monitoring, tuning, troubleshooting, and optimization techniques for both traditional spinning disks (HDD) and solid-state drives (SSD/NVMe) in production environments.
53.1 Linux I/O Stack Architecture
Understanding the I/O Path
The request path, top to bottom:

```
Application
  │
  ▼
System calls: open(), read(), write(), lseek(), fsync(), mmap()
  │
  ▼
Virtual File System (VFS): unified interface for filesystems;
  dentry cache, inode cache, page cache
  │
  ▼
Specific filesystem (ext4, xfs, btrfs, zfs, ntfs):
  metadata operations, data journaling/buffering, allocation algorithms
  │
  ▼
Block layer: I/O scheduler (mq-deadline, bfq, none);
  request merging, ordering, prioritization; multi-queue support (blk-mq)
  │
  ▼
Device drivers (SCSI, SATA, NVMe, IDE, MMC, virtio-blk):
  command queuing, error recovery
  │
  ▼
Physical device:
  HDD (7,200 RPM, ~100 IOPS) | SATA SSD (~100K IOPS, ~500 MB/s) | NVMe (PCIe Gen4, ~1M+ IOPS, ~7 GB/s)
```

I/O Request Flow
```
1. APPLICATION       read(fd, buffer, 4096)
2. VFS               page-cache hit → return cached data;
                     miss → generate a bio request
3. FILESYSTEM LAYER  translate file offset to block number,
                     handle journaling (if applicable), update metadata
4. BLOCK LAYER       I/O scheduler: merge adjacent requests,
                     sort by sector number, apply QoS policies, batch for throughput
5. DEVICE DRIVER     build device command, queue in hardware
                     (NCQ for SATA, submission queues for NVMe), submit to controller
6. PHYSICAL DEVICE   mechanical seek (HDD) / flash read (SSD), data transfer
7. COMPLETION        interrupt notifies the host, page cache is updated,
                     waiting process is woken
```

53.2 Disk Information and Identification
Identifying Block Devices
Section titled “Identifying Block Devices”# List all block devices with detailslsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL# ROTA=1 for spinning disks, ROTA=0 for SSD/NVMe
# Detailed block device informationlsblk -f
# Block device attributesblkid
# Detailed device informationls -la /sys/block/ls -la /sys/block/sda/
# Get device size in bytesblockdev --getsize64 /dev/sda
# Get device sector countblockdev --getsz /dev/sda
# Check device model and vendorcat /sys/block/sda/device/modelcat /sys/block/sda/device/vendorcat /sys/block/sda/device/revDisk Usage Analysis
```shell
# Filesystem usage (human-readable)
df -h

# Inode usage (important with many small files)
df -i

# Directory sizes (sorted)
du -sh /var/* 2>/dev/null | sort -hr
du -h --max-depth=1 / | sort -hr

# Largest files in a directory
find /var -type f -exec du -h {} + 2>/dev/null | sort -rh | head -20

# Disk usage by file type
du -sh /var/log/*.log
du -sh /var/*/tmp

# Real-time disk usage monitoring
watch -n 1 'df -h /'

# UUIDs and labels
blkid
ls -la /dev/disk/by-uuid/
ls -la /dev/disk/by-label/
```

Partition Information
```shell
# MBR partition table
fdisk -l /dev/sda

# GPT partition table (modern)
gdisk -l /dev/sda

# Parted (advanced)
parted -l

# Show partition UUIDs
blkid | grep /dev/sda

# Partition details from /proc
cat /proc/partitions

# Partition alignment check
parted /dev/sda align-check opt 1
```

53.3 I/O Monitoring Tools
iostat - I/O Statistics
```shell
# Extended statistics (most useful)
iostat -x 1

# Per-device statistics
iostat -x sda 1

# CPU and device statistics
iostat -c -x 1

# Show kilobytes (better for large I/O)
iostat -xk 1

# Report with timestamps
iostat -t -x 1

# Custom interval with count
iostat -x 5 3   # 3 reports, 5-second intervals
```

Key iostat Metrics Explained:
| Metric | Description | Healthy Value | Concern Value |
|---|---|---|---|
| r/s | Read requests completed/sec | Varies by workload | >1000 sustained |
| w/s | Write requests completed/sec | Varies by workload | >1000 sustained |
| rkB/s | Kilobytes read/sec | Depends on workload | N/A |
| wkB/s | Kilobytes written/sec | Depends on workload | N/A |
| await | Average I/O wait time (ms) | <10ms | >50ms |
| %util | Device utilization | <70% | >90% sustained |
| svctm | Average service time (ms); deprecated in recent sysstat | <10ms | >20ms |
| aqu-sz | Average queue length | <2 | >10 |
| rrqm/s | Read requests merged/sec | Moderate | Very high |
| wrqm/s | Write requests merged/sec | Moderate | Very high |
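The `%util` threshold above is easy to check programmatically. The sketch below filters an `iostat -x` style report for saturated devices with awk; the sample text is fabricated for illustration and stands in for real `iostat -dx 1 1` output.

```shell
# Flag devices whose %util (last column of iostat -x output) exceeds 90
sample='Device            r/s     w/s     await  %util
sda             120.00  300.00    45.10  95.40
sdb              10.00    5.00     1.20  12.00'

busy=$(echo "$sample" | awk 'NR>1 && $NF+0 > 90 {print $1}')
echo "$busy"   # → sda
```

The same awk filter works unchanged when piped from a live `iostat -dx 1 1` run.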
iotop - Per-Process I/O
```shell
# Interactive I/O monitoring (requires root)
iotop

# Show only processes doing active I/O
iotop -o

# Show accumulated I/O instead of rates
iotop -a

# Batch mode (useful for logging)
iotop -b -o -n 3   # 3 iterations

# Aggregate by process (not thread)
iotop -P

# Active processes only, aggregated
iotop -o -P

# Keyboard shortcuts in iotop:
# o - show only active I/O
# p - toggle processes/threads
# a - show accumulated I/O
# q - quit
```

pidstat - Process I/O Statistics
```shell
# I/O by process (1-second intervals)
pidstat -d 1

# I/O for a specific PID
pidstat -d -p 1234 1

# Include the full command line
pidstat -d -l 1

# Horizontal output, one line per task (easy to parse)
pidstat -d -h 1
```

vmstat - Virtual Memory Statistics
```shell
# All system statistics
vmstat 1

# Watch the bi (blocks in) and bo (blocks out) columns
vmstat 1

# Slab cache info
vmstat -m

# Display in MB units
vmstat -S m 1
```

Key vmstat columns for I/O:
| Column | Description | Concern Value |
|---|---|---|
| si | Swap in (from disk) | >0 sustained |
| so | Swap out (to disk) | >0 sustained |
| bi | Blocks in (disk read) | High |
| bo | Blocks out (disk write) | High |
| wa | I/O wait CPU | >20% |
| cs | Context switches | Very high |
| us | User CPU time | >80% |
sar - System Activity Reporter
```shell
# Disk I/O statistics
sar -d 1

# I/O and transfer rates
sar -b 1

# All activities
sar -A 1

# Historical data from logs
sar -d -f /var/log/sa/sa01

# Report format
sar -d 1 1 | head -20
```

Advanced Monitoring Scripts
```shell
#!/bin/bash
# monitor_io.sh - Continuous I/O monitoring

while true; do
    clear
    echo "=== Disk I/O Statistics ==="
    date

    echo -e "\n--- Device Utilization ---"
    iostat -x 1 1 | tail -n +7

    echo -e "\n--- Top I/O Processes ---"
    if command -v iotop &> /dev/null; then
        iotop -b -o -n 1 | head -10
    else
        ps aux | sort -k6 -r | head -10
    fi

    echo -e "\n--- I/O Wait ---"
    vmstat 1 2 | tail -1

    sleep 5
done
```

53.4 I/O Scheduler Deep Dive
Available Schedulers
| Scheduler | Best For |
|---|---|
| mq-deadline | Databases, VM storage, mixed workloads; low latency, prevents starvation |
| bfq | Desktop, interactive, multimedia; fair queuing, low latency |
| none | SSD, NVMe, direct I/O; no seek penalty, pure throughput |
| cfq (legacy) | Older systems, spinning disks; deprecated in favor of mq-deadline |

NOTE: Modern kernels (4.x+) use blk-mq (block multi-queue), which provides better performance for modern devices.

Scheduler Configuration
```shell
# View current scheduler (per device)
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Example output (active scheduler in brackets; the list varies by kernel):
# [none] mq-deadline bfq

# Temporarily change scheduler (runtime)
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler

# For SSD/NVMe - use 'none'
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

# Verify the change
cat /sys/block/sda/queue/scheduler
```

Persistent Scheduler Configuration
Method 1: udev Rules
```
# For SSD/NVMe - no scheduler needed
# (match on queue/rotational to tell SATA SSDs and HDDs apart)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="nvme[0-9]n1", ATTR{queue/scheduler}="none"

# For HDD - use mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
```

```shell
# Reload udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger
```

Method 2: Kernel Boot Parameters
```shell
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=mq-deadline"
# Note: elevator= is ignored by blk-mq kernels (5.0+); prefer udev rules there

# Apply
sudo update-grub
```

Method 3: systemd service
```
[Unit]
Description=Set I/O Scheduler
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo mq-deadline > /sys/block/sda/queue/scheduler'
ExecStart=/bin/sh -c 'echo none > /sys/block/nvme0n1/queue/scheduler'

[Install]
WantedBy=multi-user.target
```

Scheduler-Specific Tuning
```shell
# mq-deadline tuning (defaults are usually fine)
ls /sys/block/sda/queue/iosched/
# read_expire (ms), write_expire (ms), fifo_batch, front_merges

# Example: lower latency targets
echo 100 > /sys/block/sda/queue/iosched/write_expire
echo 50 > /sys/block/sda/queue/iosched/read_expire

# bfq tuning
ls /sys/block/sda/queue/iosched/
# slice_idle, low_latency, max_budget, etc.

# Disable front merges (rarely helps; test before using)
echo 0 > /sys/block/sda/queue/iosched/front_merges
```

53.5 I/O Tuning Parameters
Virtual Memory and Dirty Pages
```shell
# /etc/sysctl.conf - disk I/O related parameters

# Dirty page ratios
vm.dirty_ratio = 15              # % of RAM at which writing processes block on writeback
vm.dirty_background_ratio = 5    # % of RAM at which background writeback starts

# More aggressive for database servers
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000     # when dirty pages are "old" enough to flush
vm.dirty_writeback_centisecs = 500   # writeback daemon wakeup interval

# Page cache tuning
vm.vfs_cache_pressure = 100      # higher = more aggressive dentry/inode reclaim

# Memory management
vm.min_free_kbytes = 65536       # reserve for emergency allocations
```
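To see how close the system currently is to these thresholds, compare the live dirty-page counters against the configured ratios. This is a read-only check and needs no root:

```shell
# Current amount of dirty (not yet written back) page-cache data
grep -E '^(Dirty|Writeback):' /proc/meminfo

# The thresholds currently in effect
cat /proc/sys/vm/dirty_ratio
cat /proc/sys/vm/dirty_background_ratio
```

A steadily growing `Dirty:` figure that approaches `dirty_ratio` percent of RAM means writers are about to start blocking.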
```shell
# Apply changes
sudo sysctl -p
```

Block Device Parameters
```shell
# Read-ahead (default usually 128KB/256KB)
blockdev --getra /dev/sda
# Returns 512-byte sectors, so 256 = 128KB

# Set read-ahead
blockdev --setra 4096 /dev/sda   # 2MB, for sequential reads
```
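Since `--getra`/`--setra` work in 512-byte sectors, converting to bytes is a quick sanity check. The arithmetic below mirrors the 4096-sector example:

```shell
# blockdev read-ahead values are in 512-byte sectors
ra_sectors=4096
ra_kb=$(( ra_sectors * 512 / 1024 ))
echo "${ra_kb} KB"   # → 2048 KB (2MB)
```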
```shell
# Permanent setting via udev
# /etc/udev/rules.d/60-readahead.rules
ACTION=="add|change", KERNEL=="sda", ATTR{bdi/read_ahead_kb}="4096"

# Queue depth
cat /sys/block/sda/queue/nr_requests
# Default 128; can be increased for high-IOPS devices

# Request size
cat /sys/block/sda/queue/max_sectors_kb
# Max request size in KB

# Enable/disable features
echo 0 > /sys/block/sda/queue/add_random   # stop contributing to the entropy pool
echo 0 > /sys/block/sda/queue/iostats      # disable I/O stats collection
```

Filesystem Mount Options
```shell
# /etc/fstab examples

# Optimized ext4 for SSD
/dev/sda1  /         ext4  defaults,noatime,nodiratime,errors=remount-ro  0 1

# Optimized for data
/dev/sdb1  /data     ext4  defaults,noatime,nodiratime,data=writeback     0 2

# XFS for large files
/dev/sdc1  /var/log  xfs   defaults,noatime,nodiratime,logbufs=8          0 2

# Common options:
# noatime        - don't update access times (major win; implies nodiratime)
# nodiratime     - don't update directory access times
# relatime       - only update atime if older than mtime/ctime
# data=journal   - full journaling (slower but safer)
# data=writeback - faster, less safe
# barrier=0      - disable write barriers (dangerous on power loss)
# nobarrier      - same as barrier=0
# commit=60      - sync every 60 seconds (default 5)
```

Kernel Parameters Reference
```shell
# All I/O related sysctls
sysctl -a | grep -E "vm\.(dirty|writeback|vfs_cache)"

# Important parameters
vm.dirty_ratio                 # max dirty % before synchronous writeback
vm.dirty_background_ratio      # start background writeback
vm.dirty_expire_centisecs      # when dirty data is "old"
vm.dirty_writeback_centisecs   # writeback daemon interval
vm.dirtytime_expire_seconds    # lazytime dirty-inode expiry
vm.vfs_cache_pressure          # dentry/inode cache reclaim tendency
vm.min_free_kbytes             # reserved memory
vm.overcommit_memory           # memory overcommit policy
vm.overcommit_ratio            # overcommit percentage
```

53.6 SSD and NVMe Optimization
SSD-Specific Tuning
SSD optimization checklist:

- Use the 'none' I/O scheduler
- Enable TRIM support
- Disable access-time updates (noatime)
- Align partitions to 1MB (2048 × 512-byte sectors)
- Enable the discard mount option (or use fstrim.timer)
- Relax journaling if safe (data=writeback)
- Use ext4 or xfs (mature, well-tested)
- Disable swap if RAM is sufficient
- Increase read-ahead for sequential workloads

Enabling TRIM
```shell
# Check if device supports TRIM
hdparm -I /dev/sda | grep -i trim

# Or for NVMe (Deallocate/DSM support is reported in the ONCS field)
nvme id-ctrl /dev/nvme0 | grep -i oncs

# Manual TRIM
sudo fstrim -v /

# Enable weekly TRIM (recommended)
sudo systemctl enable fstrim.timer
sudo systemctl start fstrim.timer

# Check timer status
systemctl status fstrim.timer

# discard mount option (TRIM on delete)
# /etc/fstab
/dev/sda1  /  ext4  defaults,discard  0 1
# Note: discard can hurt performance on some SSDs; fstrim.timer is usually better
```

Partition Alignment
```shell
# Check alignment (should report "aligned")
parted /dev/sda
(parted) align-check optimal 1

# Create an aligned partition
parted /dev/sda
(parted) mklabel gpt
(parted) mkpart primary ext4 1MiB -1
# Starting at 1MiB ensures alignment
```
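The 1MiB rule can be checked by hand: a partition is aligned when its start sector times the 512-byte sector size is an exact multiple of 1MiB. A quick sketch:

```shell
# A start sector is 1-MiB aligned when start * 512 is divisible by 1048576
start=2048
if [ $(( start * 512 % 1048576 )) -eq 0 ]; then
    result=aligned
else
    result=misaligned
fi
echo "$result"   # → aligned (2048 * 512 bytes = exactly 1 MiB)
```

Read the start sector for a real partition from `/sys/block/sda/sda1/start` and substitute it for `start`.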
```shell
# Or use sfdisk
sfdisk /dev/sda << 'EOF'
label: gpt
device: /dev/sda
unit: sectors

/dev/sda1 : start=2048, size=524288, type=0FC63DAF-8483-4772-8E79-3D69D8477DF4
EOF
# start=2048 × 512-byte sectors = 1MiB alignment
```

NVMe Optimization
```shell
# NVMe-specific settings
# Most are already optimized by default

# Check NVMe info
nvme list
nvme id-ctrl /dev/nvme0

# NVMe power state
nvme get-feature /dev/nvme0 -f 2   # feature 2 = power management

# Set I/O queue depth (if needed)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# Confirm the device is reported as non-rotational
cat /sys/block/nvme0n1/queue/rotational
# Should be 0

# PCIe settings
lspci -vv | grep -A10 NVMe
```

53.7 Troubleshooting I/O Issues
Diagnosing High I/O Wait
```shell
#!/bin/bash
echo "=== High I/O Wait Diagnosis ==="
echo ""

echo "--- vmstat output (check 'wa' column) ---"
vmstat 1 5

echo -e "\n--- iostat (check '%util' and 'await') ---"
iostat -x 1 5

echo -e "\n--- Top I/O consuming processes ---"
if command -v iotop &> /dev/null; then
    iotop -b -n 3 -o | head -20
else
    ps aux | sort -k6 -r | head -10
fi

echo -e "\n--- I/O by filesystem ---"
df -h
mount | column -t

echo -e "\n--- Check for I/O errors ---"
dmesg | grep -i "i/o error" | tail -20
dmesg | grep -i "sd[a-z]" | tail -20

echo -e "\n--- Check for filesystem issues ---"
cat /proc/mounts
```

Common I/O Issues and Solutions
Section titled “Common I/O Issues and Solutions”| Issue | Symptoms | Diagnosis | Solution |
|---|---|---|---|
| High I/O wait | wa > 20% in vmstat | iostat, iotop | Add I/O capacity, optimize workloads |
| High await | await > 50ms | iostat -x | Check disk health, reduce queue depth |
| Disk bottleneck | %util ~100% | iostat | Upgrade storage, add cache |
| Fragmentation | Slow sequential reads | fsck, iostat | Defragment, balance |
| Full filesystem | ENOSPC errors | df -i, df -h | Add space, clean up |
| Inode exhaustion | ENOSPC but space available | df -i | Clean small files |
| LVM issues | Slow I/O | lvs, pvs | Check thin pool, cache |
| NFS issues | High latency | nfsstat, iostat | Tune NFS options |
Finding I/O Heavy Processes
```shell
# Method 1: iotop (requires root)
sudo iotop -o -P

# Method 2: pidstat
pidstat -d 1 | grep -v "^$"

# Method 3: from /proc
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/io ]; then
        echo "PID $pid: $(grep -E 'read_bytes|write_bytes' /proc/$pid/io 2>/dev/null)"
    fi
done | sort -k2 -r | head -10

# Method 4: sar
sar -d 1 | tail -20
```

Disk Health Monitoring
```shell
# SMART status
sudo smartctl -a /dev/sda
sudo smartctl -H /dev/sda

# Short SMART test
sudo smartctl -t short /dev/sda

# Long SMART test
sudo smartctl -t long /dev/sda

# Check SMART results
sudo smartctl -l selftest /dev/sda

# NVMe health
sudo nvme smart-log /dev/nvme0
sudo nvme error-log /dev/nvme0
```

Checking Filesystem Health
```shell
# Check filesystem (unmount first!)
sudo umount /dev/sda1
sudo fsck -f /dev/sda1

# Read-only check (-n makes no changes)
sudo fsck -n /dev/sda1

# xfs_repair
sudo xfs_repair -n /dev/sda1   # check only
sudo xfs_repair /dev/sda1

# btrfs check
sudo btrfs check --readonly /dev/sda1
```

53.8 Performance Benchmarking
fio - Flexible I/O Tester
```shell
# Install fio
sudo apt-get install fio   # Debian/Ubuntu
sudo yum install fio       # RHEL/CentOS

# Sequential read test
fio --name=seqread --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=read --size=1G --runtime=60 --time_based --group_reporting

# Sequential write test
fio --name=seqwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=write --size=1G --runtime=60 --time_based --group_reporting

# Random read test (4K)
fio --name=randread --ioengine=libaio --direct=1 --bs=4k --iodepth=64 \
    --numjobs=4 --rw=randread --size=1G --runtime=60 --time_based --group_reporting

# Random write test (4K)
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=64 \
    --numjobs=4 --rw=randwrite --size=1G --runtime=60 --time_based --group_reporting

# Mixed workload (70% read, 30% write)
fio --name=mixed --ioengine=libaio --direct=1 --bs=4k --iodepth=32 \
    --numjobs=4 --rw=randrw --rwmixread=70 --size=1G --runtime=60 --time_based --group_reporting

# Quick benchmark
fio --name=quick --ioengine=sync --direct=0 --bs=4k --numjobs=1 \
    --rw=randread --size=100M --runtime=30 --time_based
```

hdparm - Simple Benchmark
```shell
# Sequential read speed
sudo hdparm -t /dev/sda
sudo hdparm -tT /dev/sda   # buffered and cached reads

# Multiple runs
for i in 1 2 3; do sudo hdparm -t /dev/sda; done
```

Bonnie++ - Filesystem Benchmark
```shell
# Install
sudo apt-get install bonnie++

# Run benchmark
bonnie++ -u root -d /tmp -s 2048 -n 100

# HTML report
bonnie++ -u root -d /tmp -s 4096 -f -b -n 10 | bon_csv2html > report.html
```

dd - Simple Test
```shell
# Sequential write test
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct

# Sequential read test
dd if=/tmp/testfile of=/dev/null bs=1M count=1024 iflag=direct

# With progress reporting
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct status=progress

# Note: dd is not a rigorous benchmark; use fio for proper results
```

53.9 Production Configuration Examples
Database Server (PostgreSQL/MySQL)
```shell
# /etc/sysctl.conf additions for database servers
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 5000
vm.dirty_writeback_centisecs = 500
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 95

# Apply
sudo sysctl -p

# /etc/fstab for the database partition
# (barrier=0 trades crash safety for speed; only use with battery-backed cache)
/dev/sdb1  /var/lib/postgresql  ext4  defaults,noatime,nodiratime,data=writeback,barrier=0  0 2

# I/O scheduler (none for SSD)
echo none > /sys/block/sda/queue/scheduler

# Queue settings
echo 256 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
```

Web Server / Cache Server
```shell
# /etc/sysctl.conf for a web server
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.vfs_cache_pressure = 50

# Allow more open files
fs.file-max = 65536

# Apply
sudo sysctl -p

# /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535

# Mount options for the webroot
/dev/sda1  /var/www  ext4  defaults,noatime,nodiratime,data=writeback  0 2
```

High-Performance File Server
Section titled “High-Performance File Server”vm.dirty_ratio = 60vm.dirty_background_ratio = 20vm.dirty_expire_centisecs = 10000vm.dirty_writeback_centisecs = 1000
# Applysudo sysctl -p
# Mount options# Use XFS for large files/dev/sdc1 /exports/shares xfs defaults,noatime,nodiratime,logbufs=8,logdev=/dev/sdd1 0 2
# I/O schedulerecho mq-deadline > /sys/block/sda/queue/scheduler53.10 Interview Questions
Q1: How do you diagnose high I/O wait on a Linux system?

Answer:
- Use `vmstat 1` to check the `wa` (I/O wait) column; values above 20% indicate an I/O bottleneck
- Use `iostat -x 1` to identify which devices are saturated; look for `%util` near 100% and high `await` times
- Use `iotop` to find the processes generating the most I/O
- Check whether it is a capacity issue with `df -h` and `df -i`
- Verify disk health with `smartctl -a /dev/sda`
- Check for I/O errors with `dmesg | grep -i error`
Q2: What is the difference between mq-deadline, bfq, and none I/O schedulers?
Answer:
- mq-deadline: Best for most workloads including databases and VMs. Implements deadline-based scheduling to prevent request starvation, merges adjacent requests, sorts by sector for disk efficiency
- bfq (Budget Fair Queuing): Best for desktop/interactive workloads. Provides fair queuing to prevent a single process from monopolizing I/O, better latency for interactive applications
- none: Best for SSD/NVMe devices that have no seek penalty. Bypasses scheduler overhead entirely, lets the device’s internal queue handle optimization
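A quick way to compare schedulers across all devices at once is to walk the sysfs queue entries (the bracketed name is the active scheduler). Device paths vary by system, so the loop below tolerates missing or unreadable files:

```shell
# Collect the scheduler line for every block-device queue
out=$(for q in /sys/block/*/queue/scheduler; do
    [ -r "$q" ] || continue
    printf '%s: %s\n' "${q%/queue/scheduler}" "$(cat "$q")"
done)
echo "$out"
```

On a mixed machine you would expect HDDs to show `[mq-deadline]` and NVMe devices `[none]` after applying the udev rules from section 53.4.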
Q3: How do you optimize an SSD for better performance?
Section titled “Q3: How do you optimize an SSD for better performance?”Answer:
- Use
noneI/O scheduler (no seek penalty) - Enable TRIM via
fstrim.timeror mount optiondiscard - Use
noatime,nodiratimemount options to avoid unnecessary writes - Ensure partitions are aligned to 1MB boundaries
- Disable or reduce swap if sufficient RAM available
- Use modern filesystems like ext4 or xfs with appropriate options
- Increase read-ahead for sequential workloads:
blockdev --setra 4096 /dev/sda
Q4: What is the difference between synchronous and asynchronous I/O?
Answer:
- Synchronous I/O: the process blocks until the I/O completes. Simpler programming model; data has at least reached the kernel before the call returns (and the disk, with O_SYNC). Default for most applications; can limit throughput.
- Asynchronous I/O: the process continues while I/O proceeds in the background. Higher throughput for concurrent operations, but a more complex programming model. Examples: `io_submit()`, `aio_read()`, `io_uring` (modern)

Applications can use O_SYNC for synchronous writes, or O_DIRECT for direct I/O that bypasses the page cache.
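The cost of synchronous writes is easy to see with dd: a default run returns once the data sits in the page cache, while `oflag=sync` forces each block to storage before the call returns. A small sketch writing a scratch file in the current directory (the path is illustrative):

```shell
# Buffered write: dd returns once the data is in the page cache
dd if=/dev/zero of=./aio_demo bs=4k count=256 2>&1 | tail -n 1

# Synchronous write: each 4k block is flushed before the next (O_SYNC)
dd if=/dev/zero of=./aio_demo bs=4k count=256 oflag=sync 2>&1 | tail -n 1

rm ./aio_demo
```

On most disks the second run reports dramatically lower throughput, which is why databases batch and fsync rather than open everything O_SYNC.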
Q5: How does the Linux page cache work and how does it affect I/O performance?
Answer:
- The page cache stores recently accessed disk data in RAM
- Reads are served from cache if available (very fast)
- Writes go to page cache first (write-back), then flushed to disk
- This dramatically improves performance for repeated reads
- The `dirty_ratio` and `dirty_background_ratio` sysctls control when data is written
- `vm.vfs_cache_pressure` controls how aggressively the dentry/inode caches are reclaimed
- The cache can be bypassed with the `O_DIRECT` flag for applications needing explicit control
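The cache effect can be observed directly: read the same file once with `O_DIRECT` (bypassing the cache, full stack traversal) and once normally (served from the page cache). File name and size are illustrative; `iflag=direct` errors out on filesystems without O_DIRECT support (e.g. tmpfs):

```shell
# Create a 64 MB test file and flush it to stable storage
dd if=/dev/zero of=./pagecache_demo bs=1M count=64 conv=fsync 2>/dev/null

# Direct read: O_DIRECT bypasses the page cache (full stack traversal)
dd if=./pagecache_demo of=/dev/null bs=1M iflag=direct 2>&1 | tail -n 1

# Cached read: usually far faster, served straight from the page cache
dd if=./pagecache_demo of=/dev/null bs=1M 2>&1 | tail -n 1

rm ./pagecache_demo
```

The throughput gap between the two reads is the page cache at work.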
Q6: What is the relationship between ionice and I/O scheduling?
Answer:
ionice sets the I/O scheduling class and priority for processes:
- Real-time (class 1): Highest priority, can starve other processes
- Best-effort (class 2): Default, priority 0-7
- Idle (class 3): Only runs when disk is idle
Examples:
```shell
# Run rsync at idle priority
ionice -c 3 -p $(pgrep rsync)

# Run database at real-time
ionice -c 1 -n 0 -p $(pgrep mysql)

# Run with best-effort low priority
ionice -c 2 -n 7 -p $(pgrep tar)
```

Q7: How do you identify and resolve I/O bottlenecks in a production environment?
Answer:
- Identify: use `iostat -x` to find devices with high `%util` and `await`
- Analyze: use `iotop` to find culprit processes
- Check: verify disk health, capacity, and fragmentation
- Optimize:
- Add cache (L2ARC, bcache, dm-cache)
- Use faster storage (SSD, NVMe)
- Tune I/O scheduler
- Adjust dirty page parameters
- Use appropriate filesystem options
- Consider load distribution (RAID, distributed storage)
- Monitor: Set up alerting for I/O metrics
Q8: Explain the difference between write-through and write-back caching.
Answer:
- Write-through: Data is written to cache AND to storage simultaneously. Slower writes but never loses data on power failure. Used when data integrity is critical.
- Write-back: Data is written to cache first, then asynchronously flushed to storage. Much faster writes but risk of data loss on crash. This is the default in Linux.
In Linux, ext4 enables write barriers by default (`barrier=1`), preserving journal write ordering so write-back caching stays crash-safe; `barrier=0`/`nobarrier` removes that protection for speed. Applications needing write-through semantics can open files with `O_SYNC` or call `fsync()` explicitly.
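The cost difference between the two policies shows up even with dd: `conv=fsync` makes dd behave like a write-through client by flushing everything before exit, while the default run returns as soon as the page cache absorbs the data. The scratch file path is illustrative:

```shell
# Write-back: dd exits once the data sits in the page cache
dd if=/dev/zero of=./wb_demo bs=4k count=1024 2>&1 | tail -n 1

# Write-through-like: conv=fsync flushes to storage before dd exits
dd if=/dev/zero of=./wb_demo bs=4k count=1024 conv=fsync 2>&1 | tail -n 1

rm ./wb_demo
```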
Quick Reference
Essential Commands
```shell
# Monitoring
iostat -x 1      # I/O stats
iotop            # per-process I/O
vmstat 1         # system stats
pidstat -d 1     # process I/O

# Configuration
cat /sys/block/*/queue/scheduler              # view scheduler
echo none > /sys/block/sda/queue/scheduler    # set scheduler
blockdev --setra 4096 /dev/sda                # set read-ahead

# Filesystem
mount -o noatime,nodiratime /dev/sda1 /mnt

# Health
smartctl -a /dev/sda
fstrim -v /

# Benchmark
fio --name=test --rw=randread --size=1G --runtime=30
```

Key Metrics
Section titled “Key Metrics”| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| %util | <50% | 50-80% | >80% |
| await | <10ms | 10-50ms | >50ms |
| wa (vmstat) | <10% | 10-30% | >30% |
| aqu-sz | <2 | 2-10 | >10 |
sysctl Parameters
```shell
vm.dirty_ratio                # 15-40%
vm.dirty_background_ratio     # 5-10%
vm.dirty_expire_centisecs     # 3000-5000
vm.dirty_writeback_centisecs  # 500-1000
vm.vfs_cache_pressure         # 50-100
vm.swappiness                 # 10-60
```

Summary
In this chapter, you learned:
- ✅ Linux I/O stack architecture and request flow
- ✅ Disk information and identification commands
- ✅ I/O monitoring with iostat, iotop, pidstat, vmstat
- ✅ I/O scheduler comparison and configuration
- ✅ Virtual memory and filesystem tuning parameters
- ✅ SSD and NVMe-specific optimizations
- ✅ Troubleshooting high I/O wait and bottlenecks
- ✅ Performance benchmarking with fio
- ✅ Production configuration examples
- ✅ Interview questions and answers
Next Chapter
Chapter 54: Network Performance
Last Updated: February 2026