
Disk I/O Performance

Disk I/O performance is critical for system responsiveness and application throughput. Understanding how Linux handles disk operations, from the physical device to the filesystem, is essential for troubleshooting performance issues and optimizing storage subsystems.

This chapter covers disk I/O performance monitoring, tuning, troubleshooting, and optimization techniques for both traditional spinning disks (HDD) and solid-state drives (SSD/NVMe) in production environments.


┌─────────────────────────────────────────────────────────────────────────┐
│ LINUX I/O STACK │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Application │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SYSTEM CALLS │ │
│ │ open(), read(), write(), lseek(), fsync(), mmap() │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ VIRTUAL FILE SYSTEM (VFS) │ │
│ │ Provides unified interface for filesystems │ │
│ │ dentry cache, inode cache, page cache │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ SPECIFIC FILESYSTEM │ │
│ │ ext4, xfs, btrfs, zfs, ntfs │ │
│ │ - Metadata operations │ │
│ │ - Data journaling/buffering │ │
│ │ - Allocation algorithms │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ BLOCK LAYER │ │
│ │ I/O Scheduler (mq-deadline, bfq, none) │ │
│ │ Request merging, ordering, prioritization │ │
│ │ Multi-queue support (blk-mq) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ DEVICE DRIVERS │ │
│ │ SCSI, SATA, NVMe, IDE, MMC, virtio-blk │ │
│ │ - Command queuing │ │
│ │ - Error recovery │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ PHYSICAL DEVICE │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ HDD │ │ SSD │ │ NVMe │ │ │
│ │ │ (spinning) │ │ (SATA) │ │ (PCIe Gen4) │ │ │
│ │ │ 7,200 RPM │ │ ~100K IOPS │ │ ~1M+ IOPS │ │ │
│ │ │ ~100 IOPS │ │ ~550 MB/s │ │ ~7 GB/s │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ I/O REQUEST LIFECYCLE │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. APPLICATION │
│ read(fd, buffer, 4096) │
│ │ │
│ ▼ │
│ 2. VFS - Check Page Cache │
│ ┌─────────────────────────────────────┐ │
│ │ Page Cache Hit? │ │
│ │ YES ──► Return cached data │ │
│ │ NO ──► Generate bio request │ │
│ └─────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 3. FILESYSTEM LAYER │
│ - Translate file offset to block number │
│ - Handle journaling (if applicable) │
│ - Update metadata │
│ │ │
│ ▼ │
│ 4. BLOCK LAYER │
│ ┌──────────────────────────────────────────┐ │
│ │ I/O Scheduler │ │
│ │ - Merge adjacent requests │ │
│ │ - Sort by sector number │ │
│ │ - Apply QoS policies │ │
│ │ - Batch for throughput │ │
│ └──────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ 5. DEVICE DRIVER │
│ - Build device command │
│ - Queue in hardware (NCQ for SATA, submission queues for NVMe) │
│ - Submit to controller │
│ │ │
│ ▼ │
│ 6. PHYSICAL DEVICE │
│ - Mechanical seek (HDD) / Flash read (SSD) │
│ - Data transfer │
│ │ │
│ ▼ │
│ 7. COMPLETION │
│ - Interrupt notification │
│ - Update page cache │
│ - Wake waiting process │
│ │
└────────────────────────────────────────────────────────────────────────┘
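The page-cache branch in step 2 is easy to observe from a shell: a re-read of a file that is already cached completes without new device reads. The sketch below (assuming a Linux /proc filesystem; the temp file comes from mktemp) compares the system-wide sectors-read counter in /proc/diskstats before and after a cached re-read.

```shell
# Sketch: a cached re-read causes (almost) no device reads.
# Field 6 of /proc/diskstats is "sectors read" per device.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=16 status=none   # write; data lands in the page cache
cat "$f" > /dev/null                                 # first read, warms/confirms the cache
before=$(awk '{ s += $6 } END { print s }' /proc/diskstats)
cat "$f" > /dev/null                                 # re-read: served from the page cache
after=$(awk '{ s += $6 } END { print s }' /proc/diskstats)
echo "sectors read from disk during cached re-read: $(( after - before ))"
rm -f "$f"
```

Other processes' I/O can nudge the counter, so expect a small number rather than exactly zero on a busy machine.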

# List all block devices with details
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,ROTA,MODEL
# ROTA=1 for spinning disks, ROTA=0 for SSD/NVMe
# Detailed block device information
lsblk -f
# Block device attributes
blkid
# Detailed device information
ls -la /sys/block/
ls -la /sys/block/sda/
# Get device size in bytes
blockdev --getsize64 /dev/sda
# Get device sector count
blockdev --getsz /dev/sda
# Check device model and vendor
cat /sys/block/sda/device/model
cat /sys/block/sda/device/vendor
cat /sys/block/sda/device/rev
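The per-device attributes above can be combined into a short loop that classifies every block device from the sysfs rotational flag alone; a sketch (device names will vary per machine):

```shell
# Classify each block device as HDD or SSD/NVMe from its rotational flag
for d in /sys/block/*/queue/rotational; do
    [ -e "$d" ] || continue
    dev=${d#/sys/block/}; dev=${dev%%/*}   # extract the device name from the path
    if [ "$(cat "$d")" = "1" ]; then
        echo "$dev: HDD (rotational)"
    else
        echo "$dev: SSD/NVMe (non-rotational)"
    fi
done
```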
# Filesystem usage (human-readable)
df -h
# Inode usage (important for many small files)
df -i
# Directory sizes (sorted)
du -sh /var/* 2>/dev/null | sort -hr
du -h --max-depth=1 / | sort -hr
# Largest files in directory
find /var -type f -exec du -h {} + 2>/dev/null | sort -rh | head -20
# Disk usage by file type
du -sh /var/log/*.log
du -sh /var/*/tmp
# Real-time disk usage monitoring
watch -n 1 'df -h /'
# UUIDs and labels
blkid
ls -la /dev/disk/by-uuid/
ls -la /dev/disk/by-label/
# MBR partition table
fdisk -l /dev/sda
# GPT partition table (modern)
gdisk -l /dev/sda
# Parted (advanced)
parted -l
# Show partition UUIDs
blkid | grep /dev/sda
# Partition details from /proc
cat /proc/partitions
# Partition alignment check
parted /dev/sda align-check opt 1

# Extended statistics (most useful)
iostat -x 1
# Per-device statistics
iostat -x sda 1
# CPU and device statistics
iostat -c -x 1
# Show kilobytes (better for large I/O)
iostat -xk 1
# Report utilization with timestamps
iostat -t -x 1
# Custom interval with count
iostat -x 5 3 # 3 reports, 5-second intervals

Key iostat Metrics Explained:

| Metric | Description | Healthy Value | Concern Value |
|--------|-------------|---------------|---------------|
| r/s | Read requests completed/sec | Varies by workload | >1000 sustained |
| w/s | Write requests completed/sec | Varies by workload | >1000 sustained |
| rkB/s | Kilobytes read/sec | Depends on workload | N/A |
| wkB/s | Kilobytes written/sec | Depends on workload | N/A |
| await | Average I/O wait time (ms) | <10ms | >50ms |
| %util | Device utilization | <70% | >90% sustained |
| svctm | Average service time, ms (deprecated; removed in recent sysstat) | <10ms | >20ms |
| aqu-sz | Average queue length | <2 | >10 |
| rrqm/s | Read requests merged/sec | Moderate | Very high |
| wrqm/s | Write requests merged/sec | Moderate | Very high |

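iostat derives these figures from /proc/diskstats, so the raw counters are a useful fallback when sysstat is not installed. A sketch (the device-name pattern is an assumption; adjust it for your naming scheme):

```shell
# Raw counters behind iostat: completed reads (field 4) and writes (field 8)
awk '$3 ~ /^(sd[a-z]+|vd[a-z]+|nvme[0-9]+n[0-9]+)$/ {
    printf "%-12s %12s reads %12s writes\n", $3, $4, $8
}' /proc/diskstats
```

These are cumulative since boot; iostat computes per-second rates by sampling them twice and dividing by the interval.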
# Interactive I/O monitoring (requires root)
iotop
# Show only active I/O processes
iotop -o
# Show accumulated I/O instead of rate
iotop -a
# Batch mode (useful for logging)
iotop -b -o -n 3 # 3 iterations
# Aggregate per process instead of per thread
iotop -P
# Active processes only, aggregated per process
iotop -o -P
# Keyboard shortcuts in iotop:
# o - show only active I/O
# p - show processes
# a - show accumulated I/O
# q - quit
# I/O by process (1-second intervals)
pidstat -d 1
# I/O by PID
pidstat -d -p 1234 1
# Include the full command line of each task
pidstat -d -l 1
# One line per sample (easier to parse or log)
pidstat -d -h 1
# All system statistics (bi=blocks in, bo=blocks out)
vmstat 1
# Slab cache statistics
vmstat -m
# Report memory figures in megabytes
vmstat -S m 1

Key vmstat columns for I/O:

| Column | Description | Concern Value |
|--------|-------------|---------------|
| si | Swap in (from disk) | >0 sustained |
| so | Swap out (to disk) | >0 sustained |
| bi | Blocks received from a block device (reads) | High |
| bo | Blocks sent to a block device (writes) | High |
| wa | CPU time spent waiting for I/O | >20% |
| cs | Context switches | Very high |
| us | User CPU time | >80%, especially alongside high wa |
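The wa column is computed from the same counters exposed in /proc/stat, where field 6 of the aggregate cpu line is cumulative iowait ticks; a minimal sketch:

```shell
# Cumulative iowait ticks since boot (field 6 of the "cpu" line)
awk '/^cpu / { print "iowait ticks since boot:", $6 }' /proc/stat
```

vmstat samples this counter twice and reports the delta as a percentage of total CPU time.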
# Disk I/O statistics
sar -d 1
# I/O and transfer rate statistics
sar -b 1
# All activities
sar -A 1
# Historical from logs
sar -d -f /var/log/sa/sa01
# Report format
sar -d 1 1 | head -20
#!/bin/bash
# monitor_io.sh - Continuous I/O monitoring
while true; do
    clear
    echo "=== Disk I/O Statistics ==="
    date
    echo -e "\n--- Device Utilization ---"
    iostat -x 1 1 | tail -n +7
    echo -e "\n--- Top I/O Processes ---"
    if command -v iotop &> /dev/null; then
        iotop -b -o -n 1 | head -10
    else
        # Fallback: ps cannot show I/O, so list by memory as a rough proxy
        ps aux | sort -k6 -rn | head -10
    fi
    echo -e "\n--- I/O Wait ---"
    vmstat 1 2 | tail -1
    sleep 5
done

┌────────────────────────────────────────────────────────────────────────┐
│ I/O SCHEDULER COMPARISON │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┬────────────────────────────────────────────────┐ │
│ │ Scheduler │ Best For │ │
│ ├─────────────────┼────────────────────────────────────────────────┤ │
│ │ mq-deadline │ Database, VM storage, mixed workloads │ │
│ │ │ Low latency, prevents starvation │ │
│ ├─────────────────┼────────────────────────────────────────────────┤ │
│ │ bfq │ Desktop, interactive, multimedia │ │
│ │ │ Fair queuing, low latency │ │
│ ├─────────────────┼────────────────────────────────────────────────┤ │
│ │ none │ SSD, NVMe, direct I/O │ │
│ │ │ No seek penalty, pure throughput │ │
│ ├─────────────────┼────────────────────────────────────────────────┤ │
│ │ cfq (legacy) │ Older systems, spinning disks │ │
│ │ │ Deprecated in favor of mq-deadline │ │
│ └─────────────────┴────────────────────────────────────────────────┘ │
│ │
│ NOTE: On modern kernels (4.x+), most use blk-mq (block multi-queue) │
│ which provides better performance for modern devices │
│ │
└────────────────────────────────────────────────────────────────────────┘
# View current scheduler (for each queue)
cat /sys/block/sda/queue/scheduler
cat /sys/block/nvme0n1/queue/scheduler
# Available schedulers are listed with the active one in brackets, e.g.:
# Output: [mq-deadline] kyber bfq none
# Temporarily change scheduler (runtime)
echo "mq-deadline" | sudo tee /sys/block/sda/queue/scheduler
# For SSD/NVMe - use 'none'
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler
# Verify change
cat /sys/block/sda/queue/scheduler

Method 1: udev Rules

/etc/udev/rules.d/60-ssd-scheduler.rules
# Non-rotational devices (SSD/NVMe) - no scheduler needed
ACTION=="add|change", KERNEL=="sd[a-z]|nvme[0-9]n1", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
# Rotational devices (HDD) - use mq-deadline
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
# Reload udev rules
sudo udevadm control --reload-rules
sudo udevadm trigger

Method 2: Kernel Boot Parameters

/etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=mq-deadline"
# Apply
sudo update-grub
# Note: the elevator= parameter is ignored by blk-mq-only kernels (5.0+);
# prefer udev rules on modern systems

Method 3: systemd service

/etc/systemd/system/set-io-scheduler.service
[Unit]
Description=Set I/O Scheduler
After=local-fs.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo mq-deadline > /sys/block/sda/queue/scheduler'
ExecStart=/bin/sh -c 'echo none > /sys/block/nvme0n1/queue/scheduler'
[Install]
WantedBy=multi-user.target
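Any of the three methods boils down to the same per-device decision, which can be prototyped interactively first; a dry-run sketch that only prints what it would write (drop the echo and run as root to actually apply it):

```shell
# Dry run: choose 'none' for non-rotational devices, 'mq-deadline' for HDDs
for q in /sys/block/*/queue; do
    [ -e "$q/rotational" ] && [ -e "$q/scheduler" ] || continue
    if [ "$(cat "$q/rotational")" = "0" ]; then
        sched=none          # SSD/NVMe: no seek penalty, skip scheduling
    else
        sched=mq-deadline   # HDD: sort and merge to reduce seeks
    fi
    echo "would write '$sched' to $q/scheduler"
done
```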
# mq-deadline tuning (defaults are usually fine)
ls /sys/block/sda/queue/iosched/
# read_expire (ms), write_expire (ms), fifo_batch, front_merges
# Example: lower the latency deadlines
echo 100 > /sys/block/sda/queue/iosched/write_expire
echo 50 > /sys/block/sda/queue/iosched/read_expire
# bfq tuning
ls /sys/block/sda/queue/iosched/
# slice_idle, low_latency, max_budget, etc.
# Stop this device contributing to the kernel entropy pool (minor overhead saving)
echo 0 > /sys/block/sda/queue/add_random

# /etc/sysctl.conf - Disk I/O related parameters
# Dirty page ratios
vm.dirty_ratio = 15 # % of RAM dirty before writing processes block
vm.dirty_background_ratio = 5 # % of RAM dirty before background writeback starts
# More aggressive for database servers
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 3000 # Age at which dirty pages must be written
vm.dirty_writeback_centisecs = 500 # Writeback daemon wake interval
# Page cache tuning
vm.vfs_cache_pressure = 100 # Higher = more aggressive reclaim
# Memory management
vm.min_free_kbytes = 65536 # Reserve for emergency allocations
# Apply changes
sudo sysctl -p
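Because the dirty ratios are percentages of RAM, the absolute thresholds differ per machine; a quick sketch that converts the currently active settings into MiB figures:

```shell
# Translate the dirty-page ratios into absolute MiB for this machine
mem_kb=$(awk '/^MemTotal/ { print $2 }' /proc/meminfo)
for p in dirty_ratio dirty_background_ratio; do
    r=$(cat /proc/sys/vm/$p)
    echo "$p=${r}% => ~$(( mem_kb * r / 100 / 1024 )) MiB of dirty pages"
done
```

On a 64 GB box a dirty_ratio of 40 allows roughly 25 GiB of unwritten data, which is why large-RAM servers often prefer the absolute vm.dirty_bytes knobs instead.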
# Read-ahead (default usually 128KB/256KB)
blockdev --getra /dev/sda
# Returns in 512-byte sectors, so 256 = 128KB
# Set read-ahead
blockdev --setra 4096 /dev/sda # 2MB for sequential reads
# Permanently set via udev (value is in KB: 2048 KB = the same 2MB)
# /etc/udev/rules.d/60-readahead.rules
ACTION=="add|change", KERNEL=="sda", ATTR{bdi/read_ahead_kb}="2048"
# Queue depth
cat /sys/block/sda/queue/nr_requests
# Default 128, can increase for high-IOPS
# Request size
cat /sys/block/sda/queue/max_sectors_kb
# Max request size in KB
# Disable optional per-device features
echo 0 > /sys/block/sda/queue/add_random # Stop entropy contribution
echo 0 > /sys/block/sda/queue/iostats # Disable I/O accounting (breaks iostat for this device)
# /etc/fstab examples
# Optimized ext4 for SSD
/dev/sda1 / ext4 defaults,noatime,nodiratime,errors=remount-ro 0 1
# Optimized for data
/dev/sdb1 /data ext4 defaults,noatime,nodiratime,data=writeback 0 2
# XFS for large files
/dev/sdc1 /var/log xfs defaults,noatime,nodiratime,logbufs=8 0 2
# Common options:
# noatime - Don't update access time (major win; implies nodiratime)
# nodiratime - Don't update directory access time (redundant with noatime)
# relatime - Only update atime if older than mtime/ctime (kernel default)
# data=journal - Full data journaling (slower but safest)
# data=writeback - Faster, less safe on crash
# barrier=0 / nobarrier - Disable write barriers (dangerous on power loss)
# commit=60 - Journal commit interval in seconds (default 5)
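Before editing /etc/fstab it is worth checking which options a filesystem is actually mounted with; a sketch using /proc/mounts, shown for the root filesystem:

```shell
# Options the root filesystem is currently mounted with, one per line
awk '$2 == "/" { print $4 }' /proc/mounts | tr ',' '\n'
```

If relatime appears and noatime does not, the fstab change has not taken effect yet (a remount or reboot is needed).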
# All I/O related sysctls
sysctl -a | grep -E "vm\.(dirty|writeback|vfs_cache)"
# Important parameters
vm.dirty_ratio # Max dirty before synchronous write
vm.dirty_background_ratio # Start background writeback
vm.dirty_expire_centisecs # Age at which dirty data must be written
vm.dirty_writeback_centisecs # Writeback daemon interval
vm.vfs_cache_pressure # Dentry/inode cache reclaim tendency
vm.min_free_kbytes # Reserved memory
vm.overcommit_memory # Memory overcommit policy
vm.overcommit_ratio # Overcommit percentage

┌────────────────────────────────────────────────────────────────────────┐
│ SSD OPTIMIZATION CHECKLIST │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ □ Use 'none' I/O scheduler │
│ □ Enable TRIM support │
│ □ Disable access time updates (noatime) │
│ □ Align partitions to 1MB boundaries (2048 × 512-byte sectors) │
│ □ Enable discard mount option (or use fstrim.timer) │
│ □ Disable journaling if safe (data=writeback) │
│ □ Use ext4 or xfs (mature, well-tested) │
│ □ Disable swap if enough RAM │
│ □ Increase read-ahead for sequential workloads │
│ │
└────────────────────────────────────────────────────────────────────────┘
# Check if device supports TRIM
hdparm -I /dev/sda | grep -i trim
# Or for NVMe (TRIM is the Dataset Management / Deallocate command)
nvme id-ctrl /dev/nvme0 -H | grep -i "data set"
# Manual TRIM
sudo fstrim -v /
# Enable weekly TRIM (recommended)
sudo systemctl enable fstrim.timer
sudo systemctl start fstrim.timer
# Check timer status
systemctl status fstrim.timer
# discard mount option (inline TRIM on every delete)
# /etc/fstab
/dev/sda1 / ext4 defaults,discard 0 1
# Note: discard can hurt performance on some SSDs
# Check alignment (should be 0 for optimal)
parted /dev/sda
(parted) align-check optimal 1
# Create aligned partition
parted /dev/sda
(parted) mklabel gpt
(parted) mkpart primary ext4 1MiB 100%
# Starting at 1MiB guarantees alignment
# Or use sfdisk
sfdisk /dev/sda << 'EOF'
label: gpt
device: /dev/sda
unit: sectors
/dev/sda1 : start=2048, size=524288, type=0FC63DAF-8483-4772-8E79-3D69D8477DF4
EOF
# start=2048 × 512-byte sectors = 1MiB alignment
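Alignment can also be verified without parted: sysfs exposes each partition's start sector, and a start divisible by 2048 (with 512-byte sectors) means 1MiB alignment. A sketch:

```shell
# Check every partition's start sector for 1MiB alignment (2048 x 512B)
for p in /sys/block/*/*/start; do
    [ -e "$p" ] || continue
    part=${p%/start}; part=${part##*/}   # partition name, e.g. sda1
    s=$(cat "$p")
    if [ $(( s % 2048 )) -eq 0 ]; then
        echo "$part: aligned (start sector $s)"
    else
        echo "$part: NOT 1MiB-aligned (start sector $s)"
    fi
done
```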
# NVMe-specific settings
# Most are already optimized by default
# Check NVMe info
nvme list
nvme id-ctrl /dev/nvme0
# NVMe power state
nvme get-feature /dev/nvme0 -f 2 # Power state
# Increase I/O queue depth (if needed)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
# Confirm the device reports as non-rotational
cat /sys/block/nvme0n1/queue/rotational
# Should be 0
# PCIe settings
lspci -vv | grep -A10 NVMe

diagnose_high_iowait.sh
#!/bin/bash
echo "=== High I/O Wait Diagnosis ==="
echo ""
echo "--- vmstat output (check 'wa' column) ---"
vmstat 1 5
echo -e "\n--- iostat (check '%util' and 'await') ---"
iostat -x 1 5
echo -e "\n--- Top I/O consuming processes ---"
if command -v iotop &> /dev/null; then
    iotop -b -n 3 -o | head -20
else
    ps aux | sort -k6 -rn | head -10
fi
echo -e "\n--- I/O by filesystem ---"
df -h
mount | column -t
echo -e "\n--- Check for I/O errors ---"
dmesg | grep -i "i/o error" | tail -20
dmesg | grep -i "sd[a-z]" | tail -20
echo -e "\n--- Check for filesystem issues ---"
cat /proc/mounts

| Issue | Symptoms | Diagnosis | Solution |
|-------|----------|-----------|----------|
| High I/O wait | wa > 20% in vmstat | iostat, iotop | Add I/O capacity, optimize workloads |
| High await | await > 50ms | iostat -x | Check disk health, reduce queue depth |
| Disk bottleneck | %util ~100% | iostat | Upgrade storage, add cache |
| Fragmentation | Slow sequential reads | fsck, iostat | Defragment, rebalance |
| Full filesystem | ENOSPC errors | df -h, df -i | Add space, clean up |
| Inode exhaustion | ENOSPC but space available | df -i | Clean up many small files |
| LVM issues | Slow I/O | lvs, pvs | Check thin pool, cache |
| NFS issues | High latency | nfsstat, iostat | Tune NFS mount options |
# Method 1: Using iotop (requires root)
sudo iotop -o -P
# Method 2: Using pidstat
pidstat -d 1 | grep -v "^$"
# Method 3: From /proc (cumulative bytes per process)
for pid in /proc/[0-9]*; do
    [ -r "$pid/io" ] || continue
    rb=$(awk '/^read_bytes/ {print $2}' "$pid/io" 2>/dev/null)
    wb=$(awk '/^write_bytes/ {print $2}' "$pid/io" 2>/dev/null)
    echo "$(( ${rb:-0} + ${wb:-0} )) bytes - PID ${pid##*/}"
done | sort -rn | head -10
# Method 4: Using sar
sar -d 1 | tail -20
# SMART status
sudo smartctl -a /dev/sda
sudo smartctl -H /dev/sda
# Short SMART test
sudo smartctl -t short /dev/sda
# Long SMART test
sudo smartctl -t long /dev/sda
# Check SMART results
sudo smartctl -l selftest /dev/sda
# NVMe health and error logs
sudo nvme smart-log /dev/nvme0
sudo nvme error-log /dev/nvme0
# Check filesystem (unmount first!)
sudo umount /dev/sda1
sudo fsck -f /dev/sda1
# Check without unmount (read-only)
sudo fsck -n /dev/sda1
# xfs_repair (filesystem must be unmounted)
sudo xfs_repair -n /dev/sda1 # check only
sudo xfs_repair /dev/sda1
# btrfs check
sudo btrfs check --readonly /dev/sda1

# Install fio
sudo apt-get install fio # Debian/Ubuntu
sudo yum install fio # RHEL/CentOS
# Sequential read test
fio --name=seqread --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 --rw=read --size=1G --runtime=60 --time_based --group_reporting
# Sequential write test
fio --name=seqwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 --rw=write --size=1G --runtime=60 --time_based --group_reporting
# Random read test (4K)
fio --name=randread --ioengine=libaio --direct=1 --bs=4k --iodepth=64 --numjobs=4 --rw=randread --size=1G --runtime=60 --time_based --group_reporting
# Random write test (4K)
fio --name=randwrite --ioengine=libaio --direct=1 --bs=4k --iodepth=64 --numjobs=4 --rw=randwrite --size=1G --runtime=60 --time_based --group_reporting
# Mixed workload (70% read, 30% write)
fio --name=mixed --ioengine=libaio --direct=1 --bs=4k --iodepth=32 --numjobs=4 --rw=randrw --rwmixread=70 --size=1G --runtime=60 --time_based --group_reporting
# Quick benchmark
fio --name=quick --ioengine=sync --direct=0 --bs=4k --numjobs=1 --rw=randread --size=100M --runtime=30 --time_based
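These long command lines can also be expressed as a fio job file, which is easier to version and reuse. A sketch of an equivalent random-read job (the file and section names are arbitrary):

```shell
# randread.fio - run with: fio randread.fio
# [global] options apply to every job section that follows
cat > randread.fio << 'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
size=1G
runtime=60
time_based
group_reporting

[randread]
rw=randread
iodepth=64
numjobs=4
EOF
```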
# Sequential read speed
sudo hdparm -t /dev/sda
sudo hdparm -tT /dev/sda # cached (-T) and device (-t) reads
# For multiple runs
for i in 1 2 3; do sudo hdparm -t /dev/sda; done
# Install
sudo apt-get install bonnie++
# Run benchmark
bonnie++ -u root -d /tmp -s 2048 -n 100
# HTML report (quiet mode emits CSV for bon_csv2html)
bonnie++ -q -u root -d /tmp -s 4096 -f -b -n 10 | bon_csv2html > report.html
# Sequential write test
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct
# Sequential read test
dd if=/tmp/testfile of=/dev/null bs=1M count=1024 iflag=direct
# With performance reporting
dd if=/dev/zero of=/tmp/testfile bs=1M count=1024 oflag=direct status=progress
# Note: dd is not accurate for benchmarking, use fio for proper results

# /etc/sysctl.conf additions for database
vm.dirty_ratio = 40
vm.dirty_background_ratio = 10
vm.dirty_expire_centisecs = 5000
vm.dirty_writeback_centisecs = 500
vm.swappiness = 10
vm.overcommit_memory = 2
vm.overcommit_ratio = 95
# Apply
sudo sysctl -p
# /etc/fstab for database partition
# barrier=0 risks corruption on power loss - only with battery-backed cache
/dev/sdb1 /var/lib/postgresql ext4 defaults,noatime,nodiratime,data=writeback,barrier=0 0 2
# I/O scheduler (none for SSD)
echo none > /sys/block/sda/queue/scheduler
# Queue settings
echo 256 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
# /etc/sysctl.conf for web server
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.vfs_cache_pressure = 50
# Allow more open files
fs.file-max = 65536
# Apply
sudo sysctl -p
# /etc/security/limits.conf
* soft nofile 65535
* hard nofile 65535
# Mount options for webroot
/dev/sda1 /var/www ext4 defaults,noatime,nodiratime,data=writeback 0 2
# /etc/sysctl.conf for file server
vm.dirty_ratio = 60
vm.dirty_background_ratio = 20
vm.dirty_expire_centisecs = 10000
vm.dirty_writeback_centisecs = 1000
# Apply
sudo sysctl -p
# Mount options
# Use XFS for large files
/dev/sdc1 /exports/shares xfs defaults,noatime,nodiratime,logbufs=8,logdev=/dev/sdd1 0 2
# I/O scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

Q1: How do you diagnose high I/O wait on a Linux system?


Answer:

  1. Use vmstat 1 to check the wa (I/O wait) column - values above 20% indicate I/O bottleneck
  2. Use iostat -x 1 to identify which devices are saturated - check %util near 100% and high await times
  3. Use iotop to find processes causing the most I/O
  4. Check if it’s a disk capacity issue with df -h and df -i
  5. Verify disk health with smartctl -a /dev/sda
  6. Check for I/O errors in dmesg | grep -i error

Q2: What is the difference between mq-deadline, bfq, and none I/O schedulers?


Answer:

  • mq-deadline: Best for most workloads including databases and VMs. Implements deadline-based scheduling to prevent request starvation, merges adjacent requests, sorts by sector for disk efficiency
  • bfq (Budget Fair Queuing): Best for desktop/interactive workloads. Provides fair queuing to prevent a single process from monopolizing I/O, better latency for interactive applications
  • none: Best for SSD/NVMe devices that have no seek penalty. Bypasses scheduler overhead entirely, lets the device’s internal queue handle optimization

Q3: How do you optimize an SSD for better performance?


Answer:

  1. Use none I/O scheduler (no seek penalty)
  2. Enable TRIM via fstrim.timer or mount option discard
  3. Use noatime,nodiratime mount options to avoid unnecessary writes
  4. Ensure partitions are aligned to 1MB boundaries
  5. Disable or reduce swap if sufficient RAM available
  6. Use modern filesystems like ext4 or xfs with appropriate options
  7. Increase read-ahead for sequential workloads: blockdev --setra 4096 /dev/sda

Q4: What is the difference between synchronous and asynchronous I/O?


Answer:

  • Synchronous I/O: Process blocks until I/O completes. Simpler programming model, ensures data is on disk before continuing. Default for most applications. Can limit throughput.
  • Asynchronous I/O: Process continues while I/O proceeds in background. Higher throughput for concurrent operations. More complex programming. Examples: io_submit(), aio_read(), io_uring (modern)

Applications can use O_SYNC for synchronous writes, or O_DIRECT for direct I/O bypassing page cache.
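The difference is easy to feel with dd, where oflag=sync makes every write behave like an O_SYNC write. A small sketch (writes a temp file twice; timings vary widely by device):

```shell
# Buffered vs synchronous writes: same data, different completion semantics
f=$(mktemp)
time dd if=/dev/zero of="$f" bs=4k count=500 status=none             # buffered: returns once pages are dirty
time dd if=/dev/zero of="$f" bs=4k count=500 oflag=sync status=none  # like O_SYNC: each write waits for stable storage
rm -f "$f"
```

On most hardware the synchronous run is dramatically slower, which is exactly the cost databases pay for durability on every commit.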

Q5: How does the Linux page cache work and how does it affect I/O performance?


Answer:

  • The page cache stores recently accessed disk data in RAM
  • Reads are served from cache if available (very fast)
  • Writes go to page cache first (write-back), then flushed to disk
  • This dramatically improves performance for repeated reads
  • The dirty_ratio and dirty_background_ratio control when data is written
  • vm.vfs_cache_pressure controls how aggressively dentry/inode cache is reclaimed
  • Can be bypassed with O_DIRECT flag for applications needing explicit control
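The write-back behaviour described above can be watched directly in /proc/meminfo; a sketch that dirties some pages with a buffered write and then forces writeback with sync:

```shell
# Watch dirty pages appear after a buffered write and drain after sync
grep '^Dirty:' /proc/meminfo                        # baseline
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=64 status=none  # buffered write: pages become dirty
grep '^Dirty:' /proc/meminfo                        # usually higher now
sync                                                # force writeback to disk
grep '^Dirty:' /proc/meminfo                        # back near baseline
rm -f "$f"
```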

Q6: What is the relationship between ionice and I/O scheduling?


Answer: ionice sets the I/O scheduling class and priority for processes:

  • Real-time (class 1): Highest priority, can starve other processes
  • Best-effort (class 2): Default, priority 0-7
  • Idle (class 3): Only runs when disk is idle

Examples:

# Run rsync at idle priority
ionice -c 3 -p $(pgrep rsync)
# Run database at real-time
ionice -c 1 -n 0 -p $(pgrep mysql)
# Run with best-effort low priority
ionice -c 2 -n 7 -p $(pgrep tar)

Q7: How do you identify and resolve I/O bottlenecks in a production environment?


Answer:

  1. Identify: Use iostat -x to find devices with high %util and await
  2. Analyze: Use iotop to find culprit processes
  3. Check: Verify disk health, capacity, and fragmentation
  4. Optimize:
    • Add cache (L2ARC, bcache, dm-cache)
    • Use faster storage (SSD, NVMe)
    • Tune I/O scheduler
    • Adjust dirty page parameters
    • Use appropriate filesystem options
    • Consider load distribution (RAID, distributed storage)
  5. Monitor: Set up alerting for I/O metrics

Q8: Explain the difference between write-through and write-back caching.


Answer:

  • Write-through: Data is written to cache AND to storage simultaneously. Slower writes but never loses data on power failure. Used when data integrity is critical.
  • Write-back: Data is written to cache first, then asynchronously flushed to storage. Much faster writes but risk of data loss on crash. This is the default in Linux.

In Linux, the barrier mount option governs flush ordering rather than a pure cache policy: barrier=1 (the ext4 default) issues cache flushes so the journal stays consistent, while barrier=0 skips them and risks corruption on power loss.


# Monitoring
iostat -x 1 # I/O stats
iotop # Per-process I/O
vmstat 1 # System stats
pidstat -d 1 # Process I/O
# Configuration
cat /sys/block/*/queue/scheduler # View scheduler
echo none > /sys/block/sda/queue/scheduler # Set scheduler
blockdev --setra 4096 /dev/sda # Set read-ahead
# Filesystem
mount -o noatime,nodiratime /dev/sda1 /mnt
# Make noatime permanent in /etc/fstab (tune2fs cannot set it)
# Health
smartctl -a /dev/sda
fstrim -v /
# Benchmark
fio --name=test --rw=randread --size=1G --runtime=30
| Metric | Healthy | Warning | Critical |
|--------|---------|---------|----------|
| %util | <50% | 50-80% | >80% |
| await | <10ms | 10-50ms | >50ms |
| wa (vmstat) | <10% | 10-30% | >30% |
| aqu-sz | <2 | 2-10 | >10 |
vm.dirty_ratio # 15-40%
vm.dirty_background_ratio # 5-10%
vm.dirty_expire_centisecs # 3000-5000
vm.dirty_writeback_centisecs # 500-1000
vm.vfs_cache_pressure # 50-100
vm.swappiness # 10-60

In this chapter, you learned:

  • ✅ Linux I/O stack architecture and request flow
  • ✅ Disk information and identification commands
  • ✅ I/O monitoring with iostat, iotop, pidstat, vmstat
  • ✅ I/O scheduler comparison and configuration
  • ✅ Virtual memory and filesystem tuning parameters
  • ✅ SSD and NVMe-specific optimizations
  • ✅ Troubleshooting high I/O wait and bottlenecks
  • ✅ Performance benchmarking with fio
  • ✅ Production configuration examples
  • ✅ Interview questions and answers

Chapter 54: Network Performance


Last Updated: February 2026