
Chapter 90: Hardware Diagnostics

Overview

Hardware diagnostics let Linux system administrators identify failing components, find performance bottlenecks, and prevent hardware-related outages. This chapter covers diagnostics for CPUs, memory, storage, network interfaces, PCI/USB devices, GPUs, temperature, and power. In production, hardware health data matters because intermittent faults (a marginal DIMM, a throttling CPU, a disk that is remapping sectors) often do not show up in software-level monitoring.

90.1 CPU Diagnostics

CPU Information

┌─────────────────────────────────────────────────────────────────────────┐
│                         CPU DIAGNOSTICS FLOW                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                    CPU HEALTH CHECKS                           │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │                                                                  │   │
│   │  1. Basic Info: lscpu, /proc/cpuinfo                          │   │
│   │     - Model, cores, threads, frequency                        │   │
│   │     - Cache sizes, flags                                      │   │
│   │                                                                  │   │
│   │  2. Current Status: top, htop                                │   │
│   │     - Per-core usage, load average                            │   │
│   │     - Per-process CPU usage                                   │   │
│   │                                                                  │   │
│   │  3. Temperature: sensors, turbostat                         │   │
│   │     - Core temperatures, thermal throttling                   │   │
│   │                                                                  │   │
│   │  4. Performance: perf, stress                               │   │
│   │     - Event counting, stress testing                          │   │
│   │                                                                  │   │
│   │  5. Errors: dmesg, mcelog                                    │   │
│   │     - Hardware errors, machine checks                          │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

# ============================================================
# CPU DIAGNOSTICS COMMANDS
# ============================================================

# Basic CPU information
lscpu                                    # Detailed CPU info
lscpu -e                                # CPU topology
cat /proc/cpuinfo                       # Detailed CPU features

# Example lscpu output analysis:
# Architecture:        x86_64
# CPU op-mode(s):      32-bit, 64-bit
# Model name:          Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
# CPU(s):              16              (8 cores × 2 threads per core)
# Thread(s) per core:  2
# Core(s) per socket:  8
# Socket(s):           1
# NUMA node(s):        1
# CPU MHz:             1200.000        (current frequency; base is 2.40 GHz)
# CPU max MHz:         3300.0000       (maximum turbo frequency)
# L3 cache:            20480K
# Flags:               fpu vme de pse tsc msr pae mce cx8 ...

# CPU frequency information
cpupower frequency-info                # Current frequencies
cpupower frequency-set -g performance  # Set performance governor
cpupower idle-info                     # C-states info

# Per-core CPU usage
mpstat -P ALL 1                        # Per-core stats every second
mpstat -I ALL 1                        # Per-interrupt stats (-I needs SUM, CPU, SCPU, or ALL)

# Process per-CPU usage
pidstat -p ALL 1                       # Per-process stats
ps -eo pid,pcpu,comm --sort=-pcpu | head  # Top CPU processes

# Interrupt statistics
cat /proc/interrupts                   # All interrupts
mpstat -I SUM 1                        # Interrupt per second
watch -n1 'cat /proc/softirqs'         # Soft interrupts

# CPU performance counters
perf stat -a -e cycles,instructions,branches,cache-references \
    -- sleep 10                        # System-wide performance

# Check for CPU throttling
dmesg | grep -i throttle
sudo turbostat --interval 5            # Real-time frequency/temp (needs root)

# CPU temperature
sensors                                # All sensor readings
sensors coretemp-isa-0000              # Intel CPU temps
sensors k10temp-pci-00c3              # AMD CPU temps
watch -n1 sensors                     # Monitor temps continuously

# Thermal throttling check
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
dmesg | grep -i "cpu.*throttl"
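
The frequency check above is easier to read when each core is compared against its hardware maximum. The following is a minimal sketch, assuming the cpufreq sysfs interface is present (it is often missing inside VMs and containers). The function name freq_pct and the 50% threshold are illustrative choices, not established values. A low ratio on an idle machine is ordinary frequency scaling, so the result is only meaningful under load.

```shell
#!/bin/bash
# freq_pct CUR_KHZ MAX_KHZ -> current frequency as an integer percentage of max
freq_pct() {
    local cur=$1 max=$2
    echo $(( cur * 100 / max ))
}

# Walk every CPU that exposes cpufreq data; print nothing if none do.
for dir in /sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
    [ -r "$dir/scaling_cur_freq" ] && [ -r "$dir/cpuinfo_max_freq" ] || continue
    cur=$(cat "$dir/scaling_cur_freq")
    max=$(cat "$dir/cpuinfo_max_freq")
    [ "$max" -gt 0 ] || continue
    pct=$(freq_pct "$cur" "$max")
    cpu=${dir%/cpufreq}
    cpu=${cpu##*/}
    flag=""
    # 50 is an arbitrary illustrative threshold, not a vendor figure
    if [ "$pct" -lt 50 ]; then
        flag="  <-- low; under load this suggests throttling"
    fi
    echo "$cpu: $cur kHz of $max kHz ($pct%)$flag"
done
```

Running it while a stress test from the next section is active gives the most useful reading.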

CPU Stress Testing

# ============================================================
# CPU STRESS TESTING
# ============================================================

# Using stress
sudo apt install stress
stress --cpu 8 --timeout 60           # Stress 8 cores for 60s

# Using sysbench (CPU benchmark)
sysbench cpu --cpu-max-prime=20000 --threads=8 run

# Using mprime (burn-in test)
# Download from https://www.mersenne.org/download/

# Multi-threaded CPU stress
stress-ng --cpu 8 --cpu-load 80 --timeout 60s

# Quick CPU burn test
yes > /dev/null &                     # Run multiple of these

CPU Error Detection

# ============================================================
# CPU ERROR DETECTION
# ============================================================

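# Machine Check Architecture (MCA)
# (newer distributions often ship rasdaemon in place of mcelog)
sudo mcelog                            # Check machine checks
sudo mcelog --daemon                   # Run as a logging daemon
sudo mcelog --client                   # Query a running daemon
cat /var/log/mcelog                    # Historical errors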

# dmesg CPU errors
dmesg | grep -i 'cpu\|mce\|thermal'
dmesg | grep -i "hardware error\|machine check"

# Machine check counters (x86)
grep -E 'MCE|MCP' /proc/interrupts     # MCE = exceptions, MCP = polls

# Performance monitoring
vmstat 1                               # CPU context switches
sar -u 1                                # CPU utilization

90.2 Memory Diagnostics

Memory Health Check

┌─────────────────────────────────────────────────────────────────────────┐
│                     MEMORY DIAGNOSTICS                                 │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│   ┌─────────────────────────────────────────────────────────────────┐   │
│   │                   MEMORY TESTING PYRAMID                        │   │
│   ├─────────────────────────────────────────────────────────────────┤   │
│   │                                                                  │   │
│   │                         ┌─────────┐                              │   │
│   │                         │ Memtest │  Full memory test         │   │
│   │                         │ 86+     │  (booted from USB/CD)     │   │
│   │                         └────┬────┘                            │   │
│   │                              │                                 │   │
│   │                      ┌───────┴───────┐                         │   │
│   │                      ▼               ▼                         │   │
│   │                 ┌─────────┐    ┌─────────┐                   │   │
│   │                 │memtester│    │stress-ng│  Live testing    │   │
│   │                 │(user)   │    │         │  (no reboot)     │   │
│   │                 └────┬────┘    └────┬────┘                   │   │
│   │                      │               │                        │   │
│   │              ┌───────┴───────┐  ┌───┴────────┐               │   │
│   │              ▼               ▼  ▼            ▼               │   │
│   │         ┌────────┐     ┌────────┐    ┌────────────┐        │   │
│   │         │/proc/  │     │dmesg   │    │  mbw       │        │   │
│   │         │meminfo │     │errors  │    │(bandwidth) │        │   │
│   │         └────────┘     └────────┘    └────────────┘        │   │
│   │                                                                  │   │
│   └─────────────────────────────────────────────────────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

# ============================================================
# MEMORY DIAGNOSTICS
# ============================================================

# Basic memory info
free -h                                 # Human-readable summary
cat /proc/meminfo                       # Detailed memory info
vmstat -s                               # Memory statistics

# Memory usage per process
ps aux --sort=-%mem | head             # Sort by memory
pmap -x $(pgrep process_name)          # Memory map of process
smem                                    # Memory by process (USS/PSS/RSS)
smem -m                                 # Memory usage by mapping
smem -p                                 # Same report, sizes as percentages

# Memory errors in dmesg
dmesg | grep -i 'memory\|edac\|error\|fail'

# Check for memory errors
dmesg | grep -i "hardware memory error"
dmesg | grep -i "uncorrectable"

# ECC memory status (if available)
# For systems with ECC memory:
edac-util --status                      # Check ECC status
edac-ctl --status                       # EDAC status
cat /sys/devices/system/edac/mc/mc0/ce_count  # Correctable errors
cat /sys/devices/system/edac/mc/mc0/ue_count  # Uncorrectable errors

# Memory bandwidth test
# Install: sudo apt install mbw
mbw -n 3 1000                          # 3 runs, 1000MB test
# Alternative: the STREAM benchmark (for high-performance systems)

# Stress testing memory
stress-ng --vm 4 --vm-bytes 2G --timeout 60s

# Memory latency test
# lat_mem_rd comes from the lmbench suite (mbw measures bandwidth only)
lat_mem_rd 512 128                     # Up to 512 MB, 128-byte stride

# Check for memory pressure
cat /proc/pressure/memory               # PSI (Pressure Stall Information)
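
The EDAC counters shown above exist once per memory controller, so a system with several controllers needs them summed. The following is a minimal sketch, assuming the EDAC driver is loaded and exposes the mc*/ce_count and mc*/ue_count files used earlier. The function name edac_totals and its optional directory argument are hypothetical conveniences added for this example.

```shell
#!/bin/bash
# edac_totals [BASE_DIR] -> prints "ce=<n> ue=<n>" summed over all memory controllers.
# BASE_DIR defaults to the real sysfs location; pass another path to test the logic.
edac_totals() {
    local base=${1:-/sys/devices/system/edac/mc}
    local ce=0 ue=0 mc
    for mc in "$base"/mc[0-9]*; do
        if [ -r "$mc/ce_count" ]; then
            ce=$(( ce + $(cat "$mc/ce_count") ))
        fi
        if [ -r "$mc/ue_count" ]; then
            ue=$(( ue + $(cat "$mc/ue_count") ))
        fi
    done
    echo "ce=$ce ue=$ue"
}

# Prints ce=0 ue=0 on systems without EDAC support
edac_totals
```

Any non-zero ue value deserves immediate attention. A ce value that keeps growing between runs usually points at a DIMM worth replacing during the next maintenance window.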

Memory Testing Tools

# ============================================================
# COMPREHENSIVE MEMORY TESTING
# ============================================================

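# Memtest86+ (run from bootable USB)
# Download: https://www.memtest.org/
# (https://www.memtest86.com/ hosts PassMark MemTest86, a separate product)
# Or use: sudo apt install memtest86+   (typically adds a GRUB menu entry)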

# memtester (user-space testing)
sudo memtester 2G 1                    # Test 2GB, 1 pass

# More thorough testing
sudo memtester 4G 4                    # Test 4GB, 4 passes

# Multiple passes for better coverage
for i in {1..10}; do
    echo "Pass $i"
    sudo memtester 2G 1 || echo "FAIL on pass $i"
done

# Memory inventory (not a test: dmidecode only reads firmware tables)
sudo apt install dmidecode
sudo dmidecode -t memory               # DIMM size, speed, type, slot
sudo dmidecode -t memory | grep -i error   # Error correction type and error handles
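
memtester can only test memory it manages to allocate and lock, so the fixed 2G and 4G sizes above may be too large on a busy host or too small on a big one. The following is a minimal sketch that derives a size from MemAvailable in /proc/meminfo. The helper name memtester_mb and the 75% default are illustrative assumptions, not a memtester recommendation.

```shell
#!/bin/bash
# memtester_mb AVAIL_KB [PERCENT] -> size in MB to hand to memtester.
# PERCENT defaults to 75, an illustrative safety margin rather than a rule.
memtester_mb() {
    local avail_kb=$1 pct=${2:-75}
    echo $(( avail_kb * pct / 100 / 1024 ))
}

# Read MemAvailable (in kB) and print a suggested command line
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo 2>/dev/null)
if [ -n "$avail_kb" ]; then
    echo "Suggested: sudo memtester $(memtester_mb "$avail_kb")M 1"
fi
```

Memory that is in use by the kernel or by running services is never covered this way, which is why the boot-time Memtest86+ run remains the thorough option.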

90.3 Disk Diagnostics

Storage Health Check

# ============================================================
# STORAGE DIAGNOSTICS - SMART
# ============================================================

# SMART overall health
sudo smartctl -H /dev/sda              # Overall health verdict (runs no test)
sudo smartctl -H -d ata /dev/sda       # Force ATA device type

# Detailed SMART info
sudo smartctl -a /dev/sda               # Full SMART data
sudo smartctl -x /dev/sda              # Extended info

# SMART attributes (look for warning values)
sudo smartctl -A /dev/sda | grep -E "ID#|Reallocated|Pending|Uncorrect|Spin_Retry|Temperature|Power_On|Load_Cycle"
# Pay attention to:
# - Reallocated_Sector_Ct (should be 0; a rising value means bad sectors are being remapped)
# - Power_On_Hours (keep track)
# - Spin_Retry_Count (should be 0)
# - Temperature_Celsius (should be < 40-50°C)
# - Load_Cycle_Count (HDDs only: head parking cycles; compare with the drive's rated limit)

# Run SMART tests
sudo smartctl -t short /dev/sda        # Short test (~2 min)
sudo smartctl -t long /dev/sda         # Long test (~1-2 hours)
sudo smartctl -t conveyance /dev/sda   # Conveyance test

# Check test results
sudo smartctl -l selftest /dev/sda    # Test results
sudo smartctl -l error /dev/sda        # Error log

# Continuous SMART monitoring
# Install: sudo apt install smartmontools
sudo systemctl enable smartd
# Configure in /etc/smartd.conf

# NVMe diagnostics
sudo nvme list                         # List NVMe devices
sudo nvme smart-log /dev/nvme0n1       # SMART / health log
sudo smartctl -a /dev/nvme0            # Same health data via smartmontools
sudo nvme error-log /dev/nvme0n1      # Error log
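
Scanning the attribute table by eye gets tedious across many drives. The following is a minimal sketch that filters `smartctl -A` output down to the failure-related attributes whose raw value is above zero. The function name smart_warnings and the attribute watch list are choices made for this example, and the sample rows are made up for illustration, not taken from a real drive.

```shell
#!/bin/bash
# smart_warnings: reads `smartctl -A` output on stdin and prints NAME=RAW for
# each watched attribute whose raw value is above zero.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Spin_Retry_Count)$/ &&
        $10 + 0 > 0 { print $2 "=" $10 }
    '
}

# Illustrative sample rows in the ATA attribute table layout
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21034
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

printf '%s\n' "$sample" | smart_warnings
```

On a live system the same filter would be fed with `sudo smartctl -A /dev/sda | smart_warnings`. NVMe drives report a different log format, so this filter applies to ATA/SATA attribute tables only.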

Disk Performance Testing

# ============================================================
# DISK PERFORMANCE TESTING
# ============================================================

# Quick benchmark with hdparm
sudo hdparm -tT /dev/sda              # Read speed (cached/buffered)

# Sequential read test
sudo hdparm --direct -t /dev/sda       # O_DIRECT read test

# fio - Flexible I/O Tester (comprehensive)
# Install: sudo apt install fio

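# Random read test (4K blocks)
# fio creates /tmp/fiotest on first use; point --filename at the disk under
# test, since /tmp is a RAM-backed tmpfs on some distributions
fio --name=randread --ioengine=libaio --iodepth=4 --rw=randread \
    --bs=4k --size=1G --numjobs=4 --runtime=30 --time_based \
    --direct=1 --filename=/tmp/fiotest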

# Sequential write test
fio --name=seqwrite --ioengine=libaio --iodepth=1 --rw=write \
    --bs=1m --size=1G --numjobs=1 --runtime=30 --time_based \
    --filename=/tmp/fiotest --direct=1

# Random read/write mix (70/30)
fio --name=mixed --ioengine=libaio --iodepth=4 --rw=randrw \
    --rwmixread=70 --bs=4k --size=1G --numjobs=4 --runtime=30 \
    --time_based --filename=/tmp/fiotest --direct=1

# Read test results
# Key metrics:
# - IOPS: Higher is better
# - Latency: Lower is better
# - Throughput (MB/s): Higher is better
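
The first and third metrics are tied together: throughput is IOPS multiplied by the block size, which is why a 4K random test and a 1M sequential test report such different numbers on the same disk. The following is a small worked sketch of that arithmetic. The helper name iops_to_mibs and the input figures are illustrative, not measurements.

```shell
#!/bin/bash
# iops_to_mibs IOPS BLOCK_SIZE_BYTES -> throughput in MiB/s, two decimals
iops_to_mibs() {
    awk -v iops="$1" -v bs="$2" 'BEGIN { printf "%.2f\n", iops * bs / 1048576 }'
}

iops_to_mibs 25000 4096       # 4 KiB random I/O at 25000 IOPS
iops_to_mibs 500 1048576      # 1 MiB sequential I/O at 500 IOPS
```

The first call prints 97.66 and the second prints 500.00, so the workload with fifty times fewer operations per second moves about five times more data.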

Disk I/O Analysis

# ============================================================
# DISK I/O ANALYSIS
# ============================================================

# Real-time I/O stats
iostat -x 1                           # Extended stats
iostat -xz 1                          # Extended stats, idle devices omitted
iotop                                  # Per-process I/O

# Process I/O
pidstat -d 1                          # Per-process I/O
lsof +D /mount_point                  # Files open under a mount point

# Block device stats
cat /proc/diskstats                    # Detailed disk stats
blktrace -d /dev/sda -o /tmp/sda      # Trace I/O
blkparse -i /tmp/sda                   # Analyze trace

# I/O scheduler
cat /sys/block/sda/queue/scheduler    # Current scheduler
# Options: none, mq-deadline, bfq, kyber
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler   # Needs root; sudo echo ... > file would not work

# Queue depth
cat /sys/block/sda/queue/nr_requests  # Queue depth
cat /sys/block/sda/queue/read_ahead_kb

# Check for I/O errors
dmesg | grep -i 'I/O error\|device\|disk'
dmesg | grep -i sd                    # SCSI/SATA errors
dmesg | grep -i nvme                 # NVMe errors

# Filesystem health
sudo tune2fs -l /dev/sda1 | grep -i error
sudo xfs_info /mount/point           # XFS geometry (xfs_repair -n on an unmounted device checks health)
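
The scheduler file lists every available scheduler on one line and marks the active one with square brackets. The following is a minimal sketch that extracts the active name for each block device. The helper name active_scheduler is hypothetical, and devices whose file contains no bracketed entry simply print an empty value.

```shell
#!/bin/bash
# active_scheduler "LINE" -> the scheduler name found inside [brackets]
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' <<< "$1"
}

# Report the active scheduler for every block device that exposes one
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done
```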

90.4 Network Diagnostics

Network Interface Diagnostics

# ============================================================
# NETWORK DIAGNOSTICS
# ============================================================

# Interface status
ip link show
ip addr show
ethtool eth0                          # Detailed interface info

# Link status and capabilities
sudo ethtool eth0                     # All info
sudo ethtool -S eth0                 # Statistics
sudo ethtool -g eth0                 # Ring parameters
sudo ethtool -k eth0                  # Offload features

# Speed and duplex
sudo ethtool eth0 | grep -i speed
sudo ethtool eth0 | grep -i duplex

# Set speed manually (if auto-negotiate fails)
# Note: 1000BASE-T requires autonegotiation, so forcing is mainly useful at 10/100
sudo ethtool -s eth0 speed 100 duplex full autoneg off

# Port statistics
ip -s link show eth0
cat /proc/net/dev

# Interrupt coalescing
sudo ethtool -C eth0 rx-usecs 100 tx-usecs 100

# Check network errors
ip -s link show eth0 | grep -i error
netstat -i                           # Interface stats
netstat -s                          # Protocol stats

# Network performance testing
# Install: sudo apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c server-ip -P 4  # Parallel streams

# Latency testing
ping -c 100 -s 1472 8.8.8.8          # Packet loss/latency
mtr -c 100 8.8.8.8                  # Traceroute + ping
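
The error counters printed by `ip -s link` and /proc/net/dev are cumulative since boot, and on a host with many interfaces the interesting rows are easy to miss. The following is a minimal sketch that prints only interfaces with non-zero error or drop counters, assuming the standard /proc/net/dev column layout (two header lines, then receive and transmit counters). The function name net_errors and its optional file argument are additions for this example.

```shell
#!/bin/bash
# net_errors [FILE]: print interfaces with non-zero error or drop counters.
# FILE defaults to /proc/net/dev; pass another path to analyze a saved copy.
net_errors() {
    awk 'NR > 2 {
        sub(/:/, " ")      # some kernels print "eth0:1234" with no space
        if ($4 + $5 + $12 + $13 > 0)
            printf "%s rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d\n",
                   $1, $4, $5, $12, $13
    }' "${1:-/proc/net/dev}"
}

if [ -r /proc/net/dev ]; then
    net_errors
fi
```

Because the counters never reset, comparing two runs a few minutes apart shows whether errors are still accumulating or are left over from an old incident.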

Network Traffic Analysis

# ============================================================
# NETWORK TRAFFIC ANALYSIS
# ============================================================

# Capture packets
tcpdump -i eth0 -c 100                # Capture 100 packets
tcpdump -i eth0 -w capture.pcap       # Save to file
tcpdump -r capture.pcap               # Read from file

# Filter examples
tcpdump -i eth0 tcp port 80          # HTTP traffic
tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'  # SYN packets
tcpdump -i eth0 host 192.168.1.1     # Specific host

# Bandwidth monitoring
bmon -p eth0                         # Bandwidth monitor
iftop                                # Per-connection bandwidth
nethogs                              # Per-process network

# Connection tracking
ss -tunap                            # Connection states
netstat -an | grep ESTABLISHED
cat /proc/net/nf_conntrack           # Connection tracking
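
Raw `ss` output is long on a busy server, and the first question is usually how many sockets sit in each state. The following is a minimal sketch that tallies the State column of `ss -tan` style output. The function name state_counts is hypothetical, and the live call is skipped when `ss` is not installed.

```shell
#!/bin/bash
# state_counts: reads `ss -tan` style output on stdin (one header line, state
# in the first column) and prints "COUNT STATE", highest count first.
state_counts() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}

if command -v ss >/dev/null 2>&1; then
    ss -tan | state_counts
fi
```

A large TIME-WAIT or CLOSE-WAIT count is a lead worth following with the connection tracking commands above.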

90.5 PCI and USB Diagnostics

Hardware Detection

# ============================================================
# HARDWARE DETECTION
# ============================================================

# PCI devices
lspci                                 # List PCI devices
lspci -v                              # Verbose
lspci -vv -nn                         # Names and IDs
lspci -k                              # Kernel driver

# USB devices
lsusb                                 # List USB devices
lsusb -v                              # Verbose
lsusb -t                             # Tree view

# Hardware details
lshw                                  # Hardware list (verbose)
lshw -short                           # Short summary
lshw -html > hardware.html           # HTML report

# Device information
cat /proc/bus/pci/devices            # PCI devices
sudo cat /sys/kernel/debug/usb/devices   # USB devices (/proc/bus/usb is gone from modern kernels)

# Specific device info
lspci -v -s 01:00.0                  # Specific device
lsusb -v -d 046d:                    # Specific vendor

GPU Diagnostics

# ============================================================
# GPU DIAGNOSTICS
# ============================================================

# NVIDIA GPUs
nvidia-smi                            # GPU status
nvidia-smi -q                         # Detailed query
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,utilization.memory \
    --format=csv -l 1               # Continuous monitoring
nvidia-smi -pm 1                     # Enable persistence mode

# AMD GPUs (ROCm)
rocm-smi                             # AMD GPU management
rocm-smi --showtemp                  # Temperature
rocm-smi --showuse                   # Utilization

# Intel GPUs
intel_gpu_top                         # Intel GPU monitoring
cat /sys/class/drm/card0/gt_*_freq_mhz  # Frequency info

90.6 System Monitoring and Logging

Hardware Event Logs

# ============================================================
# HARDWARE EVENT LOGGING
# ============================================================

# Kernel messages
dmesg                                 # Current boot messages
dmesg | less
dmesg -T                             # Human-readable timestamps
dmesg --level=err,warn              # Errors and warnings

# Persistent logs
journalctl -b -k                     # Kernel messages this boot
journalctl -k -f                     # Follow kernel messages
journalctl -k --since "1 hour ago"

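# Hardware-specific logs (paths vary by distribution)
sudo less /var/log/kern.log         # Kernel log (Debian/Ubuntu)
sudo less /var/log/syslog           # System messages (Debian/Ubuntu; /var/log/messages on RHEL)
sudo less /var/log/dmesg            # Boot-time kernel messages (some systems)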

# IPMI/BMC logs (server management)
ipmitool sel list                    # System event log
ipmitool sel elist                   # Extended list
ipmitool sensor                      # Sensor readings

# Dell iDRAC
# racadm getsensorinfo
# racadm getsel

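# HP iLO
# hponcfg -w ilo_config.xml   (export iLO settings; avoid -r, which resets iLO to factory defaults)
# ipmitool -I lanplus -H <ilo-ip> -U admin sensor   (prompts for the password; needs IPMI over LAN enabled)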

Performance Monitoring

# ============================================================
# SYSTEM MONITORING
# ============================================================

# Overall system health
htop                                  # Interactive process viewer
top                                   # Process viewer
atop                                  # Advanced top
glances                               # Cross-platform monitoring

# Continuous monitoring
vmstat 1                              # Virtual memory stats
iostat -x 1                          # I/O statistics
mpstat -P ALL 1                      # Per-CPU stats
sar -u 1                              # System activity

# Performance data collection
sar -A                                # All data
sar -r                                # Memory
sar -d                                # Disk
sar -n DEV                            # Network

# Disk I/O monitoring
iostat -xz 1                          # Extended stats, idle devices omitted
iotop                                 # Per-process I/O

# Network monitoring
iftop                                # Interface bandwidth
nethogs                              # Per-process network
bmon                                 # Bandwidth monitor

90.7 Temperature and Power

Thermal Monitoring

# ============================================================
# TEMPERATURE MONITORING
# ============================================================

# All sensors
sensors                               # All available sensors
sensors -u                           # Raw output (-v only prints the version)

# Specific sensors
sensors coretemp-isa-0000            # Intel CPU
sensors k10temp-pci-00c3            # AMD CPU
sensors nvme-pci-0100               # NVMe drive (address varies; plain sensors lists chip names)
sensors acpitz-virtual-0            # ACPI thermal

# Monitor continuously
watch -n1 sensors

# Thermal zones
ls /sys/class/thermal/thermal_zone*/temp
cat /sys/class/thermal/thermal_zone*/temp
cat /sys/class/thermal/thermal_zone*/trip_point_*_temp

# CPU frequency (thermal throttling)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

# Fan speed
sensors | grep -i fan
cat /sys/class/hwmon/hwmon*/fan*_input   # Fan RPM via hwmon (if exposed)
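
The thermal zone files above report temperatures in millidegrees Celsius, so a raw value of 45000 means 45 °C. The following is a minimal sketch that prints each zone with its type and a converted reading. The helper name millic_to_c is hypothetical, and zones that refuse to be read are skipped.

```shell
#!/bin/bash
# millic_to_c MILLIDEGREES -> degrees Celsius with one decimal
millic_to_c() {
    awk -v m="$1" 'BEGIN { printf "%.1f\n", m / 1000 }'
}

# Print every readable thermal zone as "name (type): temperature"
for zone in /sys/class/thermal/thermal_zone*; do
    [ -r "$zone/temp" ] || continue
    temp=$(cat "$zone/temp" 2>/dev/null) || continue
    type=$(cat "$zone/type" 2>/dev/null)
    echo "${zone##*/} ($type): $(millic_to_c "$temp") C"
done
```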

Power Management

# ============================================================
# POWER MANAGEMENT
# ============================================================

# Power consumption (if supported)
# turbostat ships with the kernel tools, e.g. on Ubuntu:
#   sudo apt install linux-tools-common linux-tools-$(uname -r)
sudo turbostat --interval 5 --show PkgTmp,PkgWatt,CorWatt

# Intel RAPL (Running Average Power Limit)
# Available on modern Intel CPUs
cat /sys/class/powercap/intel-rapl:0/energy_uj
cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw

# Battery info (laptops)
upower -i /org/freedesktop/UPower/devices/battery_BAT0
cat /sys/class/power_supply/BAT0/uevent

# CPU power states
cpupower idle-info                    # C-states
cpupower frequency-info               # P-states

# Wake-on-LAN
sudo ethtool eth0 | grep Wake-on
sudo ethtool -s eth0 wol g            # Enable WOL
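
The RAPL energy_uj file read earlier in this block is a cumulative energy counter in microjoules, not a power reading. Average power is the difference between two reads divided by the elapsed time. The following is a minimal sketch of that calculation, assuming the intel-rapl:0 zone exists and is readable (recent kernels restrict it to root). The function name rapl_watts is hypothetical. Wraparound is handled with the zone's max_energy_range_uj value.

```shell
#!/bin/bash
# rapl_watts E1_UJ E2_UJ SECONDS [MAX_UJ] -> average watts over the interval.
# MAX_UJ (from max_energy_range_uj) lets the function handle counter wraparound.
rapl_watts() {
    awk -v e1="$1" -v e2="$2" -v dt="$3" -v max="${4:-0}" 'BEGIN {
        d = e2 - e1
        if (d < 0) d += max        # counter wrapped between the two reads
        printf "%.2f\n", d / dt / 1000000
    }'
}

zone=/sys/class/powercap/intel-rapl:0
if [ -r "$zone/energy_uj" ]; then
    e1=$(cat "$zone/energy_uj")
    sleep 1
    e2=$(cat "$zone/energy_uj")
    max=$(cat "$zone/max_energy_range_uj")
    echo "Package power: $(rapl_watts "$e1" "$e2" 1 "$max") W"
fi
```

A longer sleep smooths out short bursts. The one-second interval here is only a convenient default.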

90.8 Comprehensive Hardware Check Script

#!/bin/bash
# hardware_diag.sh - Comprehensive hardware diagnostics
# Run as root: dmidecode, smartctl, and (on many systems) dmesg require it.

if [ "$(id -u)" -ne 0 ]; then
    echo "Warning: not running as root; some sections will be empty." >&2
fi

echo "=============================================="
echo "  Hardware Diagnostics Report"
echo "  $(date)"
echo "=============================================="

# System info
echo -e "\n=== System Information ==="
uname -a
dmidecode -t system 2>/dev/null | grep -E "Manufacturer|Product|Version"

# CPU (-i so that "L3 cache" matches as well as "Cache")
echo -e "\n=== CPU Information ==="
lscpu | grep -iE "Model name|^CPU\(s\)|Thread|Core|Socket|MHz|cache"

# Memory
echo -e "\n=== Memory Information ==="
free -h
dmidecode -t memory 2>/dev/null | grep -E "Size|Speed|Type:" | head -20

# Storage: check every whole disk, not just /dev/sda
echo -e "\n=== Storage Information ==="
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,MODEL
for disk in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" {print $1}'); do
    echo "--- /dev/$disk ---"
    smartctl -H "/dev/$disk" 2>/dev/null | grep -iE "result|health"
done

# Network: report every physical interface, whatever its name
echo -e "\n=== Network Information ==="
ip -br link show
for path in /sys/class/net/*; do
    iface=${path##*/}
    [ -e "$path/device" ] || continue      # skip lo, bridges, other virtual links
    echo "--- $iface ---"
    ethtool "$iface" 2>/dev/null | grep -E "Speed|Duplex|Link detected"
done

# Temperature
echo -e "\n=== Temperature ==="
sensors 2>/dev/null | grep -E "temp|Core"

# Errors
echo -e "\n=== Recent Hardware Errors ==="
dmesg 2>/dev/null | grep -iE "error|fail|hardware" | tail -20

echo -e "\n=============================================="
echo "  Diagnostics Complete"
echo "=============================================="

90.9 Interview Questions

Q1: How would you check if a disk is failing?

A1:
1. SMART data: smartctl -a /dev/sda
   - Check Reallocated_Sector_Ct (should be 0)
   - Check Power_On_Hours (lifetime hours)
   - Check Temperature

2. Check health: smartctl -H /dev/sda
   - "PASSED" = no attribute has crossed its failure threshold
   - "FAILED" = the drive predicts its own failure; replace it

3. Run tests:
   - smartctl -t short /dev/sda (quick test)
   - smartctl -t long /dev/sda (comprehensive)

4. Check dmesg: dmesg | grep -i sd

Q2: What tools would you use to diagnose high CPU usage?

A2:
- top/htop: Identify CPU-intensive processes
- mpstat -P ALL: Per-core usage
- pidstat -p ALL 1: Per-process CPU usage
- perf top: CPU profiling with symbol resolution
- turbostat: Check for thermal throttling
- dmesg: Check for CPU throttling messages

Q3: How do you test memory for errors?

A3:
- Memtest86+: Boot from USB, tests all RAM
- memtester: User-space testing (no reboot needed)
- ECC memory: Check via edac-util or dmesg
- For production: Schedule a maintenance window, run Memtest86+

Q4: What is SMART and how does it work?

A4:
SMART (Self-Monitoring, Analysis, and Reporting Technology):
- Disk monitors itself and tracks metrics
- Attributes: Reallocated sectors, spin retry, temp, etc.
- Pre-failure metrics indicate impending failure
- Tests can be scheduled (short/long/conveyance)
- Not all failures are predicted (sudden mechanical failure)

Q5: How would you diagnose network performance issues?

A5:
1. Check physical link: ethtool eth0
2. Check errors: ip -s link show eth0
3. Test bandwidth: iperf3
4. Check latency: ping/mtr
5. Capture packets: tcpdump
6. Check DNS resolution: dig/nslookup
7. Check firewall: iptables -L -n
8. Check MTU: ip link | grep mtu

Q6: What causes thermal throttling and how do you detect it?

A6:
Causes:
- High ambient temperature
- Poor thermal paste/heat sink contact
- Dust blocking airflow
- Failed fan
- Overclocking

Detection:
- dmesg | grep -i throttle
- turbostat (shows throttling)
- sensors command shows high temps
- Compare scaling_cur_freq with cpuinfo_max_freq under load:
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

Q7: How do you benchmark disk I/O performance?

A7:
- hdparm -tT: Quick sequential read
- fio: Comprehensive benchmarking
  - Random read: fio --name=randread --rw=randread --bs=4k
  - Sequential: fio --name=seqread --rw=read --bs=1m
  - Mixed: fio --name=mixed --rw=randrw --rwmixread=70
- iostat: Real-time monitoring during load
- Key metrics: IOPS, throughput, latency

Q8: What is ECC memory and why is it important?

A8:
ECC (Error-Correcting Code) memory:
- Detects and corrects single-bit errors
- Detects (but cannot correct) multi-bit errors
- Used in servers, workstations, critical systems
- More expensive, requires ECC-capable motherboard/CPU
- Check via: edac-util or dmesg | grep -i edac
- Prevents data corruption from memory errors

Q9: How would you troubleshoot intermittent hardware failures?

A9:
1. Check system logs: journalctl, dmesg, /var/log/*
2. Enable persistent logging
3. Check SMART for storage issues
4. Run memory tests (memtester/memtest86+)
5. Check power supply and temperatures
6. Check for loose connections (reseat components)
7. Update firmware/BIOS
8. Monitor with Nagios/Zabbix for patterns
9. Consider hardware diagnostics from the vendor (Dell/HPE)

Q10: What monitoring tools would you use for hardware health?

A10:
- Prometheus + node_exporter: Hardware metrics
- smartd: SMART monitoring
- Grafana: Visualization
- Nagios/Icinga: Alerting
- Hardware: IPMI/BMC monitoring (ipmitool)
- Custom scripts via cron
- Cloud: AWS CloudWatch, Azure Monitor, GCP Operations

Quick Reference

# CPU
lscpu                              # CPU info
sensors                            # Temperature
turbostat --interval 5             # Real-time stats

# Memory
free -h                            # Memory usage
sudo memtester 2G 1                # Memory test
dmesg | grep -i memory             # Memory errors

# Disk
sudo smartctl -H /dev/sda          # SMART health
sudo smartctl -t long /dev/sda    # Long test
fio --name=test --rw=randread --size=1G --filename=/tmp/fiotest  # Benchmark
iostat -x 1                        # I/O stats

# Network
sudo ethtool eth0                  # Interface info
iperf3 -c server                   # Speed test
tcpdump -i eth0                    # Packet capture

# System
htop                               # Process monitor
dmesg | tail                       # Recent messages
journalctl -k -f                   # Kernel log

Summary

CPU: lscpu, turbostat, sensors for diagnostics
Memory: memtester, memtest86+ for testing
Storage: smartctl for SMART, fio for benchmarks
Network: ethtool, iperf3 for diagnostics
Monitoring: Continuous logging and alerting essential
Errors: Check dmesg, journalctl, SMART logs

Next Chapter

Chapter 91: Linux on AWS

Last Updated: February 2026