Chapter 90: Hardware Diagnostics
Overview
Hardware diagnostics let Linux system administrators identify failing components, tune performance, and prevent hardware-related outages. This chapter covers diagnostics for CPUs, memory, storage, network interfaces, and other hardware components. In production environments, tracking hardware health is essential for reliability and for troubleshooting intermittent issues that software-level monitoring alone may not reveal.
90.1 CPU Diagnostics
CPU Information
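When throttling is suspected, a quick first check is to compare each core's current frequency with its hardware maximum. The following is a minimal sketch, assuming the usual cpufreq sysfs files (`scaling_cur_freq` and `cpuinfo_max_freq`, both in kHz) are present. The function name `cpu_freq_report` and its optional directory argument are illustrative and not part of any standard tool.

```bash
#!/bin/sh
# cpu_freq_report [ROOT]
# Prints each CPU's current frequency as a share of its hardware maximum.
# ROOT defaults to /sys/devices/system/cpu; the argument is a hypothetical
# override so the function can be pointed at a test directory.
# Assumes cpufreq exposes scaling_cur_freq and cpuinfo_max_freq in kHz.
cpu_freq_report() {
    root="${1:-/sys/devices/system/cpu}"
    for dir in "$root"/cpu[0-9]*; do
        cur_file="$dir/cpufreq/scaling_cur_freq"
        max_file="$dir/cpufreq/cpuinfo_max_freq"
        # Skip CPUs without cpufreq data (offline cores, some VMs)
        [ -r "$cur_file" ] && [ -r "$max_file" ] || continue
        cur=$(cat "$cur_file")
        max=$(cat "$max_file")
        [ "$max" -gt 0 ] || continue
        printf '%s %d MHz of %d MHz (%d%%)\n' \
            "${dir##*/}" "$((cur / 1000))" "$((max / 1000))" "$((cur * 100 / max))"
    done
}

cpu_freq_report
```

A low percentage on an idle machine is normal, because the governor clocks cores down. The reading only suggests throttling when it stays low while the system is under sustained load.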
CPU DIAGNOSTICS FLOW: CPU HEALTH CHECKS

1. Basic Info: lscpu, /proc/cpuinfo
   - Model, cores, threads, frequency
   - Cache sizes, flags
2. Current Status: top, htop
   - Per-core usage, load average
   - Per-process CPU usage
3. Temperature: sensors, turbostat
   - Core temperatures, thermal throttling
4. Performance: perf, stress
   - Event counting, stress testing
5. Errors: dmesg, mcelog
   - Hardware errors, machine checks

# ============================================================
# CPU DIAGNOSTICS COMMANDS
# ============================================================
# Basic CPU information
lscpu                        # Detailed CPU info
lscpu -e                     # CPU topology
cat /proc/cpuinfo            # Detailed CPU features

# Example lscpu output analysis (illustrative values):
# Architecture:        x86_64
# CPU op-mode(s):      32-bit, 64-bit
# Model name:          Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
# CPU(s):              16 (8 cores × 2 threads per core)
# Thread(s) per core:  2
# Core(s) per socket:  8
# Socket(s):           1
# NUMA node(s):        1
# CPU MHz:             1200.000 (current frequency; the base is the 2.40 GHz in the model name)
# CPU max MHz:         3300.0000 (turbo boost)
# Cache size:          20480 KB (L3 cache)
# Flags:               fpu vme de pse tsc msr pae mce cx8 ...

# CPU frequency information
cpupower frequency-info                      # Current frequencies
sudo cpupower frequency-set -g performance   # Set performance governor
cpupower idle-info                           # C-states info

# Per-core CPU usage
mpstat -P ALL 1              # Per-core stats every second
mpstat -I CPU 1              # Per-CPU interrupt stats (-I needs SUM, CPU, SCPU, or ALL)

# Process per-CPU usage
pidstat -p ALL 1             # Per-process stats
ps -eo pid,pcpu,comm --sort=-pcpu | head     # Top CPU processes

# Interrupt statistics
cat /proc/interrupts         # All interrupts
mpstat -I SUM 1              # Interrupts per second
watch -n1 'cat /proc/softirqs'               # Soft interrupts

# CPU performance counters
perf stat -a -e cycles,instructions,branches,cache-references \
    -- sleep 10              # System-wide performance

# Check for CPU throttling
dmesg | grep -i throttle
sudo turbostat --interval 5  # Real-time frequency/temp

# CPU temperature (chip names vary by system; plain `sensors` lists them)
sensors                      # All sensor readings
sensors coretemp-isa-0000    # Intel CPU temps
sensors k10temp-pci-00c3     # AMD CPU temps
watch -n1 sensors            # Monitor temps continuously

# Thermal throttling check
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
dmesg | grep -i "cpu.*throttl"

CPU Stress Testing
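A stress run tells you more when temperature or frequency is sampled while the load is active. The sketch below is one way to do that, assuming a POSIX shell. The helper name `sample_while` and its argument order are hypothetical, and the demo uses a harmless `sleep` as the load so it can be run anywhere.

```bash
#!/bin/sh
# sample_while LOAD_CMD SAMPLE_CMD [INTERVAL_SECONDS]
# Runs LOAD_CMD in the foreground while a background loop prints a
# timestamped SAMPLE_CMD reading every INTERVAL_SECONDS (default 5).
# The function name and argument order are illustrative.
sample_while() {
    load_cmd=$1
    sample_cmd=$2
    interval=${3:-5}

    # Background sampler: one line per reading, newlines folded to spaces
    (
        while :; do
            printf '%s %s\n' "$(date +%T)" \
                "$(sh -c "$sample_cmd" 2>/dev/null | tr '\n' ' ')"
            sleep "$interval"
        done
    ) &
    sampler_pid=$!

    sh -c "$load_cmd" >/dev/null 2>&1
    status=$?

    # Stop the sampler once the load has finished
    kill "$sampler_pid" 2>/dev/null
    wait "$sampler_pid" 2>/dev/null
    return "$status"
}

# Harmless demo: the "load" is a 2-second sleep, the "sensor" is the 1-minute load average
sample_while 'sleep 2' 'cut -d" " -f1 /proc/loadavg' 1
```

For a real run, the load and sampler could be something like `sample_while 'stress-ng --cpu 8 --timeout 60s' 'cat /sys/class/thermal/thermal_zone*/temp' 5`. Rising temperatures together with falling frequencies during the run point toward thermal throttling.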
# ============================================================
# CPU STRESS TESTING
# ============================================================

# Using stress
sudo apt install stress
stress --cpu 8 --timeout 60              # Stress 8 cores for 60s

# Using sysbench (CPU benchmark)
sysbench cpu --cpu-max-prime=20000 --threads=8 run

# Using mprime (burn-in test)
# Download from https://www.mersenne.org/download/

# Multi-threaded CPU stress
stress-ng --cpu 8 --cpu-load 80 --timeout 60s

# Quick CPU burn test
yes > /dev/null &                        # Start one per core; stop them with `pkill yes`

CPU Error Detection
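Kernel logs can be long, so it helps to reduce them to a few counters before reading individual messages. The following is a small sketch that counts lines mentioning machine checks, hardware errors, or thermal events. The function name `summarize_kernel_log` and the match patterns are assumptions chosen for illustration, and the sample lines in the demo are made up rather than real log output.

```bash
#!/bin/sh
# summarize_kernel_log: reads kernel log text on stdin and counts lines that
# mention machine checks, generic hardware errors, or thermal events.
# The patterns are illustrative; real messages vary by kernel and platform.
summarize_kernel_log() {
    awk '
        { line = tolower($0) }
        line ~ /machine check|mce:/  { mce++ }
        line ~ /hardware error/      { hw++ }
        line ~ /thermal|throttl/     { thermal++ }
        END { printf "mce=%d hardware_error=%d thermal=%d\n", mce, hw, thermal }
    '
}

# Demo on made-up sample lines (not real log output)
summarize_kernel_log <<'EOF'
[ 10.1] mce: [Hardware Error]: Machine check events logged
[ 20.2] CPU0: Core temperature above threshold, cpu clock throttled
[ 30.3] usb 1-1: new high-speed USB device number 2 using xhci_hcd
EOF
```

On a live system the same function can be fed with `dmesg | summarize_kernel_log` or `journalctl -k -b | summarize_kernel_log`. Non-zero counters are a prompt to read the matching lines, not a diagnosis on their own.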
# ============================================================
# CPU ERROR DETECTION
# ============================================================

# Machine Check Architecture (MCA)
sudo mcelog                  # Read pending machine-check events
sudo mcelog --daemon         # Run in daemon mode
sudo mcelog --client         # Query a running daemon
cat /var/log/mcelog          # Historical errors
# Many newer distributions ship rasdaemon instead of mcelog:
# sudo ras-mc-ctl --errors

# dmesg CPU errors
dmesg | grep -i 'cpu\|mce\|thermal'
dmesg | grep -i "hardware error\|machine check"

# Machine-check sysfs interface (x86; contents vary by kernel version)
ls /sys/devices/system/machinecheck/machinecheck0/

# Performance monitoring
vmstat 1                     # CPU context switches
sar -u 1                     # CPU utilization

90.2 Memory Diagnostics
Memory Health Check
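On ECC systems the EDAC driver exposes one counter pair per memory controller, so checking only `mc0` can miss errors on other controllers. Below is a minimal sketch that sums the counters across all controllers, assuming the standard `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`. The function name `edac_totals` and its directory argument are illustrative.

```bash
#!/bin/sh
# edac_totals [ROOT]
# Sums correctable (ce_count) and uncorrectable (ue_count) error counters
# across all EDAC memory controllers. ROOT defaults to
# /sys/devices/system/edac/mc; the argument is a hypothetical override
# for testing. Returns non-zero when any uncorrectable error is recorded.
edac_totals() {
    root="${1:-/sys/devices/system/edac/mc}"
    ce=0
    ue=0
    for mc in "$root"/mc[0-9]*; do
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$((ce + $(cat "$mc/ce_count")))
        ue=$((ue + $(cat "$mc/ue_count")))
    done
    printf 'correctable=%d uncorrectable=%d\n' "$ce" "$ue"
    [ "$ue" -eq 0 ]
}

# On a machine without EDAC support this simply prints zeros
edac_totals || echo "Uncorrectable memory errors recorded: schedule a DIMM check"
```

A slowly growing correctable count is worth tracking over time, while any uncorrectable error usually justifies a maintenance window.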
MEMORY DIAGNOSTICS: MEMORY TESTING PYRAMID

Level 1 (full memory test, booted from USB/CD):  Memtest86+
Level 2 (live testing, no reboot):               memtester (user space), stress-ng
Level 3:                                         /proc/meminfo, dmesg errors, mbw (bandwidth)

# ============================================================
# MEMORY DIAGNOSTICS
# ============================================================

# Basic memory info
free -h                      # Human-readable summary
cat /proc/meminfo            # Detailed memory info
vmstat -s                    # Memory statistics

# Memory usage per process
ps aux --sort=-%mem | head   # Sort by memory
pmap -x $(pgrep process_name)    # Memory map of process
smem -m                      # Memory usage by mapping
smem -p                      # Show values as percentages

# Memory errors in dmesg
dmesg | grep -i 'memory\|edac\|error\|fail'

# Check for memory errors
dmesg | grep -i "hardware memory error"
dmesg | grep -i "uncorrectable"

# ECC memory status (if available)
# For systems with ECC memory:
edac-util --status           # Check ECC status
edac-ctl --status            # EDAC status
cat /sys/devices/system/edac/mc/mc0/ce_count   # Correctable errors
cat /sys/devices/system/edac/mc/mc0/ue_count   # Uncorrectable errors

# Memory bandwidth test
# Install: sudo apt install mbw
mbw -n 3 1000                # 3 runs, 1000 MB test
# Alternative: the STREAM benchmark (for high-performance systems)

# Stress testing memory
stress-ng --vm 4 --vm-bytes 2G --timeout 60s

# Memory latency test
# mlatency is not a standard packaged tool; lmbench's lat_mem_rd is one common alternative
lat_mem_rd 1024 128          # 1024 MB array, 128-byte stride

# Check for memory pressure
cat /proc/pressure/memory    # PSI (Pressure Stall Information)

Memory Testing Tools
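User-space testers such as memtester lock the memory they test, so asking for more than the system can spare may push a live machine into swapping or the OOM killer. The sketch below derives a test size from `MemAvailable` in `/proc/meminfo`. The function name `memtest_size_mb` and the default of 50 percent are assumptions for illustration, not a recommendation from any tool's documentation.

```bash
#!/bin/sh
# memtest_size_mb [MEMINFO_FILE] [PERCENT]
# Prints PERCENT (default 50) of MemAvailable in whole megabytes.
# MEMINFO_FILE defaults to /proc/meminfo; the argument is a hypothetical
# override so the calculation can be checked against a sample file.
memtest_size_mb() {
    meminfo="${1:-/proc/meminfo}"
    percent="${2:-50}"
    awk -v pct="$percent" '
        /^MemAvailable:/ { printf "%d\n", $2 / 1024 * pct / 100 }
    ' "$meminfo"
}

size_mb=$(memtest_size_mb)
echo "Suggested command: sudo memtester ${size_mb}M 1"
```

Because only memory that is currently free can be tested this way, a clean result does not clear the whole DIMM; a boot-time tester remains the thorough option.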
# ============================================================
# COMPREHENSIVE MEMORY TESTING
# ============================================================

# Memtest86+ (run from bootable USB)
# Download: https://www.memtest.org/
# (https://www.memtest86.com/ hosts PassMark MemTest86, a separate product)
# Or use: sudo apt install memtest86+

# memtester (user-space testing)
sudo memtester 2G 1          # Test 2GB, 1 pass

# More thorough testing
sudo memtester 4G 4          # Test 4GB, 4 passes

# Multiple passes for better coverage
for i in {1..10}; do
    echo "Pass $i"
    sudo memtester 2G 1 || echo "FAIL on pass $i"
done

# Memory module inventory (sizes, speeds, slots)
sudo apt install dmidecode
sudo dmidecode -t memory     # Memory device info
sudo dmidecode -t memory | grep -i error   # Error-related fields reported by firmware

90.3 Disk Diagnostics
Storage Health Check
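The SMART attribute table is easier to act on when the few failure-related attributes are filtered out automatically. The following sketch reads `smartctl -A` output on stdin and flags selected attributes whose raw value is above zero. It assumes the usual ten-column ATA attribute table; the function name `smart_warnings` and the watch list (which adds `Current_Pending_Sector` and `Offline_Uncorrectable` to the attributes named in this chapter) are illustrative choices, and the demo table is made up.

```bash
#!/bin/sh
# smart_warnings: reads `smartctl -A` output on stdin and prints a WARN line
# for each watched attribute whose raw value (column 10) is above zero.
# Assumes the usual ATA attribute table layout; the watch list is an
# illustrative choice. Exits non-zero if anything was flagged.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Spin_Retry_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ \
            && $10 + 0 > 0 {
            printf "WARN %s raw=%s\n", $2, $10
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Demo on a made-up attribute table (not output from a real drive)
smart_warnings <<'EOF' || echo "Review this drive"
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
EOF
```

On a real system the input would come from `sudo smartctl -A /dev/sda | smart_warnings`. NVMe drives report health in a different format, so this filter applies to ATA/SATA devices only.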
# ============================================================
# STORAGE DIAGNOSTICS - SMART
# ============================================================

# SMART overall health
sudo smartctl -H /dev/sda            # Overall health status (does not run a test)
sudo smartctl -H -d ata /dev/sda     # Force the ATA device type

# Detailed SMART info
sudo smartctl -a /dev/sda            # Full SMART data
sudo smartctl -x /dev/sda            # Extended info

# SMART attributes (look for warning values)
sudo smartctl -A /dev/sda | grep -E "^ID|Reallocated|Pending|Spin_Retry|Temperature|Power_On|Load_Cycle"
# Pay attention to:
# - Reallocated_Sector_Ct (should be 0)
# - Power_On_Hours (keep track)
# - Spin_Retry_Count (should be 0)
# - Temperature_Celsius (should be < 40-50°C)
# - Load_Cycle_Count (HDD head parking; compare with the drive's rated load/unload cycles)

# Run SMART tests
sudo smartctl -t short /dev/sda      # Short test (~2 min)
sudo smartctl -t long /dev/sda       # Long test (~1-2 hours, longer on large drives)
sudo smartctl -t conveyance /dev/sda # Conveyance test

# Check test results
sudo smartctl -l selftest /dev/sda   # Test results
sudo smartctl -l error /dev/sda      # Error log

# Continuous SMART monitoring
# Install: sudo apt install smartmontools
sudo systemctl enable --now smartd   # Unit name may differ by distribution
# Configure in /etc/smartd.conf

# NVMe diagnostics
sudo nvme list                       # List NVMe devices
sudo nvme smart-log /dev/nvme0n1     # SMART / health log (temperature, spare, percentage used)
sudo nvme error-log /dev/nvme0n1     # Error log
# Stock nvme-cli has no separate health-log subcommand; health data is part of smart-log

Disk Performance Testing
# ============================================================
# DISK PERFORMANCE TESTING
# ============================================================

# Quick benchmark with hdparm
sudo hdparm -tT /dev/sda             # Read speed (-T cached, -t buffered)

# Sequential read test
sudo hdparm --direct -t /dev/sda     # O_DIRECT read test

# fio - Flexible I/O Tester (comprehensive)
# Install: sudo apt install fio
# Note: /tmp may be tmpfs (RAM-backed) on some distributions; point --filename
# at the filesystem you actually want to measure.

# Random read test (4K blocks)
fio --name=randread --ioengine=libaio --iodepth=4 --rw=randread \
    --bs=4k --size=1G --numjobs=4 --runtime=30 --time_based \
    --readonly --direct=1 --filename=/tmp/fiotest

# Sequential write test
fio --name=seqwrite --ioengine=libaio --iodepth=1 --rw=write \
    --bs=1m --size=1G --numjobs=1 --runtime=30 --time_based \
    --filename=/tmp/fiotest --direct=1

# Random read/write mix (70/30)
fio --name=mixed --ioengine=libaio --iodepth=4 --rw=randrw \
    --rwmixread=70 --bs=4k --size=1G --numjobs=4 --runtime=30 \
    --time_based --filename=/tmp/fiotest --direct=1

# Reading the results
# Key metrics:
# - IOPS: Higher is better
# - Latency: Lower is better
# - Throughput (MB/s): Higher is better

Disk I/O Analysis
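The raw counters in `/proc/diskstats` are cumulative sector counts since boot, which are hard to read directly. The following sketch converts them into MiB read and written per device, assuming the standard field order (device name in field 3, sectors read in field 6, sectors written in field 10) and the fixed 512-byte sector unit used by that file. The function name `diskstats_totals` and its file argument are illustrative.

```bash
#!/bin/sh
# diskstats_totals [FILE]
# Prints MiB read and written since boot for every block device that has
# completed at least one read or write. FILE defaults to /proc/diskstats;
# the argument is a hypothetical override for testing.
# /proc/diskstats counts sectors in fixed 512-byte units.
diskstats_totals() {
    file="${1:-/proc/diskstats}"
    awk '
        $4 + $8 > 0 {
            printf "%s read_MiB=%d written_MiB=%d\n", \
                $3, $6 * 512 / 1048576, $10 * 512 / 1048576
        }
    ' "$file"
}

# Live run, skipped quietly where /proc/diskstats is not available
[ -r /proc/diskstats ] && diskstats_totals
```

Running it twice a known interval apart and subtracting the values gives average throughput over that interval, which is roughly what `iostat` reports.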
# ============================================================
# DISK I/O ANALYSIS
# ============================================================

# Real-time I/O stats
iostat -x 1                  # Extended per-device stats
iostat -xz 1                 # Same, omitting idle devices
sudo iotop                   # Per-process I/O

# Process I/O
pidstat -d 1                 # Per-process I/O
lsof +D /mount_point         # Files open under a directory (recursive, can be slow)

# Block device stats
cat /proc/diskstats          # Detailed disk stats
sudo blktrace -d /dev/sda -o /tmp/sda    # Trace I/O
blkparse -i /tmp/sda         # Analyze trace

# I/O scheduler
cat /sys/block/sda/queue/scheduler       # Current scheduler (shown in brackets)
# Options: none, mq-deadline, bfq, kyber
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Queue depth
cat /sys/block/sda/queue/nr_requests     # Queue depth
cat /sys/block/sda/queue/read_ahead_kb   # Read-ahead size in KB

# Check for I/O errors
dmesg | grep -i 'I/O error\|device\|disk'
dmesg | grep -i sd           # SCSI/SATA errors
dmesg | grep -i nvme         # NVMe errors

# Filesystem health
sudo tune2fs -l /dev/sda1 | grep -i error   # ext2/3/4 error fields
sudo xfs_info /mount/point   # XFS info

90.4 Network Diagnostics
Network Interface Diagnostics
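Error and drop counters for every interface are also available as plain files under `/sys/class/net/<interface>/statistics`, which makes them easy to collect without parsing tool output. Here is a minimal sketch along those lines; the function name `net_error_report` and its directory argument are illustrative.

```bash
#!/bin/sh
# net_error_report [ROOT]
# Prints RX/TX error and drop counters for every network interface.
# ROOT defaults to /sys/class/net; the argument is a hypothetical
# override so the function can be pointed at a test directory.
net_error_report() {
    root="${1:-/sys/class/net}"
    for dev in "$root"/*; do
        stats="$dev/statistics"
        [ -d "$stats" ] || continue
        printf '%s' "${dev##*/}"
        for counter in rx_errors tx_errors rx_dropped tx_dropped; do
            printf ' %s=%s' "$counter" "$(cat "$stats/$counter" 2>/dev/null || echo n/a)"
        done
        printf '\n'
    done
}

net_error_report
```

The counters are cumulative since the interface was created, so a single non-zero value matters less than a value that keeps increasing between two runs.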
# ============================================================
# NETWORK DIAGNOSTICS
# ============================================================

# Interface status
ip link show
ip addr show
ethtool eth0                 # Detailed interface info

# Link status and capabilities
sudo ethtool eth0            # All info
sudo ethtool -S eth0         # Statistics
sudo ethtool -g eth0         # Ring parameters
sudo ethtool -k eth0         # Offload features

# Speed and duplex
sudo ethtool eth0 | grep -i speed
sudo ethtool eth0 | grep -i duplex

# Set speed manually (if auto-negotiate fails)
# Note: 1000BASE-T requires autonegotiation, so forcing 1000 Mb/s with
# autoneg off may not bring the link up on copper
sudo ethtool -s eth0 speed 1000 duplex full autoneg off

# Port statistics
ip -s link show eth0
cat /proc/net/dev

# Interrupt coalescing
sudo ethtool -C eth0 rx-usecs 100 tx-usecs 100

# Check network errors
ip -s link show eth0 | grep -i -A1 errors   # Header line plus the counter line below it
netstat -i                   # Interface stats
netstat -s                   # Protocol stats

# Network performance testing
# Install: sudo apt install iperf3
# Server: iperf3 -s
# Client: iperf3 -c server-ip -P 4   # Parallel streams

# Latency testing
ping -c 100 -s 1472 8.8.8.8  # Packet loss/latency (1472-byte payload fills a 1500-byte MTU)
mtr -r -c 100 8.8.8.8        # Traceroute + ping, printed as a report

Network Traffic Analysis
# ============================================================
# NETWORK TRAFFIC ANALYSIS
# ============================================================

# Capture packets
sudo tcpdump -i eth0 -c 100              # Capture 100 packets
sudo tcpdump -i eth0 -w capture.pcap     # Save to file
tcpdump -r capture.pcap                  # Read from file

# Filter examples
sudo tcpdump -i eth0 tcp port 80         # HTTP traffic
sudo tcpdump -i eth0 'tcp[tcpflags] & tcp-syn != 0'   # SYN packets
sudo tcpdump -i eth0 host 192.168.1.1    # Specific host

# Bandwidth monitoring
bmon -p eth0                 # Bandwidth monitor
sudo iftop                   # Per-connection bandwidth
sudo nethogs                 # Per-process network

# Connection tracking
ss -tunap                    # Connection states
netstat -an | grep ESTABLISHED
sudo cat /proc/net/nf_conntrack          # Connection tracking (needs the nf_conntrack module)

90.5 PCI and USB Diagnostics
Hardware Detection
# ============================================================
# HARDWARE DETECTION
# ============================================================

# PCI devices
lspci                        # List PCI devices
lspci -v                     # Verbose
lspci -vv -nn                # Very verbose, with names and numeric IDs
lspci -k                     # Kernel driver

# USB devices
lsusb                        # List USB devices
lsusb -v                     # Verbose
lsusb -t                     # Tree view

# Hardware details
sudo lshw                    # Hardware list (verbose)
sudo lshw -short             # Short summary
sudo lshw -html > hardware.html          # HTML report

# Device information
cat /proc/bus/pci/devices    # PCI devices
sudo cat /sys/kernel/debug/usb/devices   # USB devices (debugfs; /proc/bus/usb is gone from modern kernels)

# Specific device info
lspci -v -s 01:00.0          # Specific device
lsusb -v -d 046d:            # Specific vendor

GPU Diagnostics
# ============================================================
# GPU DIAGNOSTICS
# ============================================================

# NVIDIA GPUs
nvidia-smi                   # GPU status
nvidia-smi -q                # Detailed query
nvidia-smi --query-gpu=timestamp,temperature.gpu,utilization.gpu,utilization.memory \
    --format=csv -l 1        # Continuous monitoring
sudo nvidia-smi -pm 1        # Enable persistence mode

# AMD GPUs (ROCm)
rocm-smi                     # AMD GPU management
rocm-smi --showtemp          # Temperature
rocm-smi --showuse           # Utilization

# Intel GPUs
sudo intel_gpu_top           # Intel GPU monitoring
cat /sys/class/drm/card0/gt_*_freq_mhz   # Frequency info

90.6 System Monitoring and Logging
Hardware Event Logs
# ============================================================
# HARDWARE EVENT LOGGING
# ============================================================

# Kernel messages
dmesg                        # Current boot messages
dmesg | less
dmesg -T                     # Human-readable timestamps
dmesg --level=err,warn       # Errors and warnings

# Persistent logs
journalctl -b -k             # Kernel messages this boot
journalctl -k -f             # Follow kernel messages
journalctl -k --since "1 hour ago"

# Hardware-specific log files (availability depends on the distribution)
less /var/log/kern.log       # Kernel log
less /var/log/syslog         # System messages
less /var/log/dmesg          # (some systems)

# IPMI/BMC logs (server management)
sudo ipmitool sel list       # System event log
sudo ipmitool sel elist      # Extended list
sudo ipmitool sensor         # Sensor readings

# Dell iDRAC
# racadm getsensorinfo
# racadm getsel

# HP iLO
# hponcfg -a -r
# ilo4-ctl -i <ilo-ip> -u admin -p password sensor

Performance Monitoring
# ============================================================
# SYSTEM MONITORING
# ============================================================

# Overall system health
htop                         # Interactive process viewer
top                          # Process viewer
atop                         # Advanced top
glances                      # Cross-platform monitoring

# Continuous monitoring
vmstat 1                     # Virtual memory stats
iostat -x 1                  # I/O statistics
mpstat -P ALL 1              # Per-CPU stats
sar -u 1                     # CPU utilization

# Performance data collection
sar -A                       # All data
sar -r                       # Memory
sar -d                       # Disk
sar -n DEV                   # Network

# Disk I/O monitoring
iostat -xz 1                 # Per-device stats, omitting idle devices
sudo iotop                   # Per-process I/O

# Network monitoring
sudo iftop                   # Per-connection bandwidth
sudo nethogs                 # Per-process network
bmon                         # Bandwidth monitor

90.7 Temperature and Power
Thermal Monitoring
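The kernel's thermal zone files report temperature in millidegrees Celsius, which is easy to misread as raw output. The sketch below pairs each zone's `type` with its temperature converted to degrees. It assumes the standard `/sys/class/thermal/thermal_zone*/` layout; the function name `thermal_report` and its directory argument are illustrative.

```bash
#!/bin/sh
# thermal_report [ROOT]
# Prints each thermal zone with its type and temperature in degrees Celsius.
# ROOT defaults to /sys/class/thermal; the argument is a hypothetical
# override for testing. The kernel reports temp in millidegrees Celsius.
thermal_report() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones refuse reads at runtime; skip those instead of failing
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$milli" ] || continue
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        printf '%s %s %d.%d C\n' "${zone##*/}" "$kind" \
            "$((milli / 1000))" "$((milli % 1000 / 100))"
    done
}

thermal_report
```

Virtual machines and containers often expose no thermal zones at all, in which case the function prints nothing.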
# ============================================================
# TEMPERATURE MONITORING
# ============================================================

# All sensors
sensors                      # All available sensors
sensors -u                   # Raw output with every subfeature (-v only prints the version)

# Specific sensors (chip names vary by system; plain `sensors` lists them)
sensors coretemp-isa-0000    # Intel CPU
sensors k10temp-pci-00c3     # AMD CPU
sensors nvme-pci-00c3        # NVMe drive
sensors acpitz-virtual-0     # ACPI thermal

# Monitor continuously
watch -n1 sensors

# Thermal zones (values are in millidegrees Celsius)
ls /sys/class/thermal/thermal_zone*/temp
cat /sys/class/thermal/thermal_zone*/temp
cat /sys/class/thermal/thermal_zone*/trip_point_*_temp

# CPU frequency (thermal throttling)
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

# Fan speed
sensors | grep -i fan
cat /sys/class/hwmon/hwmon*/fan*_input   # RPM, if the driver exposes fans (/proc/acpi/fan is gone from modern kernels)

Power Management
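RAPL exposes a cumulative energy counter in microjoules rather than a power reading, so average power has to be derived from two samples: power = (energy difference) / (elapsed time). The sketch below does that arithmetic and handles the counter wrapping around. The function name `rapl_watts` is hypothetical, and the demo uses made-up readings so it runs without root or Intel hardware.

```bash
#!/bin/sh
# rapl_watts START_UJ END_UJ SECONDS [MAX_RANGE_UJ]
# Converts two energy_uj readings taken SECONDS apart into average watts.
# If the counter wrapped (END < START), MAX_RANGE_UJ (the value of
# max_energy_range_uj) is added back to the difference.
rapl_watts() {
    awk -v s="$1" -v e="$2" -v t="$3" -v m="${4:-0}" 'BEGIN {
        delta = e - s
        if (delta < 0) delta += m      # counter wrapped around
        printf "%.2f\n", delta / 1000000 / t
    }'
}

# Worked example with made-up readings: 50 J consumed over 5 s
rapl_watts 1000000 51000000 5

# Live use on a system with the intel-rapl powercap driver (may need root):
# zone=/sys/class/powercap/intel-rapl:0
# a=$(sudo cat "$zone/energy_uj"); sleep 5; b=$(sudo cat "$zone/energy_uj")
# rapl_watts "$a" "$b" 5 "$(cat "$zone/max_energy_range_uj")"
```

The worked example prints `10.00`, since 50,000,000 µJ is 50 J and 50 J over 5 s is 10 W.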
# ============================================================
# POWER MANAGEMENT
# ============================================================

# Power consumption (if supported)
# turbostat ships with the kernel tools packages
# (typically linux-tools-common on Ubuntu, linux-cpupower on Debian)
sudo turbostat --interval 5 --show PkgTmp,PkgWatt,CorWatt

# Intel RAPL (Running Average Power Limit)
# Available on modern Intel CPUs (reading energy_uj may require root)
sudo cat /sys/class/powercap/intel-rapl:0/energy_uj
cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw

# Battery info (laptops)
upower -i /org/freedesktop/UPower/devices/battery_BAT0
cat /sys/class/power_supply/BAT0/uevent

# CPU power states
cpupower idle-info           # C-states
cpupower frequency-info      # P-states

# Wake-on-LAN
sudo ethtool eth0 | grep Wake-on
sudo ethtool -s eth0 wol g   # Enable WOL

90.8 Comprehensive Hardware Check Script
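A diagnostics script that shells out to many tools fails in confusing ways when one of them is not installed. A small pre-flight check can report what is missing before anything else runs. The helper below is a sketch; the name `missing_tools` and the tool list in the demo are illustrative.

```bash
#!/bin/sh
# missing_tools TOOL...
# Prints the name of every listed command that is not found in PATH,
# one per line. Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools a hardware report typically relies on
missing=$(missing_tools lscpu sensors smartctl ethtool dmidecode lsblk)
if [ -n "$missing" ]; then
    echo "Missing tools (matching sections will be empty):"
    echo "$missing"
else
    echo "All diagnostic tools found"
fi
```

Reporting rather than aborting is a deliberate choice here, since a partial report is usually still useful on a minimal server image.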
#!/bin/bash
# hardware_diag.sh - Comprehensive hardware diagnostics
# Needs root (or working sudo) for dmidecode, smartctl, and ethtool.

echo "=============================================="
echo " Hardware Diagnostics Report"
echo " $(date)"
echo "=============================================="

# System info
echo -e "\n=== System Information ==="
uname -a
sudo dmidecode -t system | grep -E "Manufacturer|Product|Version"

# CPU (lscpu labels differ between versions, so match loosely)
echo -e "\n=== CPU Information ==="
lscpu | grep -E "Model name|CPU\(s\)|Thread|Core|Socket|MHz|cache"
sensors 2>/dev/null | grep -E "Core|temp" | head -10

# Memory
echo -e "\n=== Memory Information ==="
free -h
sudo dmidecode -t memory | grep -E "Size|Speed|Type:" | head -20

# Storage
echo -e "\n=== Storage Information ==="
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT,MODEL
sudo smartctl -H /dev/sda 2>/dev/null | head -5

# Network (ip addr, not ip link, so the "inet" lines actually appear)
echo -e "\n=== Network Information ==="
ip addr show | grep -E "^[0-9]+:|inet "
sudo ethtool eth0 2>/dev/null | grep -E "Speed|Duplex|Link"

# Temperature
echo -e "\n=== Temperature ==="
sensors 2>/dev/null | grep -E "temp|Core"

# Errors
echo -e "\n=== Recent Hardware Errors ==="
dmesg | grep -iE "error|fail|hardware" | tail -20

echo -e "\n=============================================="
echo " Diagnostics Complete"
echo "=============================================="

90.9 Interview Questions
HARDWARE DIAGNOSTICS INTERVIEW QUESTIONS

Q1: How would you check if a disk is failing?
A1:
1. SMART data: smartctl -a /dev/sda
   - Check Reallocated_Sector_Ct (should be 0)
   - Check Power_On_Hours (lifetime hours)
   - Check Temperature
2. Check health: smartctl -H /dev/sda
   - "PASSED" = healthy
   - "FAILED" = the drive predicts its own failure
3. Run tests:
   - smartctl -t short /dev/sda (quick test)
   - smartctl -t long /dev/sda (comprehensive)
4. Check dmesg: dmesg | grep -i sd

Q2: What tools would you use to diagnose high CPU usage?
A2:
- top/htop: Identify CPU-intensive processes
- mpstat -P ALL: Per-core usage
- pidstat -u: Per-process CPU usage
- perf top: CPU profiling with symbol resolution
- turbostat: Check for thermal throttling
- dmesg: Check for CPU throttling messages

Q3: How do you test memory for errors?
A3:
- Memtest86+: Boot from USB, tests all RAM
- memtester: User-space testing (no reboot needed, but it cannot test memory already in use)
- ECC memory: Check via edac-util or dmesg
- For production: Schedule a maintenance window and run Memtest86+

Q4: What is SMART and how does it work?
A4:
SMART (Self-Monitoring, Analysis, and Reporting Technology):
- The disk monitors itself and tracks metrics
- Attributes: reallocated sectors, spin retry, temperature, etc.
- Pre-failure metrics indicate impending failure
- Tests can be scheduled (short/long/conveyance)
- Not all failures are predicted (for example, sudden mechanical failure)

Q5: How would you diagnose network performance issues?
A5:
1. Check physical link: ethtool eth0
2. Check errors: ip -s link show eth0
3. Test bandwidth: iperf3
4. Check latency: ping/mtr
5. Capture packets: tcpdump
6. Check DNS resolution: dig/nslookup
7. Check firewall: iptables -L -n
8. Check MTU: ip link | grep mtu

Q6: What causes thermal throttling and how do you detect it?
A6:
Causes:
- High ambient temperature
- Poor thermal paste/heat sink contact
- Dust blocking airflow
- Failed fan
- Overclocking
Detection:
- dmesg | grep throttle
- turbostat (shows throttling)
- sensors shows high temperatures
- Compare scaling_cur_freq with the maximum frequency:
  cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

Q7: How do you benchmark disk I/O performance?
A7:
- hdparm -tT: Quick sequential read
- fio: Comprehensive benchmarking
  - Random read: fio --name=randread --rw=randread --bs=4k
  - Sequential: fio --name=seqread --rw=read --bs=1m
  - Mixed: fio --name=mixed --rw=randrw --rwmixread=70
- iostat: Real-time monitoring during load
- Key metrics: IOPS, throughput, latency

Q8: What is ECC memory and why is it important?
A8:
ECC (Error-Correcting Code) memory:
- Detects and corrects single-bit errors
- Detects (but cannot correct) multi-bit errors
- Used in servers, workstations, critical systems
- More expensive, requires an ECC-capable motherboard/CPU
- Check via: edac-util or dmesg | grep -i edac
- Prevents data corruption from memory errors

Q9: How would you troubleshoot intermittent hardware failures?
A9:
1. Check system logs: journalctl, dmesg, /var/log/*
2. Enable persistent logging
3. Check SMART for storage issues
4. Run memory tests (memtester/memtest86+)
5. Check power supply and temperatures
6. Check for loose connections (reseat components)
7. Update firmware/BIOS
8. Monitor with Nagios/Zabbix for patterns
9. Consider vendor hardware diagnostics (Dell/HPE)

Q10: What monitoring tools would you use for hardware health?
A10:
- Prometheus + node_exporter: Hardware metrics
- smartd: SMART monitoring
- Grafana: Visualization
- Nagios/Icinga: Alerting
- Hardware: IPMI/BMC monitoring (ipmitool)
- Custom scripts via cron
- Cloud: AWS CloudWatch, Azure Monitor, GCP Operations

Quick Reference
# CPU
lscpu                        # CPU info
sensors                      # Temperature
sudo turbostat --interval 5  # Real-time stats

# Memory
free -h                      # Memory usage
sudo memtester 2G 1          # Memory test
dmesg | grep -i memory       # Memory errors

# Disk
sudo smartctl -H /dev/sda    # SMART health
sudo smartctl -t long /dev/sda           # Long test
fio --name=test --rw=randread --bs=4k --size=1G --filename=/tmp/fiotest   # Benchmark
iostat -x 1                  # I/O stats

# Network
sudo ethtool eth0            # Interface info
iperf3 -c server             # Speed test
sudo tcpdump -i eth0         # Packet capture

# System
htop                         # Process monitor
dmesg | tail                 # Recent messages
journalctl -k -f             # Kernel log

Summary
- CPU: lscpu, turbostat, sensors for diagnostics
- Memory: memtester, memtest86+ for testing
- Storage: smartctl for SMART, fio for benchmarks
- Network: ethtool, iperf3 for diagnostics
- Monitoring: Continuous logging and alerting are essential
- Errors: Check dmesg, journalctl, SMART logs
Last Updated: February 2026