Chapter 38: System Monitoring Tools

Comprehensive Linux System Monitoring

38.1 CPU Monitoring Tools

Understanding CPU Metrics

CPU Performance Metrics

Key CPU Metrics:
+----------+---------------------------+
| %usr     | User space CPU usage      |
| %nice    | User space nice CPU usage |
| %sys     | Kernel space CPU usage    |
| %iowait  | Waiting for I/O           |
| %irq     | Interrupt handling        |
| %softirq | Soft interrupt handling   |
| %steal   | Virtual CPU waiting (VM)  |
| %guest   | Running guest VM          |
| %idle    | Idle time                 |
+----------+---------------------------+

Load Average:
    1 min | 5 min | 15 min: average number of runnable processes
    (on Linux, tasks in uninterruptible sleep are counted as well)
    CPU count matters: a load of 4 on a 4-core system = roughly 100% utilized

CPU States:
    Running | Runnable | Blocked (I/O, sleep)
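The "CPU count matters" rule can be made concrete with a small sketch. The function name `load_per_core` is hypothetical (it is not a standard tool); it only divides a load average by a core count, so a result near 1.00 suggests the CPUs are roughly saturated.

```shell
# load_per_core LOAD CORES
# Hypothetical helper: prints the load average divided by the core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 1          # avoid division by zero
        printf "%.2f\n", load / cores
    }'
}

# Example on a live Linux system (assumes /proc/loadavg and nproc exist):
# load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A value well above 1.00 for a sustained period usually means tasks are queuing for CPU or, on Linux, stuck waiting on I/O.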
top - Real-time Process Monitor

    # Basic usage
    top                       # Interactive mode
    top -d 1                  # Update every 1 second
    top -n 5                  # Run 5 iterations, then exit
    top -u username           # Show only this user's processes
    top -p 1234               # Monitor a specific PID

    # Interactive commands
    # P - Sort by CPU (default)
    # M - Sort by memory
    # T - Sort by cumulative CPU time
    # k - Kill process
    # r - Renice process
    # f - Field manager (choose columns; press s there to set the sort field)
    # o - Add a filter, for example COMMAND=nginx (procps-ng top)
    # 1 - Show per-CPU breakdown
    # l - Toggle load average line
    # t - Toggle task/CPU summary
    # m - Toggle memory summary
    # W - Save configuration
    # q - Quit

    # Sorting and batch mode
    top -o %MEM               # Sort by memory
    top -b -n 1 > output.txt  # Batch mode for scripting
    top -b -n 1 -p 1234       # Single PID in batch mode
htop - Enhanced Process Viewer

    # Installation
    sudo apt install htop     # Debian/Ubuntu
    sudo yum install htop     # RHEL/CentOS
    sudo pacman -S htop       # Arch

    # Usage
    htop                      # Interactive
    htop -u username          # Filter by user
    htop -p 1234,5678         # Monitor specific PIDs
    htop -d 10                # Delay between updates, in tenths of a second (10 = 1 s)

    # Interactive keys
    # F1     - Help
    # F2     - Setup (columns, colors)
    # F3     - Search process
    # F4     - Filter processes
    # F5     - Tree view
    # F6     - Sort by (column)
    # F7     - Nice - (increase priority)
    # F8     - Nice + (decrease priority)
    # F9     - Kill (send signal)
    # Space  - Tag process
    # U      - Untag all
    # Ctrl+L - Refresh
    # F10, q or Ctrl+C - Quit

    # Tree view shows parent-child relationships
    # Color coding by CPU/memory usage
    # Function keys for quick actions
mpstat - Multi-processor Statistics

    # Install (sysstat package)
    sudo apt install sysstat

    # All CPUs, 1 second interval, 5 samples
    mpstat -P ALL 1 5

    # Specific CPU
    mpstat -P 0 1 5           # CPU 0 only

    # Interval without a count
    mpstat 1                  # Continuous, every 1 second

    # Key output columns:
    # %usr    - User time
    # %nice   - Nice time
    # %sys    - System time
    # %iowait - I/O wait
    # %irq    - IRQ time
    # %soft   - Soft IRQ
    # %steal  - Steal time
    # %guest  - Guest time
    # %idle   - Idle time
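Percentages like these are derived from cumulative tick counters, so a utilization figure always needs two samples. The sketch below illustrates the arithmetic using the aggregate `cpu` line of `/proc/stat`; the function name `cpu_busy_pct` is hypothetical, and treating iowait as idle time is an assumption made here for simplicity.

```shell
# cpu_busy_pct "FIRST_CPU_LINE" "SECOND_CPU_LINE"
# Hypothetical helper: takes two samples of the aggregate "cpu" line from
# /proc/stat and prints the non-idle share of the elapsed CPU time.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                         # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Example on a live system:
# a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
# cpu_busy_pct "$a" "$b"
```

The guest fields are skipped because the kernel already includes guest time in the user and nice counters.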
sar - System Activity Reporter

    # Install
    sudo apt install sysstat

    # CPU usage
    sar -u                            # Today's data so far (from the log)
    sar -u 1 5                        # Real-time, 5 samples at 1 second intervals
    sar -u -s 10:00:00 -e 12:00:00    # Time range

    # Historical CPU data
    # Data stored in /var/log/sa/ (RHEL family) or /var/log/sysstat/ (Debian family)

    # Other sar options
    sar -q                    # Load average
    sar -r                    # Memory usage
    sar -S                    # Swap space
    sar -d                    # Block device I/O
    sar -n DEV                # Network devices
    sar -n SOCK               # Sockets
    sar -b                    # I/O and transfer rate
    sar -B                    # Paging statistics
    sar -W                    # Swapping statistics

    # Create reports (-f selects the daily data file; -i would set an interval instead)
    sar -u -s 08:00:00 -e 18:00:00 -f /var/log/sa/sa20 > report.txt
vmstat - Virtual Memory Statistics

    # Basic usage
    vmstat 1 5                # 1 second interval, 5 samples
    vmstat 1                  # Continuous

    # Key columns:
    # r  - Runnable processes (running or waiting for CPU)
    # b  - Processes blocked in uninterruptible sleep (usually I/O)
    # si - Memory swapped in from disk
    # so - Memory swapped out to disk
    # bi - Blocks read in from block devices
    # bo - Blocks written out to block devices
    # us - User CPU
    # sy - System CPU
    # id - Idle CPU
    # wa - CPU time spent waiting for I/O
    # st - Stolen CPU (VM)

    # Memory columns:
    # swpd  - Swap space in use
    # free  - Free memory
    # buff  - Buffer memory
    # cache - Cache memory

    # Detailed modes
    vmstat -s                 # Event counters and memory statistics
    vmstat -m                 # Slab info
    vmstat -d                 # Disk statistics
38.2 Memory Monitoring

Understanding Linux Memory

Linux Memory Types

Memory Categories:
+--------------+---------------------------------------------+
| MemTotal     | Total physical RAM                          |
| MemFree      | Completely unused RAM                       |
| MemAvailable | Free + reclaimable buffers/cache (estimate) |
| Buffers      | Block device cache                          |
| Cached       | Page cache (file-backed)                    |
| SwapCached   | Swapped pages still in memory               |
| SwapTotal    | Total swap space                            |
| SwapFree     | Free swap space                             |
+--------------+---------------------------------------------+

Important: Linux uses otherwise idle memory for buffers and cache.
This memory can be reclaimed when applications need it.

Available Memory = Free + Buffers + Cached (approximate; the kernel's
MemAvailable estimate also accounts for reclaimable slab and reserved pages)
free - Memory Usage

    # Basic usage
    free                      # Kibibytes (default)
    free -b                   # Bytes
    free -k                   # Kibibytes
    free -m                   # Mebibytes
    free -g                   # Gibibytes
    free -h                   # Human-readable (best)

    # Continuous monitoring
    free -s 5                 # Update every 5 seconds
    watch -n 1 free           # Watch continuously

    # Extended view
    free -l                   # Show low/high memory
    free -t                   # Show totals
    free -s 5 -c 10           # 10 iterations, 5 second intervals

    # Understanding output:
    # total      - Total physical RAM
    # used       - Used memory (total - free - buff/cache; recent procps-ng releases report total - available)
    # free       - Unused RAM
    # shared     - Mostly tmpfs (Shmem)
    # buff/cache - Buffers + cache combined (free -w shows them as separate columns)
    # available  - Estimate of what apps can use without swapping (kernel 3.14+)
/proc/meminfo

    # View detailed memory info
    cat /proc/meminfo

    # Key fields:
    # MemTotal:       16384032 kB
    # MemFree:         2341232 kB
    # MemAvailable:   12567890 kB   # Important!
    # Buffers:          234567 kB
    # Cached:          4567890 kB
    # SwapCached:         1234 kB
    # Active(anon):    3456789 kB
    # Inactive(anon):  1234567 kB
    # Active(file):    5678901 kB
    # Inactive(file):  2345678 kB
    # SReclaimable:     567890 kB   # Can be reclaimed
    # SwapTotal:       8388604 kB
    # SwapFree:        7654321 kB

    # Filter specific fields
    grep -E "MemTotal|MemFree|MemAvailable" /proc/meminfo
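Because MemAvailable is the field that reflects what applications can still obtain, a "percent used" figure is better computed from it than from MemFree. A minimal sketch, assuming input in `/proc/meminfo` format on stdin; the function name `mem_used_pct` is hypothetical.

```shell
# mem_used_pct: reads /proc/meminfo-style text on stdin and prints
# 100 * (MemTotal - MemAvailable) / MemTotal with one decimal place.
mem_used_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total <= 0) exit 1      # field missing or unreadable input
            printf "%.1f\n", 100 * (total - avail) / total
        }'
}

# Example on a live system:
# mem_used_pct < /proc/meminfo
```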
Memory Analysis Tools

    # ps with memory sorting
    ps aux --sort=-%mem | head -10    # Top memory users
    ps -eo pid,user,%mem,vsz,comm --sort=-%mem | head

    # pmap - Process memory map
    pmap -x 1234              # Extended view
    pmap -X 1234              # Even more detail

    # smem - Memory reporting
    sudo apt install smem
    smem                      # All processes
    smem -u                   # By user
    smem -k                   # Show unit suffixes (K, M, G)
    smem -t                   # Totals
    smem -p                   # Show percentages

    # slabtop - Kernel slab cache
    sudo slabtop
    # Shows kernel memory usage (dentry, inode caches)

    # /dev/shm (Shared Memory)
    df -h /dev/shm
    ls -la /dev/shm
38.3 Disk Monitoring

Disk Usage Commands

    # Disk space overview
    df -h                     # Human-readable
    df -h -T                  # Include filesystem type
    df -i                     # Show inodes
    df -BG                    # Sizes in GiB blocks
    df -T /home               # Specific mount point

    # Find largest directories
    du -sh /var/*             # Directories in /var
    du -h --max-depth=1       # One level deep
    du -ah /home/user | sort -rh | head -10    # Top 10 largest files and directories

    # ncdu - Interactive disk usage
    sudo apt install ncdu
    ncdu /                    # Interactive
    ncdu -q                   # Quiet mode: slower screen refresh while scanning (useful over SSH)

    # Baobab (GUI disk analyzer)
    sudo apt install baobab
    baobab
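For unattended checks, the `df` output can be filtered against a usage threshold. The sketch below assumes the POSIX layout printed by `df -P` (capacity in column 5, mount point in column 6) and mount points without spaces; the function name `disks_over` is hypothetical.

```shell
# disks_over THRESHOLD
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem whose usage is at or above THRESHOLD percent.
disks_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)               # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}

# Example on a live system:
# df -P | disks_over 90
```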
Disk I/O Monitoring

    # iostat - I/O statistics
    iostat -x 1 5             # Extended, 1 second interval, 5 reports
    iostat -x -k              # In kB
    iostat -m -x              # In MB
    iostat -p sda             # Per partition

    # Key columns:
    # r/s    - Read requests/second
    # w/s    - Write requests/second
    # rkB/s  - Read kB/second
    # wkB/s  - Write kB/second
    # await  - Average wait time (ms); newer sysstat reports r_await and w_await
    # svctm  - Average service time (ms); deprecated and dropped from newer sysstat
    # %util  - Device utilization %

    # iotop - Per-process I/O
    sudo iotop                # Interactive
    sudo iotop -o             # Only processes actually doing I/O
    sudo iotop -P             # Show processes, not threads
    sudo iotop -b             # Batch mode
    sudo iotop -a             # Accumulated I/O

    # pidstat - Per-process statistics
    pidstat -d 1 5            # Disk I/O
    pidstat -r 1 5            # Memory
    pidstat -w 1 5            # Context switches
Disk Health

    # SMART status
    sudo smartctl -a /dev/sda             # All info
    sudo smartctl -H /dev/sda             # Health
    sudo smartctl -t short /dev/sda       # Short test
    sudo smartctl -t long /dev/sda        # Long test
    sudo smartctl -l selftest /dev/sda    # Results

    # SMART attributes to watch:
    # 5   - Reallocated Sector Count
    # 187 - Reported Uncorrectable Errors
    # 197 - Current Pending Sector Count
    # 198 - Offline Uncorrectable
    # 196 - Reallocation Event Count

    # Disk partition info
    lsblk                     # Tree view
    lsblk -f                  # With filesystem
    blkid                     # UUIDs and types

    # I/O scheduler (the active one is shown in square brackets)
    cat /sys/block/sda/queue/scheduler
    # Common: mq-deadline, bfq, none

    # Set scheduler
    echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
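The scheduler file lists every available scheduler on one line and marks the active one with square brackets, for example `mq-deadline kyber [bfq] none`. A small sketch for extracting just the active name in scripts; the function name `active_scheduler` is hypothetical.

```shell
# active_scheduler: reads the contents of a queue/scheduler file on stdin
# and prints only the bracketed (currently active) scheduler name.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Example on a live system (device name sda is an assumption):
# active_scheduler < /sys/block/sda/queue/scheduler
```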
38.4 Network Monitoring

Network Statistics

    # Interface statistics
    ip -s link
    ip -s link show eth0
    netstat -i

    # Connection tracking (-p shows other users' processes only when run as root)
    ss -tunapl                    # All TCP/UDP sockets, listening and connected
    ss -H -tuna | wc -l           # Socket count (-H omits the header line)
    ss -tuna state established    # Established connections only

    # Network statistics
    sar -n DEV 1 5            # Device stats
    sar -n SOCK 1 5           # Socket stats
    sar -n TCP 1 5            # TCP stats
    sar -n EDEV 1 5           # Error stats

    # netstat (deprecated but useful)
    netstat -tunapl
    netstat -s                # Per-protocol stats
    netstat -r                # Routing table
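A raw socket count says little on its own; a breakdown by TCP state is often more telling (many TIME-WAIT or CLOSE-WAIT entries point to different problems). A minimal sketch, assuming `ss -tan` style input where the first column is the state and the first line is a header; the function name `count_tcp_states` is hypothetical.

```shell
# count_tcp_states: reads `ss -tan` style output on stdin and prints one
# "STATE COUNT" line per TCP state, sorted by state name.
count_tcp_states() {
    awk 'NR > 1 { count[$1]++ }
         END    { for (s in count) print s, count[s] }' | sort
}

# Example on a live system:
# ss -tan | count_tcp_states
```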
Bandwidth Monitoring

    # iftop - Bandwidth by host
    sudo iftop                # Interactive
    sudo iftop -i eth0        # Interface
    sudo iftop -n             # No DNS resolution
    sudo iftop -B             # Bytes not bits

    # nethogs - Per-process bandwidth
    sudo nethogs              # Interactive
    sudo nethogs eth0         # Interface
    sudo nethogs -d 1         # 1 second delay

    # iptraf-ng - Traffic monitor
    sudo iptraf-ng

    # vnstat - Network traffic logger
    sudo apt install vnstat
    vnstat                    # Summary
    vnstat -h                 # Hourly
    vnstat -d                 # Daily
    vnstat -m                 # Monthly
    vnstat -l                 # Live

    # nload - Real-time traffic
    sudo nload                # All detected interfaces
    sudo nload eth0           # Specific interface
Packet Capture

    # tcpdump - Packet capture
    sudo tcpdump -i eth0
    sudo tcpdump -i eth0 port 80
    sudo tcpdump -i eth0 host 192.168.1.1
    sudo tcpdump -i eth0 -w capture.pcap    # Write packets to a file
    sudo tcpdump -r capture.pcap            # Read packets back from a file

    # tshark - Wireshark CLI
    sudo tshark -i eth0
    sudo tshark -r capture.pcap

    # ngrep - Network grep (-d selects the interface, -i ignores case)
    sudo ngrep -d eth0 -i password
38.5 Process Monitoring

Process Listing

    # Basic process listing
    ps aux                    # BSD style
    ps -ef                    # Standard (System V) style
    ps -ejH                   # Process tree
    ps -eo pid,ppid,user,%cpu,%mem,cmd

    # Find processes
    pgrep -a nginx            # Find by name
    pgrep -u username         # By user
    pidof nginx               # Get PIDs

    # Process tree
    pstree                    # Tree view
    pstree -p                 # With PIDs
    pstree username           # Only this user's processes (-u would show UID changes)

    # Process details
    ps -fp 1234               # Full info for PID
    ls -la /proc/1234/        # Process info in /proc
    cat /proc/1234/status     # Detailed status
Open Files

    # lsof - List open files
    lsof                      # All files
    lsof -p 1234              # Files for PID
    lsof -u username          # User's files
    lsof -i                   # Network files
    lsof -i :80               # Port 80
    lsof -i TCP:80            # TCP port 80
    lsof -i @192.168.1.1      # To specific host
    lsof +D /var/log          # Files in directory

    # fuser - Which process is using a file or port
    fuser 80/tcp              # What's using port 80
    fuser -k 80/tcp           # Kill processes using the port
    fuser /var/log/syslog     # Processes using the file
Process Tracing

    # strace - System calls
    strace -p 1234                    # Attach to process
    strace -c -p 1234                 # Count syscalls
    strace -e openat -p 1234          # Filter syscalls
    strace -f -p 1234                 # Follow forks
    strace -o output.txt -p 1234      # To file
    strace -tt -p 1234                # Timestamps

    # ltrace - Library calls
    ltrace -p 1234
    ltrace -c -p 1234                 # Count calls

    # /proc filesystem
    tr '\0' ' ' < /proc/1234/cmdline      # Command line (arguments are NUL-separated)
    tr '\0' '\n' < /proc/1234/environ     # Environment (entries are NUL-separated)
    ls /proc/1234/fd/                     # File descriptors
    cat /proc/1234/maps                   # Memory maps
    cat /proc/1234/statm                  # Memory stats (in pages)
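The values in `/proc/PID/statm` are counted in pages, not bytes, so they need converting before they can be compared with `ps` or `top` output. A minimal sketch that converts the second field (resident set size) to KiB; the function name `statm_rss_kib` is hypothetical and the page size is passed in explicitly.

```shell
# statm_rss_kib PAGE_SIZE_BYTES
# Reads one /proc/PID/statm line on stdin and prints the resident set
# size (second field, in pages) converted to KiB.
statm_rss_kib() {
    awk -v page="$1" '{ print $2 * page / 1024 }'
}

# Example on a live system (PID 1234 is a placeholder):
# statm_rss_kib "$(getconf PAGESIZE)" < /proc/1234/statm
```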
38.6 System Resource Limits

ulimit

    # View current limits
    ulimit -a                 # All limits
    ulimit -n                 # Open files
    ulimit -u                 # Max processes

    # Set limits (current shell session and its children)
    ulimit -n 65535           # Max open files
    ulimit -u 4096            # Max processes
    ulimit -f 1024            # Max file size (in 1 KiB blocks in bash)

    # Persistent limits
    # /etc/security/limits.conf

    # Examples:
    # *        soft nofile 65535
    # *        hard nofile 65535
    # root     soft nofile 65535
    # root     hard nofile 65535
    # @student soft nproc  100
    # @student hard nproc  200

    # Also /etc/security/limits.d/*.conf
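cgroups and systemd

    # systemd resource limits
    # Set in the [Service] section of the unit file:
    # MemoryMax=1G
    # CPUQuota=50%
    # TasksMax=100

    # cgroups v2 (run as root; the cpu and memory controllers must be
    # enabled in the parent's cgroup.subtree_control)
    ls /sys/fs/cgroup/
    # Create cgroup
    mkdir -p /sys/fs/cgroup/mygroup
    echo "50000 100000" > /sys/fs/cgroup/mygroup/cpu.max    # Quota and period in microseconds: 50% of one CPU
    echo 500000000 > /sys/fs/cgroup/mygroup/memory.max      # About 500 MB
    echo $$ > /sys/fs/cgroup/mygroup/cgroup.procs           # Move the current shell into the group

    # View cgroup of process
    cat /proc/1234/cgroup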
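In cgroups v2, `cpu.max` holds two numbers, a quota and a period in microseconds, and the group may use up to quota/period CPUs. The sketch below turns a percentage (as used by systemd's `CPUQuota=`) into that pair; the function name `cpu_max_for_pct` is hypothetical, and 100000 is assumed as the period because it is the kernel default.

```shell
# cpu_max_for_pct PERCENT [PERIOD_US]
# Prints the "QUOTA PERIOD" pair to write into cpu.max so that the group
# is limited to PERCENT of one CPU (200 means two full CPUs).
cpu_max_for_pct() {
    pct=$1
    period=${2:-100000}       # period in microseconds, default 100 ms
    echo "$(( pct * period / 100 )) $period"
}

# Example (requires root and an existing group named mygroup):
# cpu_max_for_pct 50 > /sys/fs/cgroup/mygroup/cpu.max
```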
38.7 System Information

System Info Commands

    # Kernel and OS
    uname -a                  # All info
    uname -r                  # Kernel version
    uname -m                  # Architecture
    cat /etc/os-release       # OS details

    # Hardware
    lspci                     # PCI devices
    lsusb                     # USB devices
    lscpu                     # CPU info
    lsscsi                    # SCSI devices
    lsmem                     # Memory ranges and their online status

    # Kernel modules
    lsmod                           # Loaded modules
    modinfo module_name             # Module info
    sudo modprobe module_name       # Load module
    sudo modprobe -r module_name    # Unload

    # Device info
    lsblk                     # Block devices
    cat /proc/cpuinfo         # CPU details
    cat /proc/meminfo         # Memory details
38.8 Comprehensive Monitoring Scripts

System Health Script

    #!/bin/bash
    echo "=== System Health Check ==="
    echo ""

    echo "--- Uptime ---"
    uptime

    echo ""
    echo "--- Load Average ---"
    cat /proc/loadavg

    echo ""
    echo "--- Memory ---"
    free -h

    echo ""
    echo "--- Disk Usage ---"
    df -h | grep -v tmpfs

    echo ""
    echo "--- CPU Usage ---"
    # Prints the user-space share from top's summary line
    top -bn1 | grep "Cpu(s)" | awk '{print "CPU (user): " $2 "%"}'

    echo ""
    echo "--- Top Memory Users ---"
    # NR > 1 skips the ps header line; column 6 is RSS in KiB
    ps aux --sort=-%mem | head -6 | awk 'NR > 1 {printf "%.1f MB - %s\n", $6/1024, $11}'

    echo ""
    echo "--- Top CPU Users ---"
    ps aux --sort=-%cpu | head -6 | awk 'NR > 1 {print $3 "% - " $11}'

    echo ""
    echo "--- Network Connections ---"
    # -H omits the header so it is not counted as a connection
    echo "$(ss -H -tuna | wc -l) connections"

    echo ""
    echo "--- Failed Login Attempts ---"
    # lastb reads /var/log/btmp, which normally requires root
    sudo lastb | head -5
38.9 Interview Questions

Basic Questions

- How do you check CPU usage in Linux?
  - top, htop, mpstat, sar
- What is load average?
  - Average number of runnable processes (on Linux, plus tasks in uninterruptible sleep) over the last 1, 5 and 15 minutes
- How do you check memory usage?
  - free -h, /proc/meminfo, top
- What is the difference between buffers and cache?
  - Buffers: block device I/O; Cache: page cache for files
- How do you find which process is using most memory?
  - ps aux --sort=-%mem, htop
Intermediate Questions

- What is iowait and what causes it?
  - Time the CPU sits idle while I/O requests are outstanding; caused by slow disks or heavy I/O load
- How do you monitor disk I/O per process?
  - iotop, pidstat -d
- What is the difference between iostat %util and svctm?
  - %util: share of time the device was busy servicing requests; svctm: average service time per request (deprecated and dropped from newer sysstat)
- How do you trace system calls of a process?
  - strace -p PID
- What is the difference between ss and netstat?
  - ss (iproute2) is newer and faster; netstat (net-tools) is deprecated
Advanced Questions

- Explain Linux memory management
  - Page cache, buffers, swap, and the available-memory concept (cache is reclaimed when applications need RAM)
- What is the relationship between load average and CPU usage?
  - Load counts running and waiting tasks, including those blocked on I/O; CPU% shows only actual execution time, so load can be high while CPU usage is low
- How do you diagnose high iowait?
  - Check iostat -x for busy devices, then use iotop or pidstat -d to identify the I/O-heavy processes
- What are some methods to reduce system load?
  - Optimize processes, reduce I/O, add hardware, tune kernel parameters
- Explain the different CPU states in mpstat
  - usr, nice, sys, iowait, irq, soft, steal, guest, idle
Summary

Quick Reference

CPU Monitoring:
+-------------------+---------------------------+
| top, htop         | Real-time process monitor |
| mpstat -P ALL     | Per-CPU statistics        |
| sar -u            | Historical CPU data       |
| vmstat            | Virtual memory + CPU      |
+-------------------+---------------------------+

Memory Monitoring:
+-------------------+---------------------------+
| free -h           | Memory usage summary      |
| cat /proc/meminfo | Detailed memory info      |
| smem              | Per-process memory        |
| slabtop           | Kernel slab cache         |
+-------------------+---------------------------+

Disk Monitoring:
+-------------------+---------------------------+
| df -h             | Disk space                |
| du -sh            | Directory sizes           |
| iostat -x         | I/O statistics            |
| iotop             | Per-process I/O           |
| smartctl          | Disk health               |
+-------------------+---------------------------+

Network Monitoring:
+-------------------+---------------------------+
| ss -tunapl        | Socket statistics         |
| iftop             | Bandwidth by host         |
| nethogs           | Per-process bandwidth     |
| tcpdump           | Packet capture            |
+-------------------+---------------------------+

Process Monitoring:
+-------------------+---------------------------+
| ps aux            | Process listing           |
| lsof              | Open files                |
| strace            | System call tracing       |
| pstree            | Process tree              |
+-------------------+---------------------------+