# Chapter 22: awk - Pattern Scanning and Processing

## Overview

awk is a powerful text-processing language designed for pattern scanning and reporting. It reads input line by line and splits each line into fields, which makes it ideal for log analysis, data extraction, and reporting in DevOps environments.
## How awk Works

```
awk Processing Flow
───────────────────

Input line:   field1  field2  field3  field4
                 ↓       ↓       ↓       ↓
                $1      $2      $3      $4  ...  $NF (last field)

Program structure:

  BEGIN { }        runs once, before any input is read
  /pattern/ { }    runs for each line that matches pattern
  { }              runs for every line
  END { }          runs once, after all input is processed
```
## Basic Syntax

```bash
# Basic syntax
awk 'pattern { action }' file.txt

# Using -F for the field separator
awk -F',' '{ print $1 }' file.csv

# Using -v to pass variables
awk -v name=value '{ print name, $1 }' file.txt

# Using -f for a program file
awk -f script.awk file.txt
```
## Field Variables

### $0 vs $1, $2, …

```bash
# $0 - entire line
echo "hello world" | awk '{print $0}'
# hello world

# $1 - first field
echo "hello world" | awk '{print $1}'
# hello

# $2 - second field
echo "hello world" | awk '{print $2}'
# world

# $NF - last field
echo "one two three four" | awk '{print $NF}'
# four

# $(NF-1) - second to last
echo "one two three four" | awk '{print $(NF-1)}'
# three
```

### Number of Fields (NF)

```bash
# Print the number of fields
echo "a b c d e" | awk '{print NF}'
# 5

# Print the last field
echo "a b c d e" | awk '{print $NF}'
# e
```
## Field Separator

### Default Field Separator

```bash
# The default separator is any run of whitespace
echo "one two three" | awk '{print $1, $2}'
# one two

# Multiple spaces are handled correctly
echo "one    two    three" | awk '{print $1, $2}'
# one two
```

### Custom Field Separator

```bash
# Using -F
awk -F',' '{print $1, $2}' file.csv

# Using FS in BEGIN
awk 'BEGIN {FS=","} {print $1, $2}' file.csv

# Regex separator (any of : , or |)
awk 'BEGIN {FS="[:,|]"} {print $1, $2}' file.txt

# Output separator
awk 'BEGIN {OFS=" - "} {print $1, $2}' file.txt
```
## Patterns

### Relational Patterns

```bash
# Lines where $1 equals "error"
awk '$1 == "error" {print}' logfile

# Lines where $3 > 100
awk '$3 > 100 {print}' data.txt

# Lines where $1 matches "warning"
awk '$1 ~ /warning/ {print}' logfile

# Lines where $1 does NOT match "debug"
awk '$1 !~ /debug/ {print}' logfile
```

### Pattern Examples

```bash
# Match lines containing "ERROR"
awk '/ERROR/ {print}' logfile

# Match lines starting with "2024"
awk '/^2024/ {print}' logfile

# Match lines ending with "failed"
awk '/failed$/ {print}' logfile

# Combined patterns
awk '/ERROR/ || /WARN/ {print}' logfile
```
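A regex pattern and a field condition can be combined in a single rule. A small self-contained sketch (the three sample lines and their format are invented for illustration):

```bash
# Print the path ($3) of ERROR lines whose status field ($2) is 5xx
printf '%s\n' \
  'ERROR 500 /api/users' \
  'INFO 200 /health' \
  'ERROR 404 /missing' |
awk '/ERROR/ && $2 >= 500 {print $3}'
# /api/users
```

Only the first line satisfies both conditions: the 404 line matches the regex but fails the numeric test.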
## Built-in Variables

### Common Variables

| Variable | Description |
|---|---|
| FS | Input field separator |
| OFS | Output field separator |
| RS | Input record separator |
| ORS | Output record separator |
| NF | Number of fields in current record |
| NR | Current record number |
| FNR | Record number in current file |
| FILENAME | Current filename |
| ARGC | Number of command-line arguments |
| ARGV | Array of command-line arguments |
### Examples

```bash
# Prefix each line with its line number
awk '{print NR, $0}' file.txt

# Custom input and output field separators
awk 'BEGIN {FS=","; OFS=" | "} {print $1, $2}' file.csv

# Different record separators for input and output
awk 'BEGIN {RS="\n\n"; ORS="\n\n"} {print}' file.txt
```
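The difference between NR and FNR only shows up with multiple input files. A quick demonstration (the /tmp file names are arbitrary):

```bash
printf 'a\nb\n' > /tmp/nr_demo1.txt
printf 'c\n'    > /tmp/nr_demo2.txt

# NR counts records across all files; FNR restarts at 1 in each file
awk '{print FILENAME, NR, FNR}' /tmp/nr_demo1.txt /tmp/nr_demo2.txt
# /tmp/nr_demo1.txt 1 1
# /tmp/nr_demo1.txt 2 2
# /tmp/nr_demo2.txt 3 1

rm -f /tmp/nr_demo1.txt /tmp/nr_demo2.txt
```

The classic idiom `NR == FNR` uses this to detect "still reading the first file" in two-file scripts.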
## Actions and Expressions

### Print Statement

```bash
# Print the entire line
awk '{print}' file.txt

# Print specific fields
awk '{print $1, $3}' file.txt

# Print with literal text
awk '{print "User:", $1, "Status:", $2}' file.txt

# Print without a newline
awk '{printf "%s ", $1}' file.txt
```

### Printf Statement

```bash
# Formatted output
awk '{printf "%-10s %5d\n", $1, $2}' file.txt

# Numbers with two decimal places
awk '{printf "%.2f\n", $3}' file.txt

# Align columns
awk '{printf "%-20s %10s %10s\n", $1, $2, $3}' file.txt
```
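Combining a header printed in BEGIN with a per-line printf produces a fixed-width table. A minimal sketch with two invented rows:

```bash
printf '%s\n' 'alice 93' 'bob 7' |
awk 'BEGIN {printf "%-10s %5s\n", "NAME", "SCORE"}
           {printf "%-10s %5d\n", $1, $2}'
```

`%-10s` left-aligns the names in a 10-column field and `%5d` right-aligns the scores, so the rows line up under the header regardless of value length.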
## Variables and Operators

### Arithmetic Operators

```bash
# Addition
awk '{print $1 + $2}' file.txt

# Multiplication
awk '{print $1 * $2}' file.txt

# Division
awk '{print $1 / $2}' file.txt

# Modulo
awk '{print $1 % $2}' file.txt
```

### String Operators

```bash
# Concatenation (juxtaposition joins strings)
awk '{print $1 "-" $2}' file.txt

# String length
awk '{print length($1)}' file.txt

# Substring (start position, length)
awk '{print substr($1, 1, 5)}' file.txt
```
## Control Flow

### If-Else

```bash
awk '{
  if ($2 > 50) {
    print "High:", $0
  } else {
    print "Low:", $0
  }
}' file.txt

# One-liner
awk '{if ($2 > 50) print "High"; else print "Low"}' file.txt
```

### Loops

```bash
# For loop
awk '{
  for (i = 1; i <= NF; i++) {
    print $i
  }
}' file.txt

# While loop
awk '{
  i = 1
  while (i <= NF) {
    print $i
    i++
  }
}' file.txt
```
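As a loop exercise, the for loop can also walk the fields backwards. This sketch prints them in reverse order on one line:

```bash
echo "one two three" |
awk '{
  for (i = NF; i >= 1; i--)
    printf "%s%s", $i, (i > 1 ? " " : "\n")
}'
# three two one
```

The ternary expression emits a space between fields and a newline after the last one.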
## Arrays

### Basic Arrays

```bash
# Count occurrences
awk '{count[$1]++} END {for (word in count) print word, count[word]}' file.txt

# Sum values
awk '{sum += $1} END {print sum}' file.txt

# Average
awk '{sum += $1; count++} END {print sum/count}' file.txt
```

### Associative Arrays

```bash
# Using string keys
awk '{ip_count[$1]++} END {for (ip in ip_count) print ip, ip_count[ip]}' access.log

# Multiple keys (joined internally with SUBSEP)
awk '{count[$1, $2]++} END {for (k in count) print k, count[k]}' file.txt
```
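Multi-key subscripts are joined with SUBSEP, which split() can undo in the END block. A sketch summing a third column per (method, path) pair; the sample lines are invented:

```bash
printf '%s\n' 'GET /a 1' 'POST /a 2' 'GET /a 3' |
awk '{ sum[$1, $2] += $3 }
     END {
       for (k in sum) {
         split(k, parts, SUBSEP)   # recover the two key components
         print parts[1], parts[2], sum[k]
       }
     }'
```

The iteration order of `for (k in sum)` is unspecified, so pipe through `sort` when you need stable output.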
## Functions

### Built-in Functions

```bash
# String functions
awk '{print toupper($1)}' file.txt
awk '{print tolower($1)}' file.txt
awk '{print length($0)}' file.txt
awk '{print substr($1, 2, 5)}' file.txt
awk '{print index($1, "test")}' file.txt
awk '{print gsub(/old/, "new", $1)}' file.txt   # gsub returns the substitution count
awk '{print split($1, arr, "-")}' file.txt      # split returns the number of pieces

# Math functions
awk '{print sqrt($1)}' file.txt
awk '{print int($1)}' file.txt
awk '{print sin($1)}' file.txt
awk 'BEGIN {srand(); print rand()}'
```

### User-Defined Functions

```bash
awk '
function max(a, b) {
    return (a > b) ? a : b
}
{print max($1, $2)}' file.txt
```
Section titled “Practical DevOps Examples”1. Log Analysis
Section titled “1. Log Analysis”# Count HTTP status codesawk '{print $9}' access.log | sort | uniq -c | sort -rn
# Find 5xx errorsawk '$9 ~ /^[56][0-9][0-9]/ {print}' access.log
# Extract top IPsawk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# Response time analysisawk '{sum+=$NF; count++} END {print "Average:", sum/count}' access.log2. Parse System Logs
### 2. Parse System Logs

```bash
# Extract failed SSH login attempts
awk '/Failed password/ {print $1, $2, $3, $11}' /var/log/auth.log

# Count errors by hour
awk '/ERROR/ {print $2}' app.log | cut -d: -f1 | sort | uniq -c

# System resource alerts
awk '$5 > 90 {print "ALERT:", $0}' system.log
```
### 3. CSV Processing

```bash
# Print specific columns
awk -F',' '{print $1, $3, $5}' data.csv

# Skip the header row
awk -F',' 'NR > 1 {print}' data.csv

# Sum a column
awk -F',' 'NR > 1 {sum += $4} END {print sum}' data.csv

# Filter by column value
awk -F',' '$3 > 1000 {print}' data.csv
```
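For CSVs it is often safer to address columns by header name than by position. A sketch that builds a name-to-index map from the first row (sample data invented; works for simple CSVs without quoted commas):

```bash
printf '%s\n' 'name,region,cost' 'db1,us-east,120' 'db2,eu-west,80' |
awk -F',' '
NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header -> column
        { print $col["name"], $col["cost"] }'
# db1 120
# db2 80
```

If a column is later reordered in the file, the script keeps working without edits.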
### 4. Kubernetes Logs

```bash
# Extract key=value tokens such as level=...
kubectl logs app | awk '{for (i = 1; i <= NF; i++) if ($i ~ /^level=/) print $i}'

# Group lines by timestamp fields
kubectl logs app | awk '{print $1, $2}' | sort | uniq -c

# Error aggregation
kubectl logs app | awk '/ERROR/ {errors[$NF]++} END {for (e in errors) print e, errors[e]}'
```
### 5. Docker Stats

```bash
# Parse docker stats output
docker stats --no-stream | awk '{print $2, $3, $4}'

# Container resource usage
docker stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemUsage}}" | \
  awk '$2 ~ /[0-9]+\.[0-9]+%/ {print}'
```
### 6. AWS CLI Output

```bash
# List EC2 instances with jq
aws ec2 describe-instances | \
  jq -r '.Reservations[].Instances[] | "\(.InstanceId) \(.State.Name) \(.PublicIpAddress)"'

# Using awk after jq
aws ec2 describe-instances | \
  jq -r '.Reservations[].Instances[] | "\(.Tags[] | select(.Key=="Name").Value) \(.InstanceType)"' | \
  awk '{print $1, $2}'
```
### 7. Network Analysis

```bash
# Count connections by state
netstat -ant | awk '{print $6}' | sort | uniq -c

# Parse ss output
ss -tunap | awk 'NR > 1 {print $1, $5, $6}'

# Firewall counters
iptables -L -v -n | awk 'NR > 2 {print $2, $3, $9}'
```
### 8. Disk Usage Analysis

```bash
# Find large files and directories
du -sh /* 2>/dev/null | awk '{print $1, $2}' | sort -h

# Disk usage by directory
du -h /var | awk '{print $1, $2}' | sort -h

# Percentage usage with the % sign stripped
df -h | awk 'NR > 1 {gsub(/%/, "", $5); print $1, $5}'
```
## BEGIN and END Blocks

### Using BEGIN and END

```bash
# BEGIN runs once before input, END once after
awk 'BEGIN {print "Starting..."} {print} END {print "Done"}' file.txt

# Initialize variables
awk 'BEGIN {sum=0; count=0} {sum+=$1; count++} END {print "Average:", sum/count}' file.txt

# Custom headers and footers
awk 'BEGIN {print "Name\tScore\tGrade"} {print} END {print "---"}' file.txt
```

### Multi-line Processing

```bash
# Summarize a log in a single pass
awk 'BEGIN {print "=== Report ==="}
     /ERROR/ {errors++}
     /WARN/  {warnings++}
     END {print "Errors:", errors+0, "Warnings:", warnings+0}' logfile
```
## Complex Examples

### Report Generation

```bash
#!/usr/bin/env bash
# Generate a system report

echo "=== System Report ==="
echo "Date: $(date)"
echo

echo "Top Processes by Memory:"
ps aux --sort=-%mem | awk 'NR > 1 && NR <= 6 {printf "%-30s %10s %10s\n", $11, $6"KB", $3"%"}'

echo
echo "Disk Usage:"
df -h | awk 'NR > 1 {printf "%-20s %10s %10s\n", $1, $3, $5}'

echo
echo "Network Connections:"
netstat -ant | awk '{print $6}' | sort | uniq -c | sort -rn
```

### Log Aggregation
Section titled “Log Aggregation”#!/usr/bin/env bash# Aggregate application logs
LOG_DIR="/var/log/myapp"OUTPUT="/tmp/log_summary.txt"
echo "Log Summary - $(date)" > "$OUTPUT"echo "===================" >> "$OUTPUT"
# Count by levelecho "" >> "$OUTPUT"echo "Messages by Level:" >> "$OUTPUT"grep -h "level=" "$LOG_DIR"/*.log 2>/dev/null | \ awk -F'level=' '{print $2}' | cut -d, -f1 | sort | uniq -c | sort -rn >> "$OUTPUT"
# Top errorsecho "" >> "$OUTPUT"echo "Top 10 Errors:" >> "$OUTPUT"grep -h "ERROR" "$LOG_DIR"/*.log 2>/dev/null | \ awk '{print $NF}' | sort | uniq -c | sort -rn | head -10 >> "$OUTPUT"
cat "$OUTPUT"Performance Tips
## Performance Tips

### Efficiency

```bash
# Filter with a pattern (awk skips the action for non-matching lines)
awk '/ERROR/ {print}' largefile.log

# Reference only the fields you need
awk '{print $1, $2}' largefile.log

# Keep loops bounded; an unnecessary per-field loop on a large file
# can dominate the run time
awk '{for (i = 1; i <= NF; i++) total += $i} END {print total}' largefile.log
```

### Best Practices
```bash
# Set FS once (in BEGIN or with -F) rather than per record
awk 'BEGIN {FS=","} {print $1}' file.csv

# Close files you write to when you are done with them
awk '{print > "output.txt"} END {close("output.txt")}' file.txt

# Use next to skip records early
awk 'NR == 1 {next} {print}' file.txt
```
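Splitting one stream into many files shows why close() matters: without it, every distinct key keeps a file descriptor open. A sketch with invented /tmp paths:

```bash
rm -f /tmp/out_a.txt /tmp/out_b.txt

# Append each record's value to a per-key file, closing the handle each time
printf '%s\n' 'a 1' 'b 2' 'a 3' |
awk '{ f = "/tmp/out_" $1 ".txt"; print $2 >> f; close(f) }'

cat /tmp/out_a.txt
# 1
# 3
rm -f /tmp/out_a.txt /tmp/out_b.txt
```

Note the `>>`: after close(), reopening with `>` would truncate the file, losing earlier records for that key.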
## Summary

In this chapter, you learned:
- ✅ How awk processes text (fields and records)
- ✅ Field variables ($0, $1, $NF, etc.)
- ✅ Field separators (FS, OFS)
- ✅ Patterns and pattern matching
- ✅ Built-in variables (NR, NF, etc.)
- ✅ Actions and expressions
- ✅ Control flow (if-else, loops)
- ✅ Arrays and associative arrays
- ✅ Built-in and user-defined functions
- ✅ BEGIN and END blocks
- ✅ Practical DevOps examples
## Next Steps

Continue to the next chapter to learn about Process Management in Bash.

Previous Chapter: sed - Stream Editor | Next Chapter: Process Management