eBPF
Chapter 84: eBPF (Extended Berkeley Packet Filter)
Overview
eBPF lets sandboxed programs run inside the Linux kernel without modifying kernel source code or loading kernel modules. It grew out of the Berkeley Packet Filter (BPF), which was designed for packet filtering, and has evolved into a general framework for networking, tracing, security, and performance analysis. Because much cloud-native infrastructure and observability tooling is built on it, eBPF is important knowledge for DevOps and SRE roles.
Why This Matters in DevOps/SRE
eBPF has become a core building block of modern cloud-native environments:
- Observability: Tools like Falco, Cilium, and Tetragon use eBPF for security monitoring
- Performance: Low-overhead tracing and network packet processing
- Networking: XDP processes packets in the driver, before the kernel network stack, which suits high-throughput load balancers
- Security: Runtime threat detection without kernel modules or application changes
- Troubleshooting: Deep system visibility with low overhead
Major CNCF projects (Cilium, Falco) rely on eBPF. Understanding it is crucial for SREs working with Kubernetes.
84.1 eBPF Architecture
The eBPF ecosystem has three layers:

- User space: front ends such as bpftrace (high-level), BCC (Python/Lua), libbpf (C), and Go bindings. All of them load programs into the kernel through the bpf() syscall.
- Kernel space, eBPF subsystem: JIT compiler, verifier, maps, and helpers.
- Kernel space, hook points: kprobe, fentry, tracepoint, socket filter, perf events, cgroup socket, LSM, and XDP.

How eBPF Works
An eBPF program goes through four stages:

1. WRITE: the program is written in C, Python (BCC), Go, or the bpftrace DSL.

    // Example: count read() syscalls
    SEC("tracepoint/syscalls/sys_enter_read")
    int count_read(void *ctx) {
        __u32 key = 0;
        __u64 *counter = bpf_map_lookup_elem(&counter_map, &key);
        if (counter)
            __sync_fetch_and_add(counter, 1);
        return 0;
    }

2. LOAD (via the bpf() syscall):
   - The program is submitted to the kernel
   - The BPF verifier checks that there are no unsafe instructions, no unbounded loops, valid stack bounds, and that all memory accesses are checked
   - The JIT compiles the program to native code
   - The program is attached to a hook point

3. EXECUTE: when the hook triggers,
   - The JIT-compiled code runs
   - It can read kernel data via helpers
   - It can use eBPF maps for storage
   - It can send events to user space

4. READ RESULTS: user space collects data through
   - eBPF maps
   - Per-CPU perf buffers
   - The ring buffer, for events
   - perf events, for sampling

eBPF vs Kernel Modules
| Aspect | eBPF | Kernel Module |
|---|---|---|
| Safety | Checked by the in-kernel verifier before it runs; designed not to crash the kernel | Can crash the kernel |
| Loading | Loaded and attached at runtime, no reboot required | Must be loaded and unloaded manually |
| Root access | Required (root, or CAP_BPF plus related capabilities on kernel 5.8+) | Required |
| Debugging | Easier - verified before execution | Harder - bugs can panic the kernel |
| Capabilities | Limited to allowed helpers | Full kernel access |
| Update | Hot-reload possible | Requires reload |
| Lifecycle | Managed by kernel | Manual management |
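
Several rows in the table above (loading, root access) and the JIT step in the lifecycle depend on how the running kernel is configured. The following is a minimal Python sketch, under stated assumptions, that reports a few of those settings before you try to load a program. The names `read_setting` and `ebpf_preflight` are hypothetical and chosen only for this illustration. The file paths are the usual procfs/sysfs locations; they may be absent or unreadable on older kernels, hardened systems, or non-Linux hosts, in which case the sketch reports `None` or `False` rather than failing.

```python
import os
import platform
from pathlib import Path

# Usual procfs locations for two eBPF-related sysctls (assumed; may be absent).
SYSCTL_PATHS = {
    "jit_enabled": "/proc/sys/net/core/bpf_jit_enable",
    "unprivileged_bpf_disabled": "/proc/sys/kernel/unprivileged_bpf_disabled",
}


def read_setting(path):
    """Return the stripped contents of a procfs/sysfs file, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:  # missing file, permission denied, non-Linux host
        return None


def ebpf_preflight():
    """Collect a few facts that affect whether eBPF programs can be loaded."""
    report = {
        "kernel_release": platform.release(),
        # geteuid() does not exist on every platform, so guard the call.
        "running_as_root": hasattr(os, "geteuid") and os.geteuid() == 0,
        # BTF type information is needed by fentry/fexit and CO-RE programs.
        "btf_available": Path("/sys/kernel/btf/vmlinux").exists(),
        # Conventional mount point of the BPF filesystem used for pinning.
        "bpffs_present": Path("/sys/fs/bpf").is_dir(),
    }
    for name, path in SYSCTL_PATHS.items():
        report[name] = read_setting(path)
    return report


if __name__ == "__main__":
    for key, value in ebpf_preflight().items():
        print(f"{key:28} {value}")
```

When reading the output, a `jit_enabled` value of `1` means the JIT is on, and a non-zero `unprivileged_bpf_disabled` means unprivileged users cannot call bpf(). A `None` only says the file could not be read, not that the feature is missing.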
84.2 eBPF Program Types
Tracing (observability):

| Program type | Purpose |
|---|---|
| kprobe / fentry | Kernel function entry (traces kernel function calls) |
| kretprobe / fexit | Function return |
| tracepoint | Fixed kernel trace points (stable API) |
| perf_event | Hardware/software performance counters |
| raw_tracepoint | Low-level trace points |
| LSM (Linux Security Module) | Security hooks (the same hooks SELinux and AppArmor use) |

Networking:

| Program type | Purpose |
|---|---|
| xdp (eXpress Data Path) | Packet processing at NIC level (before sk_buff allocation) |
| socket_filter | Filter packets at socket level |
| cgroup_skb | Per-cgroup packet filtering |
| cgroup_sock | Socket creation/connection hooks |
| sock_ops | TCP connection events |
| sk_msg | Socket message (sendmsg) hooks |
| flow_dissector | Packet flow parsing |

Container (security):

| Program type | Purpose |
|---|---|
| landattach | Container lifecycle (OCI) |
| seccomp | Syscall filtering (seccomp filters use classic BPF) |
| bpf_iter | Iterator for kernel data |

84.3 eBPF Maps
Maps provide key-value storage accessible from both kernel and user space programs.

| Map type | Characteristics |
|---|---|
| Hash map | Key→value, dynamic |
| Array map | Index→value, fast, fixed size |
| Per-CPU hash | No locking needed |
| Per-CPU array | No locking needed |
| Stack | LIFO push/pop |
| LRU hash | Evicts least recently used entries |
| Bloom filter | Approximate membership test |
| Ring buffer | Event streaming |
| Hash of maps | Map-in-map |
| Program array | Tail calls |

Access pattern:

- Kernel space (eBPF program):
  - bpf_map_lookup_elem(&map, &key)
  - bpf_map_update_elem(&map, &key, &val, BPF_ANY)
  - bpf_map_delete_elem(&map, &key)
- User space: the same operations through the bpf() syscall, for example libbpf's bpf_map_lookup_elem(fd, &key, &val), which takes a map file descriptor.

Individual map operations are atomic for most map types. A lookup followed by a modification of the value is not.

84.4 bpftrace - High-Level Tracing
Installation and Setup
# ============================================================
# INSTALLING bpftrace
# ============================================================

# Ubuntu/Debian
sudo apt install bpftrace

# RHEL/CentOS
sudo yum install bpftrace

# Arch Linux
sudo pacman -S bpftrace

# Verify installation
bpftrace --version
bpftrace -l | head -20

# Check available probes
bpftrace -l                            # List all probes
bpftrace -l '*open*'                   # Filter probes
bpftrace -l 'kprobe:*'                 # Kernel function probes only
bpftrace -l 'usdt:/path/to/binary:*'   # USDT probes in a given binary
bpftrace Language
Structure:

    probe[,probe]
    /condition/
    {
        action;
        action;
    }

Probes:

| Type | Example |
|---|---|
| kprobe | kprobe:do_nanosleep |
| kretprobe | kretprobe:do_nanosleep |
| tracepoint | tracepoint:syscalls:sys_enter_read |
| software event | software:page-faults:100 |
| interval | interval:s:1 |
| USDT | usdt:node:node:gc_start |
| watchpoint | watchpoint:0x5657800:8:rw |

Variables:

| Syntax | Meaning |
|---|---|
| @name | Global map variable (persists until deleted or cleared, printed at exit) |
| @name[key] | Global map keyed by one or more values |
| $name | Scratch (local) variable |
| $1, $2... | Positional parameters |

Builtins:

    pid, tid, uid, gid, nsecs, cpu, comm, probe, curtask,
    args, retval, func, kstack, ustack

Functions:

    print(), join(), hist(), lhist(), count(), sum(), avg(),
    min(), max(), delete(), clear(), zero(), time(), printf(),
    str(), buf()

Practical bpftrace Examples
# ============================================================
# BASIC TRACING EXAMPLES
# ============================================================

# 1. Trace all open() syscalls
bpftrace -e 'tracepoint:syscalls:sys_enter_open {
    printf("%s %s\n", comm, str(args->filename));
}'

# 2. Trace file opens with PID and process name
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
    printf("%d %s: %s\n", pid, comm, str(args->filename));
}'

# 3. Count syscalls by process
bpftrace -e 'tracepoint:syscalls:sys_enter_* /comm != "bpftrace"/ {
    @[comm] = count();
}'

# 4. Measure read() latency for one process
#    (append the target PID after the closing quote; it becomes $1)
bpftrace -e 'tracepoint:syscalls:sys_enter_read /pid == $1/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
    @latency = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
# 5. Trace TCP connection state changes
bpftrace -e 'tracepoint:sock:inet_sock_set_state {
    printf("PID: %d, Family: %d, Protocol: %d\n", pid, args->family, args->protocol);
}'

# 6. Sum page allocations in bytes (assumes 4 KiB pages)
bpftrace -e 'tracepoint:kmem:mm_page_alloc {
    @alloc_bytes = sum(4096 << args->order);
}'

# 7. Block I/O latency histogram (microseconds)
bpftrace -e 'tracepoint:block:block_rq_issue {
    @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    @io_latency = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}'

# 8. Context switch analysis: switches per task and time between switches per CPU
bpftrace -e 'tracepoint:sched:sched_switch {
    @switches[args->prev_comm] = count();
    if (@last[cpu]) {
        @switch_interval_ns = hist(nsecs - @last[cpu]);
    }
    @last[cpu] = nsecs;
}'

# 9. Page fault analysis
bpftrace -e 'software:page-faults:1 {
    @page_faults[comm] = count();
}'
Advanced bpftrace Scripts
# ============================================================
# ADVANCED bpftrace SCRIPTS
# ============================================================

# biolatency.bt - Block I/O latency
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing block I/O latency... Hit Ctrl-C to end.\n");
}

// Record the issue time of each request, keyed by device and sector
tracepoint:block:block_rq_issue
{
    @start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
    @latency = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}

END
{
    printf("\nBlock I/O latency (microseconds):\n");
    print(@latency);
    clear(@latency);
    clear(@start);
}
# ============================================================
# fileslower.bt - Slow file I/O (reads and writes slower than 10 µs)
#!/usr/bin/env bpftrace

// The enter tracepoints carry the fd but no file name, so key state by thread
tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
{
    @fd[tid] = args->fd;
    @start_time[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_read,
tracepoint:syscalls:sys_exit_write
/@start_time[tid]/
{
    $lat = (nsecs - @start_time[tid]) / 1000;
    if ($lat > 10) {
        printf("%s fd=%d (%s) %d µs\n", comm, @fd[tid],
            probe == "tracepoint:syscalls:sys_exit_read" ? "R" : "W", $lat);
    }
    delete(@fd[tid]);
    delete(@start_time[tid]);
}
# ============================================================
# offcputime.bt - Off-CPU Time
#!/usr/bin/env bpftrace

tracepoint:sched:sched_switch
{
    // The previous task is leaving the CPU: remember when
    @start[args->prev_pid] = nsecs;

    // The next task is going back on CPU: add the time it spent off CPU
    $off = @start[args->next_pid];
    if ($off) {
        @total[args->next_comm] = sum(nsecs - $off);
        delete(@start[args->next_pid]);
    }
}

END
{
    printf("\nOff-CPU time by process (top 10, ms):\n");
    // Third argument divides the stored nanoseconds to print milliseconds
    print(@total, 10, 1000000);
    clear(@total);
    clear(@start);
}
84.5 BCC (BPF Compiler Collection)
BCC Overview
BCC provides Python/Lua bindings for eBPF with many pre-built tools for common observability tasks.

I/O analysis:

| Tool | Purpose |
|---|---|
| biosnoop | Block I/O by request |
| biostacks | Block I/O with kernel stacks |
| biotop | Top-like block I/O by process |
| opensnoop | Trace open() syscalls |
| filetop | Top files by read/write activity |
| filelife | Track file lifetime |
| fileslower | Trace slow file I/O |

Network analysis:

| Tool | Purpose |
|---|---|
| tcplife | TCP connection lifecycle |
| tcpretrans | TCP retransmissions |
| tcpconnect | Outgoing TCP connections |
| tcpaccept | Incoming TCP connections |
| tcptop | Top TCP throughput by connection |
| tcpdrop | TCP packets dropped by the kernel |
| tcpstates | TCP state changes |

Latency analysis:

| Tool | Purpose |
|---|---|
| offcputime | Off-CPU time (blocking) |
| offwaketime | Off-CPU with wakeup information |
| runqlat | Scheduler run queue latency |
| runqlen | Scheduler run queue length |
| biolatency | Block I/O latency |
| ext4slower | ext4 slow operations |
| nfsslower | NFS slow operations |

Memory analysis:

| Tool | Purpose |
|---|---|
| memleak | Memory leak detector |
| oomkill | OOM killer traces |
| cachestat | Page cache statistics |
| drsnoop | Direct reclaim events |

Syscalls and functions:

| Tool | Purpose |
|---|---|
| syscount | System call counts |
| trace | Custom kernel and user-space probes |
| funclatency | Function latency |
| llcstat | Last Level Cache statistics |

Installing and Using BCC
# ============================================================
# INSTALLING BCC
# ============================================================

# Ubuntu (tools are installed with a -bpfcc suffix, e.g. funclatency-bpfcc)
sudo apt install bpfcc-tools linux-headers-$(uname -r)

# RHEL/CentOS (tools are installed under /usr/share/bcc/tools)
sudo yum install bcc-tools kernel-devel-$(uname -r)

# Verify
ls /usr/share/bcc/tools/          # Packaged tools (RHEL layout)
which funclatency-bpfcc || ls /usr/share/bcc/tools/funclatency

# ============================================================
# COMMON BCC TOOLS EXAMPLES
# ============================================================
# Paths below use the RHEL layout; on Ubuntu run the <tool>-bpfcc equivalent

# Trace file opens
sudo /usr/share/bcc/tools/opensnoop

# Trace TCP connections
sudo /usr/share/bcc/tools/tcpconnect

# Block I/O latency
sudo /usr/share/bcc/tools/biolatency

# Memory leak detection (oldest process matching "myapp")
sudo /usr/share/bcc/tools/memleak -p $(pgrep -o -f myapp)

# System call frequency
sudo /usr/share/bcc/tools/syscount

# Scheduler run queue latency
sudo /usr/share/bcc/tools/runqlat
Writing BCC Programs (Python)
#!/usr/bin/env python3
# Example: BCC Python program that counts read() syscalls per process

from time import sleep

from bcc import BPF

program = """
#include <linux/sched.h>

// Count read() syscalls by process
BPF_HASH(counts, u32, u64);

int trace_sys_enter(struct pt_regs *ctx) {
    // Upper 32 bits hold the process ID (TGID), lower 32 bits the thread ID
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

bpf = BPF(text=program)
syscall = bpf.get_syscall_fnname("read")
bpf.attach_kprobe(event=syscall, fn_name="trace_sys_enter")

print("Tracing read() syscalls... Press Ctrl+C to exit")

while True:
    try:
        sleep(2)
    except KeyboardInterrupt:
        break

    print("\n%-20s %-10s" % ("PID", "COUNT"))
    print("%-20s %-10s" % ("---", "-----"))

    # Show the ten busiest processes of the last interval
    top = sorted(bpf["counts"].items(), key=lambda kv: -kv[1].value)[:10]
    for pid, count in top:
        print("%-20d %-10d" % (pid.value, count.value))

    bpf["counts"].clear()
84.6 XDP (eXpress Data Path)
XDP Overview
XDP allows eBPF programs to process packets at the earliest possible point: directly after the driver receives them from the NIC, before the kernel networking stack allocates an sk_buff.

Packet processing stages:

    NIC ──► Driver ──► XDP (eBPF runs here) ──► sk_buff allocation ──► Network stack

Because no sk_buff has been allocated yet, dropping or redirecting a packet at this stage costs far less than doing so later in the network stack.

XDP actions:

| Action | Effect |
|---|---|
| XDP_DROP | Drop packet (DDoS mitigation, firewall) |
| XDP_PASS | Pass to kernel network stack |
| XDP_TX | Bounce packet back out the same interface |
| XDP_REDIRECT | Redirect to another interface/queue |
| XDP_ABORTED | Drop and raise the xdp:xdp_exception tracepoint (signals a program error) |
| Any other value | Treated as an error; the packet is dropped |

Use cases:

- DDoS mitigation
- Load balancing
- Packet filtering/firewall
- Traffic steering
- Network telemetry
- Edge computing

XDP Example
// xdp_drop_browsers.c - Drop TCP traffic destined for HTTP (80) and HTTPS (443)
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

// Return the IPv4 protocol number, or 0 if the header does not fit in the packet
static __always_inline int parse_ipv4(void *data, __u64 off, void *data_end) {
    struct iphdr *iph = data + off;
    if ((void *)(iph + 1) > data_end)
        return 0;
    return iph->protocol;
}

SEC("xdp_drop_browsers")
int xdp_drop_browsers(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;

    // Bounds check required by the verifier before touching the Ethernet header
    if (data + sizeof(*eth) > data_end)
        return XDP_PASS;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    __u64 ip_off = sizeof(*eth);
    __u8 protocol = parse_ipv4(data, ip_off, data_end);

    if (protocol == IPPROTO_TCP) {
        // Assumes a 20-byte IPv4 header (no IP options)
        struct tcphdr *tcp = data + ip_off + sizeof(struct iphdr);
        if ((void *)(tcp + 1) <= data_end) {
            // Drop traffic to HTTP (port 80) or HTTPS (443)
            __u16 dport = bpf_ntohs(tcp->dest);
            if (dport == 80 || dport == 443)
                return XDP_DROP;
        }
    }

    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
Loading and Using XDP
# ============================================================
# LOADING XDP PROGRAMS
# ============================================================

# Attach with the ip command (uses native driver mode when the driver supports it)
ip link set dev eth0 xdp obj xdp_drop_browsers.o sec xdp_drop_browsers

# Check XDP status
ip link show eth0

# Remove XDP program
ip link set dev eth0 xdp off

# Generic (SKB) mode: works without driver support, but slower
ip link set dev eth0 xdpgeneric obj xdp_drop_browsers.o sec xdp_drop_browsers

# Record XDP exceptions with perf
perf record -a -g -e xdp:xdp_exception &
# Fires when a program returns XDP_ABORTED or an invalid action

# Show XDP programs attached to the interface
bpftool net show dev eth0
84.7 Security with eBPF
LSM (Linux Security Module) Hooks
eBPF can be used for security monitoring and enforcement via Linux Security Module (LSM) hooks. BPF LSM programs require kernel 5.7 or later built with CONFIG_BPF_LSM, and the exact set of hooks varies by kernel version.

Available hooks:

| Hook | Purpose |
|---|---|
| bpf_lsm_task_alloc | Task allocation |
| bpf_lsm_file_permission | File permission checks |
| bpf_lsm_file_ioctl | File ioctl operations |
| bpf_lsm_file_lock | File locking |
| bpf_lsm_bprm_check_security | Binary execution |
| bpf_lsm_sb_mount | Mount operations |
| bpf_lsm_sb_umount | Unmount operations |
| bpf_lsm_sb_kern_mount | Filesystem mount |
| bpf_lsm_inode_permission | Inode permission checks |
| bpf_lsm_inode_setattr | Inode attribute changes |
| bpf_lsm_inode_create | Inode creation |
| bpf_lsm_inode_link | Hard link creation |
| bpf_lsm_inode_unlink | File unlink |
| bpf_lsm_path_truncate | Path truncation |
| bpf_lsm_path_mkdir | Directory creation |
| bpf_lsm_path_rmdir | Directory removal |
| bpf_lsm_path_rename | Rename operations |
| bpf_lsm_path_chmod | Permission changes |
| bpf_lsm_path_chown | Ownership changes |
| bpf_lsm_sk_alloc_security | Socket allocation |
| bpf_lsm_sk_free_security | Socket free |
| bpf_lsm_socket_accept | Socket accept |
| bpf_lsm_socket_connect | Socket connect |

Security tools using eBPF:

| Tool | Focus |
|---|---|
| Falco | Container runtime security |
| Tetragon | Security observability |
| Tracee | Container forensics |
| Cilium | Network security (API-aware) |
| Pixie | Observability platform |

Container Security with eBPF
# ============================================================
# Falco - Container Security with eBPF
# ============================================================

# Install Falco (requires the Falco package repository to be configured)
sudo apt install falco
# Or: sudo docker run -d --name falco -v /var/run/docker.sock:/var/run/docker.sock \
#       -v /dev:/dev --cap-add SYS_PTRACE falcosecurity/falco

# Example Falco rules for container security
# (written to the local rules file so the default rules stay intact)
sudo tee /etc/falco/falco_rules.local.yaml > /dev/null << 'EOF'
- rule: Detect shell in container
  desc: A shell was spawned in a container
  condition: evt.type in (execve, execveat) and container.id != host and proc.name = bash
  output: "Shell detected in container (user=%user.name container=%container.id shell=%proc.name)"
  priority: WARNING

- rule: Sensitive file access
  desc: Access to sensitive files
  condition: evt.type in (open, openat, openat2) and container.id != host and (fd.name=/etc/shadow or fd.name=/etc/passwd)
  output: "Sensitive file accessed (file=%fd.name user=%user.name)"
  priority: CRITICAL

- rule: Network connection outside container
  desc: Container making network connections
  condition: container.id != host and evt.type=connect
  output: "Network connection from container (container=%container.id target=%fd.name)"
  priority: NOTICE
EOF

# Start Falco
sudo falco
84.8 Production Use Cases
Performance Troubleshooting
#!/bin/bash
# Complete performance analysis using eBPF (BCC tools)

BCC=/usr/share/bcc/tools   # RHEL layout; on Ubuntu use the <tool>-bpfcc commands instead

echo "=== System-wide eBPF Performance Analysis ==="

if [ ! -d "$BCC" ]; then
    echo "BCC not available"
    exit 1
fi

echo -e "\n--- 1. CPU Analysis (runqlat) ---"
# Run queue latency: one 5-second sample
sudo "$BCC/runqlat" 5 1

echo -e "\n--- 2. Block I/O (biolatency) ---"
sudo "$BCC/biolatency" 5 1

echo -e "\n--- 3. Memory Leaks (memleak) ---"
# Quick 10-second check
sudo "$BCC/memleak" 10 1

echo -e "\n--- 4. Network Analysis (tcpretrans) ---"
# tcpretrans runs until interrupted, so stop it after 10 seconds
sudo timeout -s INT 10 "$BCC/tcpretrans"

echo -e "\n--- 5. Syscall Frequency (syscount) ---"
# Count syscalls for 5 seconds
sudo "$BCC/syscount" -d 5
Building Custom eBPF Tools
#!/bin/bash
# Building eBPF tools from scratch

# Prerequisites
sudo apt install clang llvm libbpf-dev linux-headers-$(uname -r)

# Directory structure
mkdir -p ~/ebpf/{src,include,scripts}
cd ~/ebpf

# Example bpftrace tool
cat > scripts/tracesnoop.bt << 'EOF'
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing open() syscalls. Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat
{
    printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}
EOF

# Run it
chmod +x scripts/tracesnoop.bt
sudo bpftrace scripts/tracesnoop.bt
84.9 Interview Questions
Q1: What is eBPF and how does it differ from the original BPF?

A1:
- Original BPF (cBPF): Designed for packet filtering in BSD/Linux
  - Small instruction set with two 32-bit registers
  - Used for tcpdump-style packet filtering
- eBPF (extended BPF): Virtual machine in the kernel
  - 64-bit architecture (vs 32-bit cBPF)
  - Richer instruction set and more registers (11 vs 2)
  - Can access kernel data via helper functions
  - Used for: tracing, networking, security
  - JIT compiled for native performance

Q2: Explain the eBPF verification process.

A2:
Before an eBPF program can run, it goes through verification:
1. Load: Program submitted via bpf() syscall
2. Check: The verifier checks that all instructions are valid
3. Bounds: All memory accesses checked and bounds-verified
4. Loop detection: No unbounded loops allowed
5. Stack: Stack pointer stays within bounds (max 512 bytes)
6. Function calls: Only allowed helper functions
7. JIT: After verification, JIT compiles to native code

If any check fails, the program is rejected, which prevents kernel crashes.

Q3: What are eBPF maps and what are they used for?

A3:
eBPF maps are key-value data structures shared between kernel and user space:
- Hash maps, array maps, stack traces, ring buffers
- Allow state persistence across program invocations
- Enable communication between eBPF programs and user space
- Types: hash, array, per-CPU hash, LRU, bloom filter, ring buffer
- Accessed via helpers: bpf_map_lookup_elem, bpf_map_update_elem

Q4: What is XDP and why is it important?

A4:
XDP (eXpress Data Path):
- Processes packets at the earliest point in Linux networking
- Runs directly after the NIC receives a packet, before sk_buff allocation
- Benefits:
  - Much higher packet rates than the regular network stack, since no sk_buff is allocated
  - Lower latency, better DDoS mitigation
  - Can drop, pass, redirect, or transmit packets
- Use cases: load balancing, firewall, DDoS mitigation, telemetry

Q5: How would you trace a performance issue using eBPF?

A5:
Step 1: Identify the subsystem (CPU, memory, I/O, network)
Step 2: Use the appropriate eBPF tool:
  - CPU: offcputime, runqlat, runqlen
  - Memory: memleak, cachestat
  - I/O: biolatency, biotop, fileslower
  - Network: tcplife, tcpretrans, tcpdrop
Step 3: Drill down with more specific tracing
Step 4: Use stack traces to identify root cause
Step 5: Compare with baseline measurements

Q6: What are the differences between kprobe and fentry?

A6:
| Aspect | kprobe/kretprobe | fentry/fexit |
|---|---|---|
| API | Legacy | New (kernel 5.5+) |
| Performance | Slower | Faster |
| Kernel impact | More intrusive | Less intrusive |
| Return access | kretprobe only | fexit has return value |
| pt_regs | Must handle manually | Clean interface |
| Reliability | Can fail | More reliable |

Q7: What is bpftrace and when would you use it vs BCC?

A7:
bpftrace:
- High-level DSL, one-liners and scripts
- Fast prototyping, quick analysis
- Good for ad-hoc debugging
- Less control over optimization

BCC:
- Full Python/Lua bindings
- More complex programs, production tools
- Better performance for long-running programs
- More control over eBPF program behavior

Use bpftrace for quick debugging, BCC for production monitoring.

Q8: How does eBPF contribute to container security?

A8:
- Runtime security monitoring via LSM hooks
- Detect suspicious activities in containers
- Examples:
  - Falco: Container security (file access, syscalls, network)
  - Tetragon: Security observability (runtime enforcement)
  - Cilium: Network policies, API-aware security
- Can enforce policies at kernel level (harder to bypass)
- Zero-trust networking within clusters

Q9: What are the limitations of eBPF?

A9:
- Most program types cannot sleep or make blocking calls
- Limited stack size (512 bytes)
- No unbounded loops (must have finite iterations)
- Cannot access arbitrary kernel memory
- Must use approved helper functions
- Requires root or CAP_BPF-class privileges to load programs
- Can have kernel compatibility issues
- Learning curve for debugging eBPF programs

Q10: Explain how you would implement a DDoS mitigation using eBPF.

A10:
1. Use XDP to catch packets at the earliest point
2. Track connection states in eBPF maps
3. Implement rate limiting:
   - Per-IP counters in hash map
   - Time windows (sliding window or token bucket)
   - Drop packets exceeding threshold
4. Return XDP_DROP for malicious traffic
5. Return XDP_PASS (or XDP_REDIRECT to another interface or CPU) for legitimate traffic
6. Log blocked attempts for analysis
7. Update rules dynamically based on threat intelligence

Q11: What is the difference between perf and eBPF?

A11:
perf:
- Hardware performance counters
- Sampling-based profiling
- Lower overhead for continuous monitoring
- Native kernel tool

eBPF:
- Can instrument every event (not just samples)
- More powerful, can aggregate in-kernel
- Lower overhead for high-frequency events
- More complex to write

Often used together: perf for overview, eBPF for details.

Q12: How would you debug an eBPF program that's not working?

A12:
1. Check verifier output: a failed load returns an error, and the verifier log explains which check failed
2. Use bpftool to inspect loaded programs:
   bpftool prog show
   bpftool prog dump xlated id <ID>
3. Enable verbose output: bpftrace -v script.bt
4. Check kernel logs: dmesg | tail
5. Use bpftool perf show to list programs attached to perf events
6. Verify hook attachment: ip link show (for XDP)
7. Check map contents: bpftool map show, then bpftool map dump id <ID>
8. Test with a minimal program first (one that only returns 0)

Quick Reference
    # bpftrace Commands
    bpftrace -l                        # List probes
    bpftrace -l '*open*'               # Filter probes
    bpftrace -e 'probe { action }'     # One-liner
    bpftrace script.bt                 # Run script
    bpftrace --dry-run script.bt       # Parse without running

    # BCC Tools (packaged installs usually place them under /usr/share/bcc/tools)
    /usr/share/bcc/tools/fileslower    # Trace slow file reads/writes
    /usr/share/bcc/tools/biolatency    # I/O latency
    /usr/share/bcc/tools/tcplife       # TCP lifespan
    /usr/share/bcc/tools/memleak       # Memory leaks

    # System Commands
    ip link set dev eth0 xdp obj prog.o sec xdp    # Load XDP
    ip link show eth0                              # Check XDP status
    bpftool prog show                              # List programs
    bpftool map show                               # List maps
    cat /sys/kernel/debug/tracing/trace_pipe       # View traces

Common Mistakes & Anti-Patterns
1. Running eBPF Without Root
WRONG:
    # Trying to run bpftrace without proper permissions
    bpftrace -e 'kprobe:do_nanosleep { printf("sleeping\n"); }'
    # Will fail: Operation not permitted
CORRECT:
    # Use bpftrace with proper permissions
    sudo bpftrace -e 'kprobe:do_nanosleep { printf("sleeping\n"); }'

    # Or add the user to a dedicated group, but only if your distribution or
    # site policy has set one up (the group name here is an example)
    sudo usermod -aG bpftrace $USER
Why: Loading eBPF programs and attaching probes requires elevated privileges (root, or capabilities such as CAP_BPF and CAP_PERFMON on kernel 5.8+).
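A tool can fail early with a clear message instead of surfacing a bare "Operation not permitted". The following is a minimal sketch in plain C, assuming that an effective UID of 0 is the only condition checked; it deliberately ignores capabilities, so a process holding CAP_BPF without root would be flagged anyway. The function names are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the given effective UID is root, 0 otherwise. */
int euid_is_root(uid_t euid) {
    return euid == 0;
}

/* Hypothetical pre-flight check for a tracing tool.
 * Returns NULL when the process looks privileged enough,
 * or a hint to show the user otherwise.
 * Assumption: root is required; capabilities are not inspected. */
const char *privilege_hint(uid_t euid) {
    if (euid_is_root(euid))
        return NULL;
    return "eBPF tracing usually needs root: re-run with sudo";
}

/* Convenience wrapper for the current process. */
const char *current_privilege_hint(void) {
    return privilege_hint(geteuid());
}
```

A caller would print the string returned by `current_privilege_hint()` and exit when it is not NULL, before attempting any bpf() call.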
2. Not Checking Kernel Support
WRONG:
    # Assuming eBPF is always available
    bpftrace -e 'BEGIN { printf("Hello\n"); exit(); }'
CORRECT:
    # Check kernel support
    uname -r
    cat /proc/sys/kernel/bpf_stats_enabled

    # Check for required features
    ls /sys/kernel/debug/tracing/
    cat /proc/sys/net/core/bpf_jit_enable    # JIT compiler status

    # Check bpftrace capabilities
    bpftrace --version
    bpftrace -l | head -20
Why: eBPF features vary by kernel version and configuration.
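The version check can also be done programmatically before choosing a probe type. This is a small sketch, assuming that comparing the major and minor numbers of the release string is good enough; distribution kernels often backport features, so treat the result as a heuristic rather than proof. The function names are illustrative, and the 5.5 threshold in the usage note comes from the fentry/fexit row of the kprobe comparison in this chapter.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Parses "major.minor" from a release string such as "5.15.0-91-generic".
 * Returns 1 on success, 0 if the string does not start with two numbers. */
int parse_release(const char *release, int *major, int *minor) {
    return sscanf(release, "%d.%d", major, minor) == 2;
}

/* Returns 1 if the release string is at least want_major.want_minor. */
int kernel_at_least(const char *release, int want_major, int want_minor) {
    int major = 0, minor = 0;
    if (!parse_release(release, &major, &minor))
        return 0; /* unknown format: assume the feature is missing */
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Same check against the running kernel, using uname(2). */
int running_kernel_at_least(int want_major, int want_minor) {
    struct utsname u;
    if (uname(&u) != 0)
        return 0;
    return kernel_at_least(u.release, want_major, want_minor);
}
```

For example, a loader could call `running_kernel_at_least(5, 5)` and fall back to kprobes when it returns 0.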
3. Infinite Loops in eBPF Programs
WRONG:
    // This will be rejected by the verifier
    SEC("kprobe/do_nanosleep")
    int bug_prog(struct pt_regs *ctx) {
        while (1) { }    // Infinite loop - REJECTED
        return 0;
    }
CORRECT:
    // Always have termination conditions
    SEC("kprobe/do_nanosleep")
    int good_prog(struct pt_regs *ctx) {
        int count = 0;
        for (int i = 0; i < 1000; i++) {    // Bounded loop: constant upper limit
            count++;
        }
        return 0;
    }
Why: The verifier must prove that every program terminates, so it rejects loops without a provable upper bound.
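The same idea applies when a loop walks over data of unknown length: cap the iteration count with a compile-time constant and check the bounds on every access. The sketch below is plain user-space C so that it compiles outside a kernel build; the SEC() attachment is omitted, and `MAX_SCAN` and the function name are illustrative choices, not taken from any library. Bounded loops of this shape are accepted by the verifier on kernel 5.3 and later; older kernels need the loop fully unrolled.

```c
#include <stddef.h>

#define MAX_SCAN 64 /* compile-time upper bound on iterations (illustrative) */

/* Sums at most MAX_SCAN bytes from the range [data, data_end).
 * The constant bound guarantees termination, and the length check
 * before each access keeps every read inside the buffer. */
unsigned int sum_bytes_bounded(const unsigned char *data,
                               const unsigned char *data_end) {
    size_t len = (size_t)(data_end - data);
    unsigned int sum = 0;

    for (size_t i = 0; i < MAX_SCAN; i++) {
        if (i >= len) /* explicit bounds check before touching data[i] */
            break;
        sum += data[i];
    }
    return sum;
}
```

With a 100-byte buffer the function stops after 64 bytes, and with a 3-byte buffer it stops after 3, so both the iteration count and the memory range are bounded.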
4. Not Using bpftool for Debugging
WRONG:
    # Just loading and hoping it works
    tc filter add dev eth0 handle 1 bpf obj /path/to/bpf.o
CORRECT:
    # Use bpftool to inspect and debug
    bpftool prog list
    bpftool prog show id <id>
    bpftool map list
    bpftool net show

    # Check for errors (loadall needs a bpffs path to pin the programs under)
    bpftool prog loadall /path/to/bpf.o /sys/fs/bpf/<name> type xdp
Why: bpftool helps inspect loaded programs and diagnose issues.
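When a load fails outright, there is nothing for bpftool to inspect yet, and the errno from the bpf() syscall is the first clue. The helper below is a hypothetical sketch that maps the error codes documented in the bpf(2) man page for BPF_PROG_LOAD to a first debugging step; the wording of the hints is my own, and other errno values are possible.

```c
#include <errno.h>

/* Hypothetical helper: maps an errno from a failed program load
 * to a suggested first debugging step. */
const char *bpf_load_hint(int err) {
    switch (err) {
    case EPERM:
        return "insufficient privileges: run as root or grant CAP_BPF";
    case EACCES:
        return "verifier rejected the program as unsafe: read the verifier log";
    case EINVAL:
        return "bad program or attributes: check instructions and program type";
    case E2BIG:
        return "too large: shrink the program or split it with tail calls";
    default:
        return "unexpected error: check strerror(errno) and dmesg";
    }
}
```

A loader could print `bpf_load_hint(errno)` next to the verifier log so that the most likely cause is visible at a glance.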
5. Ignoring eBPF Map Limits
WRONG:
    // Creating map without considering limits
    struct bpf_map_def SEC("maps") my_hash = {
        .type = BPF_MAP_TYPE_HASH,
        .max_entries = 10000000,    // Way too large
        .key_size = sizeof(int),
        .value_size = sizeof(struct data),
    };
CORRECT:
    // Size appropriately
    struct bpf_map_def SEC("maps") my_hash = {
        .type = BPF_MAP_TYPE_HASH,
        .max_entries = 10000,       // Reasonable for use case
        .key_size = sizeof(int),
        .value_size = sizeof(struct data),
    };
Why: Hash maps preallocate their entries by default, so an oversized max_entries wastes kernel memory and can cause map creation to fail.
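A quick back-of-the-envelope estimate makes the difference between the two definitions concrete. The sketch below assumes that key and value are each rounded up to a multiple of 8 bytes and that any per-element bookkeeping is passed in by the caller, since that overhead varies by kernel version; the function names are hypothetical and the result is a rough lower bound, not an exact figure.

```c
#include <stdint.h>

/* Rounds n up to the next multiple of 8. */
uint64_t round_up8(uint64_t n) {
    return (n + 7) & ~(uint64_t)7;
}

/* Rough lower bound on the memory used by a preallocated hash map.
 * Assumptions: key and value are padded to 8 bytes each, and
 * per_elem_overhead (bucket links, hash, etc.) is supplied by the
 * caller because it depends on the kernel version. */
uint64_t hash_map_bytes_estimate(uint64_t max_entries,
                                 uint64_t key_size,
                                 uint64_t value_size,
                                 uint64_t per_elem_overhead) {
    uint64_t per_elem = round_up8(key_size) + round_up8(value_size)
                        + per_elem_overhead;
    return max_entries * per_elem;
}
```

Assuming a 4-byte key and a 16-byte value with overhead ignored, 10,000 entries come to about 240 KB, while 10,000,000 entries come to about 240 MB reserved at map creation.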
Summary
- eBPF: Virtual machine in kernel allowing safe, programmable tracing
- Components: Verifier, JIT compiler, maps, helper functions
- Tools: bpftrace (high-level), BCC (Python/Lua), libbpf (C)
- Use Cases: Performance analysis, networking, security, observability
- XDP: Packet processing at wire speed
- Security: LSM hooks for container/runtime security
- Production: Tools like Falco, Cilium, Tetragon use eBPF
Next Chapter
Chapter 85: Linux Kernel Live Patching
Last Updated: February 2026