Chapter 84: eBPF (Extended Berkeley Packet Filter)
Overview
eBPF is a technology that allows sandboxed programs to run inside the Linux kernel without modifying kernel source code or loading kernel modules. Originally designed for packet filtering (the Berkeley Packet Filter), eBPF has evolved into a general framework for networking, tracing, security, and performance analysis. Understanding eBPF is essential for modern DevOps and SRE roles working with cloud-native infrastructure and observability.
84.1 eBPF Architecture
The eBPF ecosystem spans user space and kernel space:

- User space: front ends such as bpftrace (high-level DSL), BCC (Python/Lua bindings), libbpf (C), and Go bindings all funnel through the `bpf()` syscall.
- Kernel space, eBPF subsystem: the verifier (checks programs before they run), the JIT compiler (translates bytecode to native code), maps (shared storage), and helper functions.
- Kernel space, hook points: kprobes, fentry/fexit, tracepoints, socket filters, perf events, cgroup socket hooks, LSM hooks, and XDP.
How eBPF Works
The lifecycle of an eBPF program has four stages:

1. Write. The program is written in C (with libbpf or BCC), in the bpftrace DSL, or via Go/Python bindings. For example, counting `read()` syscalls:

```c
SEC("tracepoint/syscalls/sys_enter_read")
int count_read(void *ctx) {
    u32 key = 0;
    u64 *counter = bpf_map_lookup_elem(&counter_map, &key);
    if (counter)
        __sync_fetch_and_add(counter, 1);
    return 0;
}
```

2. Load. The program is submitted to the kernel via the `bpf()` syscall. The verifier checks that there are no unsafe instructions, no unbounded loops, that the stack stays within bounds, and that every memory access is validated. The program is then JIT-compiled to native code and attached to its hook point.

3. Execute. Whenever the hook fires, the JIT-compiled code runs. It can read kernel data through helper functions, store state in eBPF maps, and emit events to user space.

4. Read results. User space consumes the data via eBPF maps, perf buffers, or ring buffers (for event streams and sampling).

eBPF vs Kernel Modules
| Aspect | eBPF | Kernel Module |
|---|---|---|
| Safety | Verified before loading; cannot crash the kernel | Unverified; a bug can panic the kernel |
| Loading | Loaded at runtime via the `bpf()` syscall | Loaded at runtime via insmod/modprobe |
| Root access | Required (CAP_BPF/CAP_SYS_ADMIN) | Required |
| Debugging | Easier; unsafe programs are rejected at load time | Harder; failures can panic the system |
| Capabilities | Limited to approved helper functions | Full kernel access |
| Update | Hot-reload possible | Requires unload/reload |
| Lifecycle | Managed by the kernel | Manual management |
84.2 eBPF Program Types
eBPF program types fall into three broad groups.

Tracing (observability):

- kprobe/kretprobe — dynamic instrumentation of kernel function entry and return
- fentry/fexit — faster, BTF-based function entry/exit hooks (kernel 5.5+)
- tracepoint — fixed kernel trace points with a stable API
- raw_tracepoint — lower-level access to trace points
- perf_event — hardware and software performance counters
- LSM (Linux Security Module) — security hooks (SELinux/AppArmor-style enforcement)

Networking:

- xdp (eXpress Data Path) — packet processing at the NIC driver level, before sk_buff allocation
- socket_filter — filter packets at the socket level
- cgroup_skb — per-cgroup packet filtering
- cgroup_sock — socket creation/connection hooks
- sock_ops — TCP connection events
- sk_msg — socket message redirection and filtering
- flow_dissector — packet flow parsing

Other:

- bpf_iter — iterators over kernel data structures
- Syscall filtering via seccomp uses classic BPF filters; eBPF-based tools complement it rather than replace it.

84.3 eBPF Maps
Maps provide key-value storage accessible from both kernel-side eBPF programs and user space.

Common map types:

- Hash map — dynamic key→value storage
- Array map — index→value, fast and fixed-size
- Per-CPU hash / per-CPU array — one slot per CPU, so no locking is needed on the hot path
- LRU hash — evicts least recently used entries when full
- Stack / queue — LIFO/FIFO value storage
- Stack trace map — stores kernel/user stack traces
- Bloom filter — approximate membership tests
- Ring buffer — event streaming to user space
- Hash of maps / array of maps — maps whose values are other maps
- Program array — holds program references for tail calls

Access pattern: the kernel side uses helpers such as `bpf_map_lookup_elem(&map, &key)`, `bpf_map_update_elem(&map, &key, &val, BPF_ANY)`, and `bpf_map_delete_elem(&map, &key)`; user space uses the equivalent `bpf()` syscall commands or library wrappers. Operations are atomic for most map types.
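Per-CPU maps avoid lock contention by giving each CPU its own slot for every key; the cost is that user space must aggregate the slots itself when reading. A minimal user-space sketch of that aggregation (plain Python, no kernel involved; the snapshot values are invented for illustration):

```python
# Simulate reading a per-CPU hash map: each key has one value
# slot per CPU, and user space sums across the slots.
def aggregate_percpu(percpu_counts):
    """percpu_counts: {key: [value_on_cpu0, value_on_cpu1, ...]}"""
    return {key: sum(slots) for key, slots in percpu_counts.items()}

# Hypothetical snapshot: syscall counts per PID, one slot per CPU
snapshot = {
    1234: [10, 7, 0, 3],   # PID 1234 ran on CPUs 0, 1 and 3
    5678: [0, 0, 42, 0],   # PID 5678 pinned to CPU 2
}
totals = aggregate_percpu(snapshot)
print(totals)  # {1234: 20, 5678: 42}
```

This mirrors what libraries like BCC do automatically when reading `BPF_PERCPU_HASH` maps: the kernel never merges the slots, so the merge happens at read time.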
84.4 bpftrace - High-Level Tracing
Installation and Setup
Section titled “Installation and Setup”# ============================================================# INSTALLING bpftrace# ============================================================
# Ubuntu/Debiansudo apt install bpftrace
# RHEL/CentOSsudo yum install bpftrace
# Arch Linuxsudo pacman -S bpftrace
# Verify installationbpftrace --versionbpftrace -l | head -20
# Check available probesbpftrace -l # List all probesbpftrace -l '*open*' # Filter probesbpftrace -l 'kernel:*' # Kernel probes onlybpftrace -l 'usdt:*' # USDT probesbpftrace Language
The general structure of a bpftrace program:

```
probe[,probe]
/ filter /
{
  action;
  action;
}
```

Probe types:

| Type | Example |
|---|---|
| kprobe | `kprobe:do_nanosleep` |
| kretprobe | `kretprobe:do_nanosleep` |
| tracepoint | `tracepoint:syscalls:sys_enter_read` |
| software/hardware event | `software:page-faults:1` |
| interval | `interval:s:1` |
| profile | `profile:hz:99` |
| USDT | `usdt:/usr/bin/node:node:gc__start` |
| watchpoint | `watchpoint:0x5657800:8:rw` |

Variables:

- `@name` — map; global, persists, and is printed automatically at exit
- `@name[key]` — map with one or more keys
- `$name` — scratch (local) variable
- `$1`, `$2`, … — positional parameters

Builtins: `pid`, `tid`, `uid`, `gid`, `nsecs`, `cpu`, `comm`, `curtask`, `args`, `retval`, `func`, `probe`, `kstack`, `ustack`, `str()`, `buf()`

Functions: `printf()`, `print()`, `join()`, `hist()`, `lhist()`, `count()`, `sum()`, `avg()`, `min()`, `max()`, `delete()`, `clear()`, `zero()`, `time()`

Practical bpftrace Examples
Section titled “Practical bpftrace Examples”# ============================================================# BASIC TRACING EXAMPLES# ============================================================
# 1. Trace all open() syscallsbpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename));}'
# 2. Trace file opens with process namebpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%d %s: %s\n", pid, comm, str(args->filename));}'
# 3. Count syscalls by processbpftrace -e 'tracepoint:syscalls:sys_enter_* /comm != "bpftrace"/ { @[comm] = count();}'
# 4. Measure syscall latencybpftrace -e 'tracepoint:syscalls:sys_enter_read /pid == $1/ { @start[pid] = nsecs;}tracepoint:syscalls:sys_exit_read /@start[pid]/ { @latency = hist(nsecs - @start[pid]); delete(@start[pid]);}'
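The enter/exit pattern in example 4 — stash a timestamp keyed by thread at entry, compute the delta at exit, and delete the entry — is the standard way latency is measured in eBPF. A user-space Python sketch of the same bookkeeping (the event stream and timestamps are invented for illustration):

```python
# Latency measurement via a keyed start-timestamp map,
# mirroring @start[tid] = nsecs at enter and the delta at exit.
start = {}          # stands in for the @start BPF map
latencies_ns = []

def on_enter(tid, now_ns):
    start[tid] = now_ns

def on_exit(tid, now_ns):
    t0 = start.pop(tid, None)   # ignore exits with no matching enter
    if t0 is not None:
        latencies_ns.append(now_ns - t0)

# Simulated event stream: (event, tid, timestamp_ns)
for ev, tid, ts in [("enter", 1, 100), ("enter", 2, 150),
                    ("exit", 1, 400), ("exit", 2, 1150),
                    ("exit", 3, 999)]:   # tid 3 never entered
    on_enter(tid, ts) if ev == "enter" else on_exit(tid, ts)

print(latencies_ns)  # [300, 1000]
```

Keying by thread ID rather than PID matters for the same reason here as in the bpftrace one-liner: two threads of one process can be inside `read()` at the same time.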
```bash
# 5. Trace new outgoing TCP connections by process
bpftrace -e 'kprobe:tcp_connect {
  printf("PID: %d COMM: %s\n", pid, comm);
}'

# 6. Count page allocations by process
bpftrace -e 'tracepoint:kmem:mm_page_alloc {
  @pages[comm] = count();
}'

# 7. Block I/O latency histogram
bpftrace -e 'tracepoint:block:block_rq_issue {
  @start[args->dev, args->sector] = nsecs;
}
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
  delete(@start[args->dev, args->sector]);
}'

# 8. Count context switches by outgoing process
bpftrace -e 'tracepoint:sched:sched_switch {
  @[args->prev_comm] = count();
}'

# 9. Count page faults by process
bpftrace -e 'software:page-faults:1 {
  @faults[comm] = count();
}'
```

Advanced bpftrace Scripts
```bash
# ============================================================
# ADVANCED bpftrace SCRIPTS
# ============================================================

#!/usr/bin/env bpftrace
# biolatency.bt - block I/O latency histogram

BEGIN
{
  printf("Tracing block I/O latency... Hit Ctrl-C to end.\n");
}

# Key the start timestamp by request pointer (arg0).
# Note: on newer kernels this symbol is __blk_account_io_start.
kprobe:blk_account_io_start
{
  @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
  @latency_us = hist((nsecs - @start[arg0]) / 1000);
  delete(@start[arg0]);
}

END
{
  printf("\nBlock I/O latency (microseconds):\n");
  print(@latency_us);
  clear(@latency_us);
  clear(@start);
}
```
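bpftrace's `hist()` places values into power-of-two buckets, which is why its output rows read like `[64, 128)`. A plain-Python sketch of that binning (bucket boundaries follow the power-of-two scheme; the sample values are invented):

```python
# Power-of-two histogram binning, as hist() does in-kernel.
from collections import Counter

def log2_bucket(value):
    """Return the bucket (lo, hi) = [2^k, 2^(k+1) - 1] containing value."""
    if value <= 0:
        return (0, 0)
    k = value.bit_length() - 1        # floor(log2(value))
    return (1 << k, (1 << (k + 1)) - 1)

def hist(values):
    return Counter(log2_bucket(v) for v in values)

h = hist([1, 3, 3, 7, 120, 130])
# e.g. 120 falls into the (64, 127) bucket, 130 into (128, 255)
```

Binning in the kernel keeps the map small and fixed-size: only one counter per bucket crosses to user space, no matter how many events fired.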
```bash
# ============================================================
# fileslower.bt - print read()/write() calls slower than 10 ms
# ============================================================

#!/usr/bin/env bpftrace

tracepoint:syscalls:sys_enter_read,
tracepoint:syscalls:sys_enter_write
/args->fd >= 0/
{
  @fd[tid] = args->fd;
  @start_time[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_read,
tracepoint:syscalls:sys_exit_write
/@start_time[tid]/
{
  $lat_us = (nsecs - @start_time[tid]) / 1000;
  if ($lat_us > 10000) {
    printf("%-16s fd=%d (%s) %d us\n", comm, @fd[tid],
           probe == "tracepoint:syscalls:sys_exit_read" ? "R" : "W",
           $lat_us);
  }
  delete(@fd[tid]);
  delete(@start_time[tid]);
}
```
```bash
# ============================================================
# offcputime.bt - total off-CPU time per process (simplified)
# ============================================================

#!/usr/bin/env bpftrace

#include <linux/sched.h>

# Probe name may vary by kernel (e.g. finish_task_switch.isra.0).
kprobe:finish_task_switch
{
  // the task that just switched OFF the CPU
  $prev = (struct task_struct *)arg0;
  @start[$prev->pid] = nsecs;

  // the current task just switched back ON: account its off-CPU time
  if (@start[tid]) {
    @total_ns[comm] = sum(nsecs - @start[tid]);
    delete(@start[tid]);
  }
}

END
{
  printf("\nOff-CPU time by process (ns):\n");
  print(@total_ns);
  clear(@total_ns);
  clear(@start);
}
```

84.5 BCC (BPF Compiler Collection)
BCC Overview
BCC provides Python/Lua bindings for eBPF together with many pre-built tools for common observability tasks.

I/O analysis:

- biosnoop — trace block I/O per request
- biotop — top-like block I/O by process
- biostacks — block I/O with initiating kernel stacks
- opensnoop — trace open() syscalls
- filelife — track short-lived files
- fileslower — trace slow file I/O
- filetop — top-like file I/O by file

Network analysis:

- tcplife — TCP connection lifecycle
- tcpretrans — TCP retransmissions
- tcpconnect — outgoing TCP connections
- tcpaccept — incoming TCP connections
- tcptop — top-like TCP throughput by connection

Latency analysis:

- offcputime — off-CPU (blocked) time with stacks
- offwaketime — off-CPU time with wakeup stacks
- runqlat — scheduler run queue latency
- runqlen — run queue length
- biolatency — block I/O latency histograms
- ext4slower — slow ext4 operations
- nfsslower — slow NFS operations

Memory analysis:

- memleak — outstanding-allocation (leak) detector
- oomkill — OOM killer events
- cachestat — page cache hit/miss statistics
- drsnoop — direct reclaim events

Syscalls and functions:

- execsnoop — trace new process execution
- syscount — system call counts
- funclatency — function latency histograms
- llcstat — last-level cache hit/miss statistics

Installing and Using BCC
Section titled “Installing and Using BCC”# ============================================================# INSTALLING BCC# ============================================================
# Ubuntusudo apt install bpfcc-tools linux-headers-$(uname -r)
# RHEL/CentOSsudo yum install bcc-tools kernel-devel-$(uname -r)
# Verifyls /usr/share/bcc/ # Exampleswhich funclatencyman funclatency
# ============================================================# COMMON BCC TOOLS EXAMPLES# ============================================================
# Trace file readssudo /usr/share/bcc/examples/io/readsnoop
# Trace TCP connectionssudo /usr/share/bcc/examples/networking/tcpconnect
# Block I/O latencysudo /usr/share/bcc/examples/io/biolatency
# Memory leak detectionsudo /usr/share/bcc/examples/memleak/memleak -p $(pgrep -f myapp)
# System call frequencysudo /usr/share/bcc/examples/syscount/syscount
# Scheduler latencysudo /usr/share/bcc/examples/sched/runnlatencyWriting BCC Programs (Python)
Section titled “Writing BCC Programs (Python)”#!/usr/bin/env python3# Example: BCC Python program for tracing syscalls
from bcc import BPF
program = """#include <linux/sched.h>
// Count syscalls by processBPF_HASH(counts, u32, u64);
int trace_sys_enter(struct pt_regs *ctx) { u32 pid = bpf_get_current_pid_tgid(); u64 *p = counts.lookup_or_init(&pid, &empty); (*p)++; return 0;}"""
bpf = BPF(text=program)syscall = bpf.get_syscall_fnname("read")bpf.attach_kprobe(event=syscall, fn_name="trace_sys_enter")
print("Tracing syscalls... Press Ctrl+C to exit")
while True: try: sleep(2) except KeyboardInterrupt: pass
print("\n%-20s %-10s" % ("PID", "COUNT")) print("%-20s %-10s" % ("---", "-----"))
for pid, count in sorted(bpf["counts"].items(), key=lambda kv: -kv[1].value)[:10]: print("%-20d %-10d" % (pid.value, count.value))
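`bpf_get_current_pid_tgid()` packs two numbers into one u64: the TGID (the user-visible "PID") in the upper 32 bits and the kernel thread ID in the lower 32, which is why BCC programs typically shift the value right by 32 to get the PID. A quick Python sketch of the packing (the numeric values are invented):

```python
# bpf_get_current_pid_tgid() returns (tgid << 32) | tid.
def split_pid_tgid(v):
    tgid = v >> 32          # user-visible "PID" (process)
    tid = v & 0xFFFFFFFF    # kernel task (thread) id
    return tgid, tid

packed = (4242 << 32) | 4250   # thread 4250 of process 4242
print(split_pid_tgid(packed))  # (4242, 4250)
```

Forgetting the shift is a classic bug: the lower half counts per thread, so a multithreaded process shows up as many separate entries.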
84.6 XDP (eXpress Data Path)
XDP Overview
XDP lets eBPF programs process packets at the earliest possible point: in the NIC driver, right after receipt and before the kernel allocates an sk_buff. Because it skips that allocation and most of the stack, XDP can process packets many times faster than the regular path.

Pipeline: NIC → driver → XDP program → sk_buff allocation → network stack.

XDP actions (the program's return value):

- `XDP_DROP` — drop the packet (DDoS mitigation, firewalling)
- `XDP_PASS` — hand the packet to the normal kernel network stack
- `XDP_TX` — bounce the packet back out the same interface
- `XDP_REDIRECT` — redirect to another interface, CPU, or AF_XDP socket
- `XDP_ABORTED` — error drop; raises the `xdp:xdp_exception` tracepoint

Use cases: DDoS mitigation, load balancing, packet filtering/firewalling, traffic steering, and network telemetry.

XDP Example
Section titled “XDP Example”// xdp_drop_browsers.c - Drop HTTP traffic from browsers#include <linux/bpf.h>#include <bpf/bpf_helpers.h>#include <linux/if_ether.h>#include <linux/ip.h>#include <linux/tcp.h>
static inline int parse_ipv4(void *data, __u64 off, void *data_end) { struct iphdr *iph = data + off; if (iph + 1 > data_end) return 0; return iph->protocol;}
SEC("xdp_drop_browsers")int xdp_drop_browsers(struct xdp_md *ctx) { void *data = (void *)(long)ctx->data; void *data_end = (void *)(long)ctx->data_end; struct ethhdr *eth = data;
if (data + sizeof(*eth) > data_end) return XDP_PASS;
if (eth->h_proto != __constant_htons(ETH_P_IP)) return XDP_PASS;
__u64 ip_off = sizeof(*eth); __u8 protocol = parse_ipv4(data, ip_off, data_end);
if (protocol == IPPROTO_TCP) { struct tcphdr *tcp = data + ip_off + sizeof(struct iphdr); if (tcp + 1 <= data_end) { // Drop traffic to HTTP (port 80) or HTTPS (443) if (ntohs(tcp->dest) == 80 || ntohs(tcp->dest) == 443) { return XDP_DROP; } } }
return XDP_PASS;}
char _license[] SEC("license") = "GPL";Loading and Using XDP
Section titled “Loading and Using XDP”# ============================================================# LOADING XDP PROGRAMS# ============================================================
# Using ip command (kernel 5.18+)ip link set dev eth0 xdp obj xdp_drop_browsers.o sec xdp_drop_browsers
# Check XDP statusip link show eth0
# Remove XDP programip link set dev eth0 xdp off
# Using iproute2 (older kernels)ip link set dev eth0 xdpgeneric obj xdp_drop_browsers.o sec xdp_drop_browsers
# Using perfperf record -a -g -e xdp:xdp_exception &# Triggers when XDP drops/redirects packets
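XDP's DDoS-mitigation use case usually pairs the verdict with per-source rate limiting held in a map. A user-space Python sketch of a token-bucket verdict function (the rate and burst values are arbitrary; a real XDP program would keep `buckets` in an LRU hash map keyed by source address and use kernel time helpers):

```python
# Token-bucket rate limiting per source IP, as an XDP program
# might implement it with an LRU hash map keyed by saddr.
RATE = 10.0      # tokens refilled per second
BURST = 20.0     # bucket capacity (max burst size)

buckets = {}     # saddr -> (tokens, last_seen_ts)

def xdp_verdict(saddr, now):
    tokens, last = buckets.get(saddr, (BURST, now))
    tokens = min(BURST, tokens + (now - last) * RATE)  # refill
    if tokens < 1.0:
        buckets[saddr] = (tokens, now)
        return "XDP_DROP"            # over the rate limit
    buckets[saddr] = (tokens - 1.0, now)
    return "XDP_PASS"

# 30 back-to-back packets from one source: the first 20 pass
# (burst allowance), the remaining 10 are dropped.
verdicts = [xdp_verdict("198.51.100.7", 0.0) for _ in range(30)]
print(verdicts.count("XDP_PASS"), verdicts.count("XDP_DROP"))  # 20 10
```

The same shape works in-kernel because every step is just map lookups and arithmetic — no blocking calls, no unbounded loops — so it passes the verifier.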
```bash
# Inspect attached XDP programs and (driver-dependent) XDP counters
sudo bpftool net show
ethtool -S eth0 | grep -i xdp
```

84.7 Security with eBPF
LSM (Linux Security Module) Hooks
eBPF can be used for security monitoring and enforcement via Linux Security Module (LSM) hooks (BPF LSM, kernel 5.7+). Representative hooks include:

- `bpf_lsm_task_alloc` — task allocation
- `bpf_lsm_file_permission` — file permission checks
- `bpf_lsm_file_ioctl` — file ioctl operations
- `bpf_lsm_file_lock` — file locking
- `bpf_lsm_bprm_check_security` — binary execution
- `bpf_lsm_sb_mount` / `bpf_lsm_sb_umount` — mount and unmount operations
- `bpf_lsm_inode_permission` — inode permission checks
- `bpf_lsm_inode_setattr` — inode attribute changes
- `bpf_lsm_inode_create` / `bpf_lsm_inode_link` / `bpf_lsm_inode_unlink` — file creation, hard links, unlink
- `bpf_lsm_path_truncate` / `bpf_lsm_path_mkdir` / `bpf_lsm_path_rmdir` / `bpf_lsm_path_rename` — path operations
- `bpf_lsm_path_chmod` / `bpf_lsm_path_chown` — permission and ownership changes
- `bpf_lsm_socket_create` / `bpf_lsm_socket_accept` / `bpf_lsm_socket_connect` — socket lifecycle

Security tools built on eBPF:

- Falco — container runtime security
- Tetragon — security observability and runtime enforcement
- Tracee — runtime security and forensics
- Cilium — network security (API-aware)
- Pixie — observability platform

Container Security with eBPF
Section titled “Container Security with eBPF”# ============================================================# Falco - Container Security with eBPF# ============================================================
# Install Falcosudo apt install falco# Or: sudo docker run -d --name falco -v /var/run/docker.sock:/var/run/docker.sock \# -v /dev:/dev --cap-add SYS_PTRACE falcosecurity/falco
# Start Falcosudo falco
# Example Falco rules for container securitycat > /etc/falco/falco_rules.yaml << 'EOF'- rule: Detect shell in container desc: A shell was spawned in a container condition: container.id != host and proc.name = bash output: "Shell detected in container (user=%user.name container=%container.id shell=%proc.name)" priority: WARNING
- rule: Sensitive file access desc: Access to sensitive files condition: container.id != host and (fd.name=/etc/shadow or fd.name=/etc/passwd) output: "Sensitive file accessed (file=%fd.name user=%user.name)" priority: CRITICAL
- rule: Network connection outside container desc: Container making network connections condition: container.id != host and evt.type=connect output: "Network connection from container (container=%container.id target=%fd.name)" priority: NOTICEEOF84.8 Production Use Cases
Performance Troubleshooting
Section titled “Performance Troubleshooting”#!/bin/bash# Complete performance analysis using eBPF
echo "=== System-wide eBPF Performance Analysis ==="
echo -e "\n--- 1. CPU Analysis (runqlat) ---"# Run queue latencysudo /usr/share/bcc/examples/sched/runqlat 2>/dev/null || echo "BCC not available"
echo -e "\n--- 2. Block I/O (biolatency) ---"sudo /usr/share/bcc/examples/io/biolatency 2>/dev/null || echo "BCC not available"
echo -e "\n--- 3. Memory Leaks (memleak) ---"# Quick 10-second checksudo /usr/share/bcc/examples/memleak/memleak -d 10 2>/dev/null || echo "BCC not available"
echo -e "\n--- 4. Network Analysis (tcpretrans) ---"sudo /usr/share/bcc/examples/networking/tcpretrans 2>/dev/null || echo "BCC not available"
echo -e "\n--- 5. Syscall Frequency (syscount) ---"sudo /usr/share/bcc/examples/syscount/syscount -p 5 2>/dev/null || echo "BCC not available"Building Custom eBPF Tools
Section titled “Building Custom eBPF Tools”#!/bin/bash# Building eBPF tools from scratch
# Prerequisitessudo apt install clang llvm libbpf-dev linux-headers-$(uname -r)
# Directory structuremkdir -p ~/ebpf/{src,include,scripts}cd ~/ebpf
# Example bpftrace toolcat > scripts/tracesnoop.bt << 'EOF'#!/usr/bin/env bpftrace
BEGIN{ printf("Tracing open() syscalls. Ctrl-C to end.\n");}
tracepoint:syscalls:sys_enter_open,tracepoint:syscalls:sys_enter_openat{ printf("%-6d %-16s %s\n", pid, comm, str(args->filename));}EOF
# Run itchmod +x scripts/tracesnoop.btsudo bpftrace scripts/tracesnoop.bt84.9 Interview Questions
Section titled “84.9 Interview Questions”

Q1: What is eBPF and how does it differ from the original BPF?

A1:
- Original BPF (cBPF): designed for packet filtering in BSD/Linux
  - Small instruction set and only 2 registers
  - Used for tcpdump-style packet filtering
- eBPF (extended BPF): a general-purpose virtual machine in the kernel
  - 64-bit architecture (vs 32-bit cBPF), 11 registers (vs 2)
  - Much larger instruction set, plus access to kernel data via helper functions
  - Used for tracing, networking, security, and observability
  - JIT-compiled for near-native performance

Q2: Explain the eBPF verification process.

A2: Before an eBPF program can run, it goes through verification:
1. Load: the program is submitted via the bpf() syscall
2. Check: the verifier confirms all instructions are valid
3. Bounds: every memory access is bounds-checked
4. Loop detection: no unbounded loops are allowed
5. Stack: stack usage must stay within bounds (max 512 bytes)
6. Function calls: only approved helper functions may be called
7. JIT: after verification, the JIT compiles the program to native code

If any check fails, the program is rejected, which prevents kernel crashes.

Q3: What are eBPF maps and what are they used for?

A3: eBPF maps are key-value data structures shared between kernel and user space:
- Allow state to persist across program invocations
- Enable communication between eBPF programs and user space
- Types: hash, array, per-CPU hash, LRU hash, Bloom filter, ring buffer, stack trace
- Accessed via helpers: bpf_map_lookup_elem, bpf_map_update_elem

Q4: What is XDP and why is it important?

A4: XDP (eXpress Data Path):
- Processes packets at the earliest point in the Linux networking stack
- Runs right after the NIC receives a packet, before sk_buff allocation
- Benefits:
  - Up to 10-100x faster than traditional packet processing
  - Lower latency, better DDoS mitigation
  - Can drop, pass, redirect, or retransmit packets
- Use cases: load balancing, firewalls, DDoS mitigation, telemetry

Q5: How would you trace a performance issue using eBPF?

A5:
Step 1: Identify the subsystem (CPU, memory, I/O, network)
Step 2: Use an appropriate eBPF tool:
  - CPU: offcputime, runqlat, profile
  - Memory: memleak, page-fault tracing
  - I/O: biolatency, biosnoop, fileslower
  - Network: tcplife, tcpretrans, tcpconnect
Step 3: Drill down with more specific tracing
Step 4: Use stack traces to identify the root cause
Step 5: Compare with baseline measurements

Q6: What are the differences between kprobe and fentry?

A6:

| Aspect         | kprobe/kretprobe              | fentry/fexit                |
|----------------|-------------------------------|-----------------------------|
| API            | Legacy                        | Newer (kernel 5.5+)         |
| Performance    | Slower                        | Faster                      |
| Kernel impact  | More intrusive                | Less intrusive              |
| Return access  | kretprobe only                | fexit sees the return value |
| Arguments      | Raw pt_regs, handled manually | Typed, clean interface      |
| Reliability    | Can miss events               | More reliable               |

Q7: What is bpftrace and when would you use it vs BCC?

A7:
bpftrace:
- High-level DSL for one-liners and short scripts
- Fast prototyping and quick analysis
- Good for ad-hoc debugging
- Less control over optimization

BCC:
- Full Python/Lua bindings
- Suited to more complex programs and production tools
- Better performance for long-running programs
- More control over eBPF program behavior

Use bpftrace for quick debugging, BCC for production monitoring.

Q8: How does eBPF contribute to container security?

A8:
- Runtime security monitoring via LSM hooks
- Detects suspicious activity inside containers
- Examples:
  - Falco: container security (file access, syscalls, network)
  - Tetragon: security observability and runtime enforcement
  - Cilium: network policies, API-aware security
- Can enforce policies at the kernel level (harder to bypass)
- Enables zero-trust networking within clusters

Q9: What are the limitations of eBPF?

A9:
- Cannot make blocking calls (only asynchronous helpers)
- Limited stack size (512 bytes)
- No unbounded loops (iteration counts must be provably finite)
- Cannot access arbitrary kernel memory; must use approved helper functions
- Requires root (or CAP_BPF on newer kernels) to load programs
- Kernel compatibility issues between versions
- Steep learning curve for debugging eBPF programs

Q10: Explain how you would implement DDoS mitigation using eBPF.

A10:
1. Use XDP to catch packets at the earliest point
2. Track connection state in eBPF maps
3. Implement rate limiting:
   - Per-IP counters in a hash map
   - Time windows (sliding window or token bucket)
   - Drop packets that exceed the threshold
4. Return XDP_DROP for malicious traffic
5. Use XDP_REDIRECT to send legitimate traffic on for processing
6. Log blocked attempts for analysis
7. Update rules dynamically based on threat intelligence

Q11: What is the difference between perf and eBPF?

A11:
perf:
- Hardware performance counters
- Sampling-based profiling
- Low overhead for continuous monitoring
- Native kernel tool

eBPF:
- Can instrument every event (not just samples)
- More powerful; can aggregate data in-kernel
- Lower overhead for high-frequency events
- More complex to write

They are often used together: perf for an overview, eBPF for details.

Q12: How would you debug an eBPF program that's not working?

A12:
1. Check the verifier output: the bpf() syscall returns detailed errors
2. Use bpftool to inspect loaded programs:
   bpftool prog show
   bpftool prog dump xlated id <ID>
3. Enable verbose output: bpftrace -v script.bt
4. Check kernel logs: dmesg | tail
5. Use bpftool perf show to see programs attached to perf events
6. Verify hook attachment: ip link show (for XDP)
7. Check map contents: bpftool map show, bpftool map dump
8. Test with a minimal program first

Quick Reference
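The per-IP token-bucket logic from Q10 above can be prototyped in user-space Python before committing to an eBPF implementation. This is a minimal sketch, not the in-kernel version: the dict stands in for an eBPF hash map keyed by source IP, and `RATE`/`BURST` are illustrative values, not tuned recommendations.

```python
import time

# Illustrative parameters (assumptions for the sketch)
RATE = 100.0   # tokens (packets) refilled per second, per source IP
BURST = 200.0  # bucket capacity: maximum burst per source IP

# Stand-in for an eBPF hash map: src_ip -> (tokens, last_refill_timestamp)
buckets = {}

def allow_packet(src_ip, now=None):
    """Token-bucket check: True means pass, False means drop (XDP_DROP)."""
    now = time.monotonic() if now is None else now
    tokens, last = buckets.get(src_ip, (BURST, now))
    # Refill proportionally to elapsed time, capped at the burst size
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens >= 1.0:
        buckets[src_ip] = (tokens - 1.0, now)
        return True
    buckets[src_ip] = (tokens, now)
    return False
```

An XDP program would apply the same arithmetic using bpf_ktime_get_ns() for timestamps and bpf_map_lookup_elem/bpf_map_update_elem for the per-IP state, returning XDP_DROP or XDP_PASS based on the result.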
Section titled “Quick Reference”

# bpftrace Commands
bpftrace -l                        # List probes
bpftrace -l '*open*'               # Filter probes
bpftrace -e 'probe { action }'     # One-liner
bpftrace script.bt                 # Run script
bpftrace --dry-run script.bt       # Parse without running
# BCC Tools (paths vary by distribution; Debian/Ubuntu install them under /usr/share/bcc/tools)
/usr/share/bcc/tools/opensnoop     # Trace open() calls
/usr/share/bcc/tools/biolatency    # Block I/O latency histogram
/usr/share/bcc/tools/tcplife       # TCP session lifespans
/usr/share/bcc/tools/memleak       # Memory leak detection
# System Commands
ip link set dev eth0 xdp obj prog.o sec xdp   # Load XDP program
ip link show eth0                             # Check XDP status
bpftool prog show                             # List loaded programs
bpftool map show                              # List maps
cat /sys/kernel/debug/tracing/trace_pipe      # View trace output

Summary
Section titled “Summary”
- eBPF: Virtual machine in the kernel allowing safe, programmable tracing
- Components: Verifier, JIT compiler, maps, helper functions
- Tools: bpftrace (high-level), BCC (Python/Lua), libbpf (C)
- Use Cases: Performance analysis, networking, security, observability
- XDP: Packet processing at wire speed
- Security: LSM hooks for container/runtime security
- Production: Tools like Falco, Cilium, Tetragon use eBPF
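One detail of the map component above worth internalizing: per-CPU map types avoid lock contention by giving each CPU its own slot, which means user space must aggregate the slots when reading (bpftool map dump shows one value per CPU for these maps). A hypothetical pure-Python model of that read path, with `NUM_CPUS` as an assumed value:

```python
NUM_CPUS = 4  # assumption for the sketch

def make_percpu_counter():
    # Model of a BPF_MAP_TYPE_PERCPU_ARRAY slot: one counter per CPU
    return [0] * NUM_CPUS

def increment(counter, cpu):
    # In-kernel side: lock-free, since each CPU only touches its own slot
    counter[cpu] += 1

def read_total(counter):
    # User-space side: the true total is the sum across all per-CPU slots
    return sum(counter)
```

The trade-off is classic eBPF design: updates are cheap and contention-free in the hot path, and the (rarer, slower) user-space reader pays the cost of aggregation.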
Next Chapter
Section titled “Next Chapter”
Chapter 85: Linux Kernel Live Patching
Last Updated: February 2026