
Profiling Tools

Profiling is the process of measuring program performance to identify bottlenecks and optimization opportunities. This chapter covers essential profiling tools for C++ applications.

Profiling helps you understand:

  • Hot spots: Where your program spends most of its time
  • Memory usage: How much memory is being used and where
  • Call patterns: Function call frequencies and relationships
  • Cache performance: Cache hit/miss rates
// DON'T guess - profile!
// Common mistakes:
// 1. Optimizing without measurement
// 2. Focusing on the wrong bottlenecks
// 3. Making code less readable without benefit

// Example: which of these dominates the runtime?
int sum1 = 0;
for (int i = 0; i < n; i++) {
    sum1 += i * i;              // O(n)
}
// vs
int sum2 = 0;
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        sum2 += j;              // O(n^2) - the profiler will point here
    }
}
// Don't guess - the profile will tell you!

gprof is the classic Unix profiling tool, shipped with GNU binutils and driven by GCC's -pg flag.

Terminal window
# 1. Compile with profiling flag
g++ -pg -g -O2 -o myprogram myprogram.cpp utils.cpp
# 2. Run your program (creates gmon.out)
./myprogram
# 3. Analyze the profile
gprof myprogram gmon.out > profile.txt
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
25.00      0.50      0.25                             main [1]
20.00      0.90      0.20      100     2.00     5.00  process_data() [2]
15.00      1.20      0.15       50     3.00     3.00  calculate() [3]
10.00      1.40      0.10     1000     0.10     0.10  helper() [4]
Field                 Meaning
% time                Percentage of total execution time
cumulative seconds    Total time, including called functions
self seconds          Time spent in this function only
calls                 Number of times the function was called
ms/call               Average milliseconds per call
// gprof limitations:
// 1. Sampling granularity: very cheap or rarely-called functions may be missed
// 2. Can't attribute time to inlined functions accurately
// 3. Poor support for multi-threaded programs
// 4. Doesn't see time spent in shared libraries or system calls
// Better alternatives:
- Valgrind (Callgrind)
- perf (Linux)
- Intel VTune
- Google Performance Tools (gperftools)

Valgrind is a comprehensive memory debugging and profiling framework.

Terminal window
# Ubuntu/Debian
sudo apt-get install valgrind
# macOS (support on recent macOS versions is limited)
brew install valgrind
# Fedora/RHEL
sudo dnf install valgrind
Terminal window
# Basic memory leak check
valgrind --leak-check=full ./myprogram
# Show detailed leak information
valgrind --leak-check=full --show-leak-kinds=all ./myprogram
# Track origin of uninitialized values
valgrind --track-origins=yes ./myprogram
// 1. Invalid write
int* a = new int[10];
a[10] = 5;                // out of bounds - one past the end!

// 2. Use after free
int* b = new int;
delete b;
*b = 5;                   // use after free!

// 3. Memory leak
void leak() {
    int* p = new int[100];
    // missing delete[]!
}

// 4. Mismatched allocation/deallocation
int* c = new int[10];
free(c);                  // wrong! use delete[]
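All four errors come from manual new/delete. As a sketch of the RAII alternative, std::vector and std::unique_ptr make ownership automatic, so Valgrind's leak report comes back clean:

```cpp
#include <memory>
#include <vector>

// RAII versions of the buggy patterns above: deallocation happens
// automatically when the owner goes out of scope.
void no_leak() {
    std::vector<int> buf(100);  // replaces new int[100]; freed on scope exit
    buf[9] = 5;                 // buf.at(10) would throw instead of corrupting memory
}

std::unique_ptr<int> make_value() {
    auto p = std::make_unique<int>(42);  // replaces new int; never needs delete
    return p;                            // ownership transfers to the caller
}
```

There is no delete to forget, no delete[]/free mismatch to make, and no dangling pointer to write through.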
Terminal window
# Generate call graph
valgrind --tool=callgrind ./myprogram
# This creates callgrind.out.<pid>
# View with KCachegrind:
kcachegrind callgrind.out.<pid>
Terminal window
# Cache simulation
valgrind --tool=cachegrind ./myprogram
# Heap profiler
valgrind --tool=massif ./myprogram
# Thread debugger
valgrind --tool=helgrind ./myprogram

perf is a powerful Linux profiler built on CPU hardware performance counters.

Terminal window
# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-generic
# Verify
perf --version
Terminal window
# Record CPU profile
perf record -g ./myprogram
# View results (interactive)
perf report
# Live monitoring
perf top
# Show with source annotation
perf annotate
Terminal window
# CPU cycles (default)
perf record ./myprogram
# Cache events
perf record -e cache-references -e cache-misses ./myprogram
# Branch prediction
perf record -e branch-instructions -e branch-misses ./myprogram
# Context switches
perf record -e context-switches ./myprogram
Terminal window
# Overall statistics
perf stat ./myprogram
# Run multiple times
perf stat -r 5 ./myprogram
# Detailed output
perf stat -d ./myprogram

Example output:

Performance counter stats for './myprogram':

        500,000      cycles
        100,000      instructions        #  0.20  instructions per cycle
         50,000      cache-references
          5,000      cache-misses        # 10.00% of all cache refs

   1.234567 seconds time elapsed
Terminal window
# Record with stack traces
perf record -g -F 99 ./myprogram
# Generate a flame graph (requires Brendan Gregg's FlameGraph scripts on PATH)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

gperftools (Google Performance Tools) is a CPU and heap profiling suite from Google.

Terminal window
# Ubuntu/Debian
sudo apt-get install google-perftools libgoogle-perftools-dev
# From source
git clone https://github.com/gperftools/gperftools
cd gperftools
./configure && make && sudo make install
#include <gperftools/profiler.h>

void process_data(int i);  // your workload

int main() {
    ProfilerStart("cpu.prof");
    // Your code here
    for (int i = 0; i < 1000; i++) {
        process_data(i);
    }
    ProfilerStop();
    return 0;
}
Terminal window
# Compile
g++ -o myprogram myprogram.cpp -lprofiler
# Run - creates cpu.prof
./myprogram
# Analyze
google-pprof --text ./myprogram cpu.prof
google-pprof --pdf ./myprogram cpu.prof > profile.pdf
#include <gperftools/heap-profiler.h>
#include <vector>

int main() {
    HeapProfilerStart("heap.prof");
    // Code that allocates memory
    std::vector<int> v;
    for (int i = 0; i < 10000; i++) {
        v.push_back(i);
    }
    HeapProfilerDump("snapshot1");
    v.clear();
    HeapProfilerDump("snapshot2");
    HeapProfilerStop();
    return 0;
}

Manual timing with std::chrono works well for quick performance measurements directly in code.

#include <chrono>
#include <iostream>

class Timer {
public:
    void start() {
        start_time = std::chrono::steady_clock::now();  // steady_clock is monotonic, unlike high_resolution_clock
    }
    void stop() {
        end_time = std::chrono::steady_clock::now();
    }
    double elapsed_ms() const {
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(
            end_time - start_time);
        return duration.count() / 1000.0;
    }
private:
    std::chrono::steady_clock::time_point start_time;
    std::chrono::steady_clock::time_point end_time;
};

void intensive_computation();  // your workload

// Usage
int main() {
    Timer timer;
    timer.start();
    // Code to measure
    intensive_computation();
    timer.stop();
    std::cout << "Time: " << timer.elapsed_ms() << " ms" << std::endl;
    return 0;
}
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

#define BENCHMARK(code) \
    do { \
        auto start = std::chrono::steady_clock::now(); \
        code; \
        auto end = std::chrono::steady_clock::now(); \
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start); \
        std::cout << #code << ": " << us.count() << " μs" << std::endl; \
    } while (0)

int main() {
    std::vector<int> vec(1000000, 1);  // all equal, so both sorts hit their best case
    BENCHMARK(std::sort(vec.begin(), vec.end()));
    BENCHMARK(std::stable_sort(vec.begin(), vec.end()));
    return 0;
}
#include <chrono>
#include <iostream>

class ScopedTimer {
public:
    ScopedTimer(const char* name) : name_(name) {
        start_ = std::chrono::steady_clock::now();
    }
    ~ScopedTimer() {
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start_);
        std::cout << name_ << ": " << ns.count() << " ns" << std::endl;
    }
private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Usage - automatic timing on scope exit
void process_data() {
    ScopedTimer timer("process_data");
    // Function body is timed until the destructor runs
    for (int i = 0; i < 1000000; i++) {
        // do work
    }
}

Intel VTune is a commercial-grade profiler with excellent visualization.

Terminal window
# Install (Ubuntu; requires Intel's oneAPI apt repository to be configured)
sudo apt-get install intel-oneapi-vtune
# Hotspots analysis
vtune -collect hotspots ./myprogram
# Memory access analysis
vtune -collect memory-access ./myprogram
# Threading analysis
vtune -collect threading ./myprogram
# Generate report
vtune -report summary
Terminal window
# Launch VTune GUI
vtune-gui

Then:

  1. Click “New Project”
  2. Configure target application
  3. Choose analysis type (Hotspots, Memory Access, etc.)
  4. Click “Start”
Visual Studio's built-in profiler:

  1. Debug → Performance Profiler (or Alt+F2)
  2. Select profiling type:
    • CPU Usage: Find hot functions
    • Memory Usage: Detect memory issues
    • GPU Usage: For DirectX applications
  3. Click “Start”
  4. Interact with your application
  5. View results
Terminal window
VSPerfCmd /start:sample /output:results.vsp
VSPerfCmd /launch:myprogram.exe
VSPerfCmd /shutdown
Tool            Best For               Platform         Cost
gprof           Simple profiling       Unix             Free
Valgrind        Memory debugging       Cross-platform   Free
perf            Hardware counters      Linux            Free
gperftools      CPU/heap profiling     Linux/FreeBSD    Free
VTune           Advanced analysis      Linux/Windows    Commercial (free for personal use)
Visual Studio   Windows development    Windows          Commercial
  1. Profile in Release Mode: Use -O2 or -O3 flags for realistic results

  2. Use Real Data: Profile with realistic input data that matches production

  3. Multiple Runs: Run several times to account for variance

  4. Focus on Hotspots: Don’t optimize code that doesn’t matter

  5. Measure Before/After: Always measure before and after optimization

  6. Don’t Guess: Use profiling data, not intuition

  7. Consider All Factors: CPU, memory, cache, I/O

// Step 1: Identify the problem
// "My program is slow"
// Step 2: Profile to find bottleneck
// Use perf/gprof to find hot functions
// Step 3: Analyze the code
// Understand why it's slow
// Step 4: Optimize
// Apply optimization technique
// Step 5: Verify
// Profile again to confirm improvement
// Step 6: Repeat
// Find next bottleneck
  • Profiling identifies performance bottlenecks objectively
  • Use multiple tools for comprehensive analysis
  • Profile with realistic workloads in release mode
  • Measure before and after optimizations
  • Focus on the biggest gains first
  • Don’t optimize code that doesn’t matter