
Profiling Tools

Profiling is the process of measuring program performance to identify bottlenecks and optimization opportunities. This chapter covers essential profiling tools for C++ applications.

Profiling helps you understand:

  • Hot spots: Where your program spends most of its time
  • Memory usage: How much memory is being used and where
  • Call patterns: Function call frequencies and relationships
  • Cache performance: Cache hit/miss rates
// DON'T guess - profile!
// Common mistakes:
// 1. Optimizing without measurement
// 2. Focusing on the wrong bottlenecks
// 3. Making code less readable without benefit

// Example: which of these dominates the runtime?
int sum1 = 0;
for (int i = 0; i < n; i++) {
    sum1 += i * i;              // O(n)
}
// vs
int sum2 = 0;
for (int i = 0; i < n; i++) {
    for (int j = 0; j < i; j++) {
        sum2 += j;              // O(n^2) - the profiler will point here
    }
}
// Don't guess - the profile will tell you!

gprof is the classic Unix profiling tool, shipped with GNU binutils and driven by GCC's -pg flag.

Terminal window
# 1. Compile with profiling flag
g++ -pg -g -O2 -o myprogram myprogram.cpp utils.cpp
# 2. Run your program (creates gmon.out)
./myprogram
# 3. Analyze the profile
gprof myprogram gmon.out > profile.txt
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
25.00      0.50      0.25                             main [1]
20.00      0.90      0.20      100     2.00     5.00  process_data() [2]
15.00      1.20      0.15       50     3.00     3.00  calculate() [3]
10.00      1.40      0.10     1000     0.10     0.10  helper() [4]
Field                 Meaning
% time                Percentage of total execution time
cumulative seconds    Total time, including called functions
self seconds          Time spent in this function only
calls                 Number of times the function was called
ms/call               Average milliseconds per call
// gprof limitations:
// 1. Sampling granularity: very cheap or rarely-called functions may be missed
// 2. Can't attribute time to inlined functions accurately
// 3. Poor support for multi-threaded programs
// 4. Doesn't see time spent in shared libraries or system calls
// Better alternatives:
- Valgrind (Callgrind)
- perf (Linux)
- Intel VTune
- Google Performance Tools (gperftools)

Valgrind is a comprehensive memory debugging and profiling framework.

Terminal window
# Ubuntu/Debian
sudo apt-get install valgrind
# macOS (support on recent macOS versions is limited)
brew install valgrind
# Fedora/RHEL
sudo dnf install valgrind
Terminal window
# Basic memory leak check
valgrind --leak-check=full ./myprogram
# Show detailed leak information
valgrind --leak-check=full --show-leak-kinds=all ./myprogram
# Track origin of uninitialized values
valgrind --track-origins=yes ./myprogram
// 1. Invalid write
int* a = new int[10];
a[10] = 5;                // out of bounds - one past the end!

// 2. Use after free
int* b = new int;
delete b;
*b = 5;                   // use after free!

// 3. Memory leak
void leak() {
    int* p = new int[100];
    // missing delete[]!
}

// 4. Mismatched allocation/deallocation
int* c = new int[10];
free(c);                  // wrong! use delete[]
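All four errors come from manual new/delete. As a sketch of the RAII alternative, std::vector and std::unique_ptr make ownership automatic, so Valgrind's leak report comes back clean:

```cpp
#include <memory>
#include <vector>

// RAII versions of the buggy patterns above: deallocation happens
// automatically when the owner goes out of scope.
void no_leak() {
    std::vector<int> buf(100);  // replaces new int[100]; freed on scope exit
    buf[9] = 5;                 // buf.at(10) would throw instead of corrupting memory
}

std::unique_ptr<int> make_value() {
    auto p = std::make_unique<int>(42);  // replaces new int; never needs delete
    return p;                            // ownership transfers to the caller
}
```

There is no delete to forget, no delete[]/free mismatch to make, and no dangling pointer to write through.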
Terminal window
# Generate call graph
valgrind --tool=callgrind ./myprogram
# This creates callgrind.out.<pid>
# View with KCachegrind:
kcachegrind callgrind.out.<pid>
Terminal window
# Cache simulation
valgrind --tool=cachegrind ./myprogram
# Heap profiler
valgrind --tool=massif ./myprogram
# Thread debugger
valgrind --tool=helgrind ./myprogram

perf is a powerful Linux profiler built on CPU hardware performance counters.

Terminal window
# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-generic
# Verify
perf --version
Terminal window
# Record CPU profile
perf record -g ./myprogram
# View results (interactive)
perf report
# Live monitoring
perf top
# Show with source annotation
perf annotate
Terminal window
# CPU cycles (default)
perf record ./myprogram
# Cache events
perf record -e cache-references -e cache-misses ./myprogram
# Branch prediction
perf record -e branch-instructions -e branch-misses ./myprogram
# Context switches
perf record -e context-switches ./myprogram
Terminal window
# Overall statistics
perf stat ./myprogram
# Run multiple times
perf stat -r 5 ./myprogram
# Detailed output
perf stat -d ./myprogram

Example output:

Performance counter stats for './myprogram':

        500,000      cycles
        100,000      instructions        #  0.20  instructions per cycle
         50,000      cache-references
          5,000      cache-misses        # 10.00% of all cache refs

   1.234567 seconds time elapsed
Terminal window
# Record with stack traces
perf record -g -F 99 ./myprogram
# Generate a flame graph (requires Brendan Gregg's FlameGraph scripts on PATH)
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

gperftools (Google Performance Tools) is a CPU and heap profiling suite from Google.

Terminal window
# Ubuntu/Debian
sudo apt-get install google-perftools libgoogle-perftools-dev
# From source
git clone https://github.com/gperftools/gperftools
cd gperftools
./configure && make && sudo make install
#include <gperftools/profiler.h>

void process_data(int i);  // your workload

int main() {
    ProfilerStart("cpu.prof");
    // Your code here
    for (int i = 0; i < 1000; i++) {
        process_data(i);
    }
    ProfilerStop();
    return 0;
}
Terminal window
# Compile
g++ -o myprogram myprogram.cpp -lprofiler
# Run - creates cpu.prof
./myprogram
# Analyze
google-pprof --text ./myprogram cpu.prof
google-pprof --pdf ./myprogram cpu.prof > profile.pdf
#include <gperftools/heap-profiler.h>
#include <vector>

int main() {
    HeapProfilerStart("heap.prof");
    // Code that allocates memory
    std::vector<int> v;
    for (int i = 0; i < 10000; i++) {
        v.push_back(i);
    }
    HeapProfilerDump("snapshot1");
    v.clear();
    HeapProfilerDump("snapshot2");
    HeapProfilerStop();
    return 0;
}

Manual timing with std::chrono works well for quick performance measurements directly in code.

#include <chrono>
#include <iostream>

class Timer {
public:
    void start() {
        start_time = std::chrono::steady_clock::now();  // steady_clock is monotonic, unlike high_resolution_clock
    }
    void stop() {
        end_time = std::chrono::steady_clock::now();
    }
    double elapsed_ms() const {
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(
            end_time - start_time);
        return duration.count() / 1000.0;
    }
private:
    std::chrono::steady_clock::time_point start_time;
    std::chrono::steady_clock::time_point end_time;
};

void intensive_computation();  // your workload

// Usage
int main() {
    Timer timer;
    timer.start();
    // Code to measure
    intensive_computation();
    timer.stop();
    std::cout << "Time: " << timer.elapsed_ms() << " ms" << std::endl;
    return 0;
}
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

#define BENCHMARK(code) \
    do { \
        auto start = std::chrono::steady_clock::now(); \
        code; \
        auto end = std::chrono::steady_clock::now(); \
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start); \
        std::cout << #code << ": " << us.count() << " μs" << std::endl; \
    } while (0)

int main() {
    std::vector<int> vec(1000000, 1);  // all equal, so both sorts hit their best case
    BENCHMARK(std::sort(vec.begin(), vec.end()));
    BENCHMARK(std::stable_sort(vec.begin(), vec.end()));
    return 0;
}
#include <chrono>
#include <iostream>

class ScopedTimer {
public:
    ScopedTimer(const char* name) : name_(name) {
        start_ = std::chrono::steady_clock::now();
    }
    ~ScopedTimer() {
        auto end = std::chrono::steady_clock::now();
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start_);
        std::cout << name_ << ": " << ns.count() << " ns" << std::endl;
    }
private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

// Usage - automatic timing on scope exit
void process_data() {
    ScopedTimer timer("process_data");
    // Function body is timed until the destructor runs
    for (int i = 0; i < 1000000; i++) {
        // do work
    }
}

Intel VTune is a commercial-grade profiler with excellent visualization.

Terminal window
# Install (Ubuntu; requires Intel's oneAPI apt repository to be configured)
sudo apt-get install intel-oneapi-vtune
# Hotspots analysis
vtune -collect hotspots ./myprogram
# Memory access analysis
vtune -collect memory-access ./myprogram
# Threading analysis
vtune -collect threading ./myprogram
# Generate report
vtune -report summary
Terminal window
# Launch VTune GUI
vtune-gui

Then:

  1. Click “New Project”
  2. Configure target application
  3. Choose analysis type (Hotspots, Memory Access, etc.)
  4. Click “Start”
Visual Studio's built-in profiler:

  1. Debug → Performance Profiler (or Alt+F2)
  2. Select profiling type:
    • CPU Usage: Find hot functions
    • Memory Usage: Detect memory issues
    • GPU Usage: For DirectX applications
  3. Click “Start”
  4. Interact with your application
  5. View results
Terminal window
VSPerfCmd /start:sample /output:results.vsp
VSPerfCmd /launch:myprogram.exe
VSPerfCmd /shutdown
Tool            Best For               Platform         Cost
gprof           Simple profiling       Unix             Free
Valgrind        Memory debugging       Cross-platform   Free
perf            Hardware counters      Linux            Free
gperftools      CPU/heap profiling     Linux/FreeBSD    Free
VTune           Advanced analysis      Linux/Windows    Commercial (free for personal use)
Visual Studio   Windows development    Windows          Commercial
  1. Profile in Release Mode: Use -O2 or -O3 flags for realistic results

  2. Use Real Data: Profile with realistic input data that matches production

  3. Multiple Runs: Run several times to account for variance

  4. Focus on Hotspots: Don’t optimize code that doesn’t matter

  5. Measure Before/After: Always measure before and after optimization

  6. Don’t Guess: Use profiling data, not intuition

  7. Consider All Factors: CPU, memory, cache, I/O

// Step 1: Identify the problem
// "My program is slow"
// Step 2: Profile to find bottleneck
// Use perf/gprof to find hot functions
// Step 3: Analyze the code
// Understand why it's slow
// Step 4: Optimize
// Apply optimization technique
// Step 5: Verify
// Profile again to confirm improvement
// Step 6: Repeat
// Find next bottleneck
  • Profiling identifies performance bottlenecks objectively
  • Use multiple tools for comprehensive analysis
  • Profile with realistic workloads in release mode
  • Measure before and after optimizations
  • Focus on the biggest gains first
  • Don’t optimize code that doesn’t matter