# Optimization Techniques

Advanced techniques for writing high-performance C++ code. These optimizations should only be applied after profiling identifies bottlenecks.
## Golden Rules of Optimization

- Profile First: Never optimize without measurement
- Measure Again: Verify improvements
- Readability Matters: Don't sacrifice maintainability
- Trade-offs: Understand what you're trading
## Loop Optimizations

### Loop Unrolling

Reduces loop overhead by performing multiple operations per iteration:
```cpp
// Naive version
void process_all(const std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); i++) {
        process(data[i]);
    }
}

// Partial unrolling (4x)
void process_all_unrolled(const std::vector<int>& data) {
    size_t i = 0;

    // Guard: data.size() - 3 would wrap around for sizes < 4,
    // since size() returns an unsigned size_t
    if (data.size() >= 4) {
        size_t limit = data.size() - 3;
        for (; i < limit; i += 4) {
            process(data[i]);
            process(data[i + 1]);
            process(data[i + 2]);
            process(data[i + 3]);
        }
    }

    // Handle remaining elements
    for (; i < data.size(); i++) {
        process(data[i]);
    }
}
```

### Loop Invariant Code Motion

Hoist computations whose result doesn't change between iterations out of the loop:
```cpp
// Before: expensive computation repeated every iteration
for (int i = 0; i < n; i++) {
    result = expensive_function() * i;  // Bad!
}

// After: compute once
auto value = expensive_function();
for (int i = 0; i < n; i++) {
    result = value * i;  // Good!
}
```

### Loop Fusion (Merging)
Combine consecutive loops over the same range:

```cpp
// Before: two separate loops (two passes over the data)
for (int i = 0; i < n; i++) {
    a[i] = b[i] * 2;
}
for (int i = 0; i < n; i++) {
    c[i] = a[i] + 1;
}

// After: single loop (one pass, better cache reuse)
for (int i = 0; i < n; i++) {
    a[i] = b[i] * 2;
    c[i] = a[i] + 1;
}
```

### Loop Interchange
Reorder loop nesting for better cache locality:

```cpp
// Bad: column-major access of a row-major array
const int N = 1000;
static int matrix[N][N];  // static: ~4 MB is too large for the stack

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        process(matrix[j][i]);  // Strided access: poor cache behavior
    }
}

// Good: row-major access pattern
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        process(matrix[i][j]);  // Sequential access: good cache behavior
    }
}
```

### Range-based for with Reserve
```cpp
// Pre-allocate to avoid repeated reallocations
std::vector<int> results;
results.reserve(input.size());  // Important!

for (const auto& item : input) {
    results.push_back(process(item));
}
```

## Branch Prediction
### Likely/Unlikely Hints

Tell the compiler which branch outcome to expect, so it can lay out the hot path accordingly:
```cpp
// Branch hints (GCC/Clang builtins)
if (__builtin_expect(error_occurred, 0)) {  // Unlikely
    handle_error();
}

if (__builtin_expect(success, 1)) {  // Likely
    continue_processing();
}

// C++20: portable [[likely]] and [[unlikely]] attributes
if (condition) [[likely]] {
    // Expected path
}

if (error) [[unlikely]] {
    // Rare case
}
```

### Conditional Moves
Avoid branches when the compiler can use a conditional move instead:

```cpp
#include <algorithm>

// Branch-heavy
int max_branchy(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless (may be faster on some architectures; note that
// right-shifting a negative int is implementation-defined before C++20,
// and a - b can overflow for extreme inputs)
int max_branchless(int a, int b) {
    int diff = a - b;
    int sign = diff >> 31;  // All 1s if diff is negative (32-bit int)
    return a - (diff & sign);
}

// Using std::max (often compiled to a conditional move anyway)
int max_std(int a, int b) {
    return std::max(a, b);
}
```

## Copy Elision and RVO
### Return Value Optimization (RVO)

```cpp
struct Data {
    std::vector<int> large_data;

    Data() : large_data(1000000, 0) {}
};

// No copy is made: returning a prvalue is guaranteed
// copy elision since C++17 (plain RVO)
Data create_data() {
    return Data();
}

// Named RVO (NRVO): returning a named local variable. Not guaranteed
// by the standard, but performed by all major compilers.
std::vector<int> create_vector() {
    std::vector<int> v{1, 2, 3};
    return v;
}
```

### Explicit Move for Non-RVO Cases
```cpp
// RVO never applies to function parameters. Returning a by-value
// parameter implicitly moves since C++11; std::move just makes it explicit.
std::vector<int> combine(std::vector<int> a, const std::vector<int>& b) {
    a.insert(a.end(), b.begin(), b.end());
    return a;  // Implicit move; return std::move(a) is equivalent
}
```

## Memory Access Optimizations
### Structure of Arrays (SoA) vs Array of Structures (AoS)

```cpp
// Array of Structures (AoS): cache-inefficient when processing one field
struct Particle {
    double x, y, z;
    double vx, vy, vz;
    double r, g, b;
};
std::vector<Particle> particles_aos(1000000);

// Structure of Arrays (SoA): cache-efficient for per-field passes
struct Particles {
    std::vector<double> x, y, z;
    std::vector<double> vx, vy, vz;
    std::vector<double> r, g, b;
};
Particles particles;
particles.x.resize(1000000);
// Processing just the x coordinates now brings only x data into cache
```

### Memory Alignment
```cpp
#include <cstdlib>

// Ensure alignment for SIMD loads/stores
struct alignas(16) AlignedData {
    double values[4];
};

// POSIX: posix_memalign
void* aligned_ptr;
posix_memalign(&aligned_ptr, 16, size);
free(aligned_ptr);

// C++17: std::aligned_alloc
// (size must be an integral multiple of the alignment)
auto* ptr = std::aligned_alloc(16, size);
std::free(ptr);
```

### Cache Line Awareness
```cpp
#include <atomic>

// False sharing - avoid this! Two counters updated by different threads
// share one cache line, so every write invalidates the other core's copy.
struct SharedData {
    std::atomic<int> counter1;
    std::atomic<int> counter2;
};

// Better: give each counter its own cache line
struct alignas(64) PaddedCounter {
    std::atomic<int> counter;
    // alignas(64) pads the struct out to a full 64-byte cache line.
    // C++17 also provides std::hardware_destructive_interference_size
    // (<new>) as a portable substitute for the hard-coded 64.
};
```

## Inlining and Compiler Hints
### inline Keyword

```cpp
// Suggest inlining (the compiler may ignore the hint)
inline int small_function(int x) {
    return x * 2;
}

// Force inlining (GCC/Clang extension; the compiler will try hard)
__attribute__((always_inline)) inline int critical() {
    return 42;
}

// Same extension via C++11 attribute syntax (GCC/Clang, not standard C++)
[[gnu::always_inline]] inline int also_critical() {
    return 42;
}

// Never inline
__attribute__((noinline)) int dont_inline() {
    // Large function body
    return 0;
}
```

### Forced Inline in Hot Paths
```cpp
// Put small, frequently called functions in the header
class HotPath {
public:
    // Defined in-class, so implicitly inline; the compiler inlines
    // these easily
    int getValue() const { return value_; }
    void setValue(int v) { value_ = v; }

private:
    int value_;
};
```

## SIMD and Vectorization
### Using SIMD Intrinsics

```cpp
#include <immintrin.h>

// AVX2: add 8 floats at a time
void add_arrays_avx(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&result[i], _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) {  // Scalar tail for n not divisible by 8
        result[i] = a[i] + b[i];
    }
}

// SSE: add 4 floats at a time
void add_arrays_sse(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&result[i], _mm_add_ps(va, vb));
    }
    for (; i < n; i++) {  // Scalar tail
        result[i] = a[i] + b[i];
    }
}
```

### Compiler Auto-vectorization
```cpp
#include <algorithm>
#include <vector>

// The compiler can auto-vectorize this
void vectorize_me(std::vector<float>& data) {
    for (auto& x : data) {
        x = x * 2.0f + 1.0f;
    }
}

// Make the intent clearer for the compiler (and the reader)
void clearer_intent(std::vector<float>& data) {
    std::transform(data.begin(), data.end(), data.begin(),
                   [](float x) { return x * 2.0f + 1.0f; });
}
```

## constexpr and Compile-Time Computation
### constexpr Functions (C++14+)

```cpp
#include <vector>

// Computed at compile time when called in a constant expression
constexpr int factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

constexpr auto result = factorial(10);  // Evaluated at compile time!

// C++17: constexpr lambda
constexpr auto square_lambda = [](int x) { return x * x; };

// C++20: dynamic allocation during constant evaluation is allowed,
// as long as the memory is freed before the evaluation ends. So a
// std::vector can be used inside a constexpr function, but cannot
// itself be a constexpr variable.
constexpr int sum_vector() {
    std::vector<int> v{1, 2, 3};
    int total = 0;
    for (int x : v) total += x;
    return total;
}
static_assert(sum_vector() == 6);
```

### consteval (C++20)
```cpp
// Must be evaluated at compile time
consteval int square(int x) {
    return x * x;
}

// This works
static_assert(square(5) == 25);

// This fails (runtime argument)
// int y = 10;
// auto result = square(y);  // Error!
```

### constexpr std::array Operations
```cpp
#include <array>

// Compile-time bubble sort over a std::array
template <typename T, std::size_t N>
constexpr std::array<T, N> sort_array(std::array<T, N> arr) {
    for (std::size_t i = 0; i < N - 1; i++) {
        for (std::size_t j = 0; j < N - i - 1; j++) {
            if (arr[j] > arr[j + 1]) {
                auto temp = arr[j];
                arr[j] = arr[j + 1];
                arr[j + 1] = temp;
            }
        }
    }
    return arr;
}

constexpr auto sorted = sort_array(std::array{5, 2, 8, 1, 9});
static_assert(sorted[0] == 1);
```

## Lazy Evaluation
```cpp
#include <functional>
#include <iostream>
#include <optional>

// Lazy computation - only compute when the value is first needed
template <typename T>
class Lazy {
    std::function<T()> compute_;
    mutable std::optional<T> cached_;

public:
    explicit Lazy(std::function<T()> fn) : compute_(std::move(fn)) {}

    const T& get() const {
        if (!cached_) {
            cached_ = compute_();
        }
        return *cached_;
    }
};

// Usage
Lazy<int> expensive([]() {
    std::cout << "Computing...\n";
    return 42;
});
std::cout << "Created\n";              // No computation yet
std::cout << expensive.get() << "\n";  // Computes now
std::cout << expensive.get() << "\n";  // Uses cached value
```

## Small String Optimization (SSO)
```cpp
// std::string stores short strings inline, without heap allocation
std::string s = "Hello";  // No heap allocation on typical implementations

// Repeated concatenation can trigger reallocations once the string
// outgrows the inline buffer
std::string long_string;
for (int i = 0; i < 1000; i++) {
    long_string += "x";  // May reallocate several times as it grows
}

// Better: reserve up front (or build with std::ostringstream)
std::string better;
better.reserve(1000);
for (int i = 0; i < 1000; i++) {
    better += "x";  // No reallocation
}
```

## Profile-Guided Optimization (PGO)
```shell
# 1. Instrumented build
clang++ -fprofile-instr-generate -O2 -o program program.cpp

# 2. Run with representative input (writes default.profraw)
./program < typical_input.txt

# 3. Convert the raw profile into the indexed format
llvm-profdata merge -output=default.profdata default.profraw

# 4. Rebuild using the profile data
clang++ -fprofile-instr-use=default.profdata -O2 -o optimized program.cpp
```

## Compiler Flags for Optimization
Section titled “Compiler Flags for Optimization”| Flag | Effect |
|---|---|
| -O0 | No optimization (debug) |
| -O1 | Basic optimization |
| -O2 | Standard optimization |
| -O3 | Aggressive optimization |
| -Os | Size optimization |
| -Ofast | Fastest (may relax FP) |
| -march=native | CPU-specific |
| -flto | Link-time optimization |
## Optimization Checklist

- Profile to identify the bottleneck
- Measure baseline performance
- Apply a single optimization
- Measure again
- Keep the change only if it is an improvement
- Repeat with the next optimization
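The measure-change-measure loop above needs a timer. A minimal sketch using `std::chrono` (the `time_us` helper name is illustrative, not from any library):

```cpp
#include <chrono>
#include <cstdint>

// Time a callable and return the elapsed wall-clock microseconds.
// steady_clock is monotonic, so it is safe for interval measurement.
template <typename F>
std::int64_t time_us(F&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

Measure the baseline, apply one change, and compare the two numbers on the same input; repeat each measurement several times to average out noise.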
## Key Takeaways

- Always profile before optimizing
- Focus on hot spots identified by the profiler
- Prefer algorithmic improvements over micro-optimizations
- Cache locality often matters more than instruction count
- Use compiler hints sparingly - trust the compiler
- Test with realistic data after optimizing