Performance Recipe
False Sharing Primer
False sharing is the silent killer of multi-threaded throughput. Two cores writing to independent variables that happen to live on the same 64-byte cache line will ping-pong that line through the coherence protocol, stalling both. This primer shows how to spot it, measure it, and eliminate it.
1. What is false sharing?
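Modern x86 CPUs move memory in 64-byte cache lines. When core A writes a byte, the line is pulled into A's L1 in Modified state and invalidated everywhere else. If core B then writes a different byte on the same line, B must take ownership of the line, which invalidates A's copy, and the line bounces back on A's next write. Neither core reads the other's data, but the coherence protocol tracks ownership per line, not per byte, so the hardware cannot tell the difference.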
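To make the layout problem concrete, here is a minimal sketch that checks whether two addresses land on the same cache line. The names PackedCounters, cache_line_of, same_cache_line, and check_packed_layout are hypothetical, and the 64-byte line size is an assumption taken from the text rather than queried from the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed line size; 64 bytes matches the figure used in the text.
constexpr std::size_t kCacheLine = 64;

// Two logically independent counters that are physically adjacent.
// Aligning the struct guarantees both fields start on the same line.
struct alignas(kCacheLine) PackedCounters {
    std::uint64_t a;  // intended to be written only by thread A
    std::uint64_t b;  // intended to be written only by thread B
};

// Index of the cache line that holds address p.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True when both addresses fall on the same cache line.
inline bool same_cache_line(const void* x, const void* y) {
    return cache_line_of(x) == cache_line_of(y);
}

// Small self-check: a and b share a line, so concurrent writers
// to them would be candidates for false sharing.
inline void check_packed_layout() {
    PackedCounters c{};
    assert(same_cache_line(&c.a, &c.b));
}
```

A helper like this can be pointed at any pair of hot fields to confirm a suspicion before reaching for a profiler.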
2. How to detect it
On Linux, perf c2c record samples loads and stores, and perf c2c report attributes HITM events (loads that hit a line held Modified in another core's cache) to specific cache lines and source lines. On Windows, the Intel VTune Memory Access analysis surfaces the same signal. A scaling curve that flattens or inverts past two threads is the canonical smell.
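When a profiler is not at hand, a cruder check is to time the same two-writer workload against a packed layout and a padded one. The sketch below is one way to do that, not a rigorous benchmark: Shared, Padded, and time_two_writers are hypothetical names, the 64-byte alignment is assumed, and the measured gap will depend on the machine and on how the threads are scheduled.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Both counters sit on one (assumed 64-byte) cache line.
struct alignas(64) Shared {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Each counter is aligned to its own (assumed 64-byte) cache line.
struct Padded {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Runs two threads, one incrementing s.a and one incrementing s.b,
// and returns the elapsed wall-clock time in seconds.
template <typename T>
double time_two_writers(T& s, std::uint64_t iters) {
    auto work = [iters](std::atomic<std::uint64_t>& c) {
        for (std::uint64_t i = 0; i < iters; ++i) {
            c.fetch_add(1, std::memory_order_relaxed);
        }
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread ta(work, std::ref(s.a));
    std::thread tb(work, std::ref(s.b));
    ta.join();
    tb.join();
    const auto stop = std::chrono::steady_clock::now();
    // Sanity check: the counter started at zero and no increment was lost.
    assert(s.a.load() == iters);
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_two_writers on a fresh Shared and a fresh Padded with the same iteration count, several times each, gives two numbers to compare. A consistently slower Shared run is a hint worth confirming with perf c2c or VTune. Older toolchains may need -pthread at build time.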
3. The fix
Pad hot per-thread state to a full cache line, or split the structure so each thread owns its own line. C++17 exposes std::hardware_destructive_interference_size (declared in <new>) for portable alignment.
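#include <array>
#include <atomic>
#include <cstdint>

struct alignas(64) Counter {
    std::atomic<std::uint64_t> value{0};
    // Explicit padding; alignas(64) already rounds sizeof(Counter) up to 64.
    char pad[64 - sizeof(std::atomic<std::uint64_t>)];
};
// One Counter per worker thread (kThreadCount is defined elsewhere).
// Each lives on its own cache line.
std::array<Counter, kThreadCount> counters;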
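The hard-coded 64 can be replaced with the C++17 constant where the standard library provides it. The following is a sketch under assumptions: kLineSize, PaddedCounter, kWorkers, padded_counters, bump, and total are hypothetical names, and the fallback of 64 bytes applies only when the library does not define the feature-test macro.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Use the library constant when available; otherwise assume 64 bytes.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kLineSize = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kLineSize = 64;  // assumed fallback
#endif

// alignas rounds sizeof up to the alignment, so no manual pad is needed.
struct alignas(kLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kLineSize,
              "each counter should occupy exactly one line");
static_assert(alignof(PaddedCounter) == kLineSize,
              "each counter should start on a line boundary");

constexpr std::size_t kWorkers = 4;  // hypothetical worker count
std::array<PaddedCounter, kWorkers> padded_counters;

// Each worker increments only its own slot.
inline void bump(std::size_t worker, std::uint64_t n = 1) {
    assert(worker < kWorkers);
    padded_counters[worker].value.fetch_add(n, std::memory_order_relaxed);
}

// Sums all per-worker slots; intended for occasional reads, not hot paths.
inline std::uint64_t total() {
    std::uint64_t sum = 0;
    for (const auto& c : padded_counters) {
        sum += c.value.load(std::memory_order_relaxed);
    }
    return sum;
}
```

The static_assert lines turn a layout regression, such as someone adding a field that spills onto a second line, into a compile error rather than a silent slowdown.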