CPU Cache Line Primer
A cache line is the smallest chunk of memory the CPU moves between DRAM and its caches. On most modern x86 and ARM chips that chunk is 64 bytes wide, though some parts, such as Apple's M-series, use 128. Reasoning about your data in cache-line units is one of the biggest levers you have for squeezing real throughput out of hot loops, and a better layout adds no instructions at runtime.
1. Why 64 bytes matters
When the CPU reads a single byte it actually pulls in the entire 64-byte line that byte lives on. That means reading one byte and reading all 64 bytes of the same line cost the same at the L1 boundary. Lay your hot fields next to each other on one line and the other 63 bytes arrive for free with the first read.
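To make the "one miss per line" arithmetic concrete, here is a minimal sketch that counts how many lines a read spans. The helper `lines_touched` and the constant `kLineBytes` are names invented for this illustration, and the 64-byte size is an assumption rather than a queried value.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed line size for illustration only; query the real value on your target.
constexpr std::size_t kLineBytes = 64;

// Hypothetical helper: how many distinct cache lines does a read of
// `len` bytes starting at address `addr` touch?
constexpr std::size_t lines_touched(std::uintptr_t addr, std::size_t len) {
    if (len == 0) return 0;
    const std::uintptr_t first = addr / kLineBytes;             // line holding the first byte
    const std::uintptr_t last  = (addr + len - 1) / kLineBytes; // line holding the last byte
    return static_cast<std::size_t>(last - first + 1);
}

// A full aligned line is one fetch; an 8-byte field straddling a boundary is two.
static_assert(lines_touched(0, 64) == 1, "aligned 64-byte read stays on one line");
static_assert(lines_touched(60, 8) == 2, "read crossing byte 64 spans two lines");
```

The second assertion is the case to watch for: a small field that straddles a line boundary costs two line fetches instead of one.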
The opposite is also true. If two unrelated fields share a line and two cores write to them, the line bounces back and forth between those cores through the coherence protocol. This is false sharing, and it can cost you an order-of-magnitude slowdown without changing a single instruction.
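A rough way to see false sharing for yourself is to time two threads that each bump their own counter, once with the counters packed together and once with each on its own line. The sketch below assumes a 64-byte line; `PackedCounters`, `PaddedCounters`, and `hammer` are hypothetical names for this example, and the timings you get will depend on your machine.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Two counters packed next to each other: they very likely share one line.
struct PackedCounters {
    std::atomic<std::uint64_t> a{0};
    std::atomic<std::uint64_t> b{0};
};

// Same counters, each forced onto its own 64-byte-aligned slot.
// 64 is an assumed line size for this sketch.
struct PaddedCounters {
    alignas(64) std::atomic<std::uint64_t> a{0};
    alignas(64) std::atomic<std::uint64_t> b{0};
};

// Hypothetical helper: two threads each increment their own counter
// `iters` times. Returns elapsed wall-clock time in milliseconds.
template <typename Counters>
double hammer(Counters& c, std::uint64_t iters) {
    auto bump = [iters](std::atomic<std::uint64_t>& x) {
        for (std::uint64_t i = 0; i < iters; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    const auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { bump(c.a); });
    std::thread t2([&] { bump(c.b); });
    t1.join();
    t2.join();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Call `hammer` on one instance of each struct with a large iteration count and compare the two times. The instructions executed are the same in both runs; only the layout differs.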
2. Measuring line size at runtime
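Avoid hardcoding the constant where you can query it. On x86, CPUID leaf 1 reports the CLFLUSH line size in EBX[15:8] in units of 8 bytes, so shift that field left by 3 to get the size in bytes. On Linux you can read it from sysfs without privileges. Here is the portable cheat sheet: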
// Linux
$ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
// C++17 (compile-time constant, declared in <new>)
#include <cstddef>
#include <new>
constexpr std::size_t LINE =
    std::hardware_destructive_interference_size;
// Rust
const LINE: usize = 64; // verify with cpuid crate
3. Practical layout rules
- Pad per-thread counters to a full line with alignas(64) so cores never collide on the same line.
- Pack read-mostly fields together and write-heavy fields together. The read line stays in shared state, the write line bounces less.
- Iterate arrays in order. The prefetcher recognises stride-one access and pulls the next 2 to 4 lines ahead of demand for free.
- Profile with perf c2c to spot false sharing before you guess at it.
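The first three rules can be sketched in a few lines of C++. This is an illustration under assumptions, not a drop-in layout: the 64-byte `kLine` is assumed rather than queried, and `PaddedCounter`, `ConnectionStats`, its field names, and `sum_counters` are all hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLine = 64;  // assumed line size for this sketch

// Rule 1: one counter per thread, each padded out to a full line.
struct alignas(kLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kLine, "one counter per line");

// Rule 2: read-mostly fields grouped on one line, write-heavy fields on another.
struct ConnectionStats {
    // Read-mostly configuration, set once at startup.
    std::uint32_t max_retries;
    std::uint32_t timeout_ms;
    const char*   peer_name;

    // Write-heavy counters start on a fresh line so updates
    // do not invalidate the configuration line above.
    alignas(kLine) std::atomic<std::uint64_t> bytes_sent;
    std::atomic<std::uint64_t> bytes_received;
};
static_assert(offsetof(ConnectionStats, bytes_sent) == kLine,
              "write-heavy group begins on its own line");

// Rule 3: walk the array front to back so the access pattern is sequential
// and easy for the hardware prefetcher to follow.
inline std::uint64_t sum_counters(const PaddedCounter* counters, std::size_t n) {
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += counters[i].value;
    return total;
}
```

The padding trades memory for isolation: each `PaddedCounter` spends 56 bytes on nothing, so reserve it for data that several cores really do write concurrently.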