Branch prediction primer

Modern CPUs speculatively execute instructions past conditional branches. When the predictor guesses right, the branch costs almost nothing. When it guesses wrong, the pipeline flushes the speculative work and you pay roughly 15-20 cycles. This primer walks through how branch prediction shapes hot-path code and how Meridian routes inference workloads to dodge the worst cases.

1. Why branches cost more than they look

A single branch that mispredicts often can dominate the runtime of a tight loop, even when the rest of the kernel is vectorizable. Inside Meridian's router, the model-selection switch was the original hotspot until we converted it to a jump table indexed by model id.

The lesson: data-dependent branches in inner loops are the enemy. Predictable, repetitive patterns are cheap; random or input-driven patterns are expensive.
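As a rough illustration of the jump-table conversion mentioned above, the sketch below replaces a chain of comparisons with one indexed call through a table of function pointers. The model ids, handler names, and return values are hypothetical placeholders, not Meridian's actual router code.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical model ids; the real router's ids are not shown in this primer.
enum model_id { MODEL_SMALL, MODEL_MEDIUM, MODEL_LARGE, MODEL_COUNT };

typedef int (*route_fn)(int request_tokens);

// Placeholder handlers that just return a made-up backend number.
static int route_small(int request_tokens)  { (void)request_tokens; return 1; }
static int route_medium(int request_tokens) { (void)request_tokens; return 2; }
static int route_large(int request_tokens)  { (void)request_tokens; return 3; }

// The jump table: one entry per model id, indexed directly.
static const route_fn route_table[MODEL_COUNT] = {
    [MODEL_SMALL]  = route_small,
    [MODEL_MEDIUM] = route_medium,
    [MODEL_LARGE]  = route_large,
};

// One bounds check plus one indirect call, instead of a compare per case.
int dispatch(enum model_id id, int request_tokens) {
    assert((unsigned)id < (unsigned)MODEL_COUNT);
    return route_table[id](request_tokens);
}
```

The indexed call is still an indirect branch that the CPU has to predict, so a table like this pays off most when consecutive requests target the same model.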

2. Making branches predictable

Three techniques carry most of the wins: sort your data so equal keys cluster, hoist invariant conditions out of the loop, and replace short conditionals with branchless arithmetic where the compiler will not do it for you.

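// Branchy: the outcome depends on the data, so random input mispredicts often
long sum_branchy(const int *data, int n) {
  long sum = 0;
  for (int i = 0; i < n; i++) {
    if (data[i] >= 128) sum += data[i];
  }
  return sum;
}

// Branchless: the comparison yields 0 or 1, so mask is 0 or all ones (-1)
long sum_branchless(const int *data, int n) {
  long sum = 0;
  for (int i = 0; i < n; i++) {
    int mask = -(data[i] >= 128);
    sum += data[i] & mask;  // adds data[i] when the mask is -1, else adds 0
  }
  return sum;
}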
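The example above covers the branchless technique only. For hoisting, a minimal sketch might look like the following, assuming a hypothetical summation kernel with a `use_scale` flag that never changes inside the loop. The function and parameter names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

// Naive version: the loop-invariant flag is re-tested on every iteration.
long sum_scaled_naive(const int *data, size_t n, int use_scale, int scale) {
    assert(data != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (use_scale) {
            sum += (long)data[i] * scale;
        } else {
            sum += data[i];
        }
    }
    return sum;
}

// Hoisted version: test the flag once, then run a branch-free loop body.
long sum_scaled_hoisted(const int *data, size_t n, int use_scale, int scale) {
    assert(data != NULL || n == 0);
    long sum = 0;
    if (use_scale) {
        for (size_t i = 0; i < n; i++) {
            sum += (long)data[i] * scale;
        }
    } else {
        for (size_t i = 0; i < n; i++) {
            sum += data[i];
        }
    }
    return sum;
}
```

An invariant branch predicts well on its own, so the gain here comes mostly from a simpler loop body that is easier to vectorize. Optimizing compilers often perform this transformation themselves (loop unswitching), so it is worth checking the generated code before hoisting by hand.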

3. How Meridian applies this

The gateway hot path batches requests by model id before dispatching, so the dispatch switch sees long runs of identical targets and the predictor stays warm. Token accounting uses branchless masks for the common case of in-budget requests, falling back to a branch only when a quota check actually trips. The result: median routing overhead under 80 microseconds even at 250 concurrent streams.
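The two ideas in this section can be sketched roughly as follows. This is not Meridian's gateway code: the `request` struct, `MODEL_COUNT`, and both function names are assumptions made for illustration. The first function groups a batch by model id with a stable counting sort, and the second updates a token counter with a mask instead of a conditional.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_COUNT 4  // hypothetical number of models

struct request {
    unsigned model_id;  // assumed to be < MODEL_COUNT
    int64_t  tokens;
};

// Group requests by model id with a stable counting sort, so requests for
// the same model end up adjacent in `out` and dispatch sees long runs.
void batch_by_model(const struct request *in, struct request *out, size_t n) {
    size_t start[MODEL_COUNT + 1] = {0};

    // Count requests per model, shifted up by one slot.
    for (size_t i = 0; i < n; i++) {
        assert(in[i].model_id < MODEL_COUNT);
        start[in[i].model_id + 1]++;
    }
    // Prefix sum: start[m] becomes the first output index for model m.
    for (size_t m = 0; m < MODEL_COUNT; m++) {
        start[m + 1] += start[m];
    }
    // Scatter in arrival order, which preserves order within each model.
    for (size_t i = 0; i < n; i++) {
        out[start[in[i].model_id]++] = in[i];
    }
}

// Charge `tokens` against a quota without branching on the update.
// Returns 1 if the request was charged, 0 if it would exceed the limit.
int charge_tokens(int64_t *used, int64_t limit, int64_t tokens) {
    int64_t in_budget = (*used + tokens <= limit);  // 1 or 0
    *used += tokens & -in_budget;                   // adds tokens or 0
    return (int)in_budget;
}
```

In this sketch the caller still branches on the return value of `charge_tokens` to reject a request, but that branch is rarely taken when most requests are in budget, so it should predict well.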