SIMD Primer

Single Instruction, Multiple Data (SIMD) lets one CPU instruction operate on several data lanes at once. This primer walks through the mental model, the Meridian helpers that wrap the intrinsics, and a small worked example you can paste into a worker. By the end you should be able to vectorize a hot loop without leaving TypeScript.

1. The Lane Model

Think of a SIMD register as a fixed-width bus carrying parallel lanes of the same scalar type. A 128-bit register holds four float32 lanes or sixteen int8 lanes. Every arithmetic op fires on all lanes simultaneously, so a four-lane multiply costs roughly what a single scalar multiply would. The win is largest when your inner loop is arithmetic-bound and cache-friendly; a loop that spends its time waiting on memory gains little from wider arithmetic.
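The lane arithmetic above can be sketched in plain TypeScript. This is only an illustration of the mental model, not real SIMD: the names laneCount and mulLanes are hypothetical, and the loop stands in for what the hardware would do in a single instruction.

```typescript
// Width of the register described above, in bits.
const REGISTER_BITS = 128;

// Number of lanes a 128-bit register holds for a given scalar width.
function laneCount(scalarBits: number): number {
  return REGISTER_BITS / scalarBits;
}

// Emulates one four-lane float32 multiply. On SIMD hardware all four
// products come from a single instruction; here a loop stands in for it.
function mulLanes(a: Float32Array, b: Float32Array): Float32Array {
  const out = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = a[lane] * b[lane];
  }
  return out;
}

const product = mulLanes(
  new Float32Array([1, 2, 3, 4]),
  new Float32Array([10, 20, 30, 40]),
);
// product holds [10, 40, 90, 160]
```

Here laneCount(32) gives 4 and laneCount(8) gives 16, matching the float32 and int8 figures in the text.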

2. Meridian Helpers

Meridian ships a thin wrapper over WebAssembly SIMD128 so you can target Chromium, Firefox, and Node 20+ with the same source. The helper module exposes f32x4, i32x4, and i16x8 lane types plus load/store, splat, and fused multiply-add. Below is the canonical dot-product kernel; it processes four elements per iteration and finishes the remainder (the input length modulo four) in a scalar tail loop.

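import { f32x4, load, fma } from '@meridian/simd';

// Dot product of two equal-length vectors, four float32 lanes at a time.
export function dot(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new RangeError('dot: length mismatch');
  let acc = f32x4.splat(0);
  // Largest multiple of four that fits; everything past it is the scalar tail.
  const vecEnd = a.length - (a.length % 4);
  for (let i = 0; i < vecEnd; i += 4) {
    const va = load(a, i);
    const vb = load(b, i);
    acc = fma(va, vb, acc); // per lane: va * vb + acc
  }
  // Horizontal sum: collapse the four lanes into one scalar.
  let sum = acc.lane(0) + acc.lane(1) + acc.lane(2) + acc.lane(3);
  // Scalar tail for the remaining 0 to 3 elements.
  for (let i = vecEnd; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}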
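
When testing a kernel like the one above, it can help to compare it against a scalar reference that accumulates in the same order: four float32 partial sums, a final horizontal add, then the tail. The sketch below does that without using Meridian at all. dotReference is a hypothetical name, and because a vectorized version may round differently (for example with a fused multiply-add), compare the two with a small tolerance rather than exact equality.

```typescript
// Scalar reference that mirrors a four-lane accumulation order.
// dotReference is a hypothetical helper, not part of @meridian/simd.
function dotReference(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new RangeError("length mismatch");
  // One float32 partial sum per emulated lane.
  const acc = new Float32Array(4);
  // End of the part that can be processed four elements at a time.
  const vecEnd = a.length - (a.length % 4);
  for (let i = 0; i < vecEnd; i += 4) {
    for (let lane = 0; lane < 4; lane++) {
      acc[lane] += a[i + lane] * b[i + lane];
    }
  }
  // Horizontal sum of the four partial sums.
  let sum = acc[0] + acc[1] + acc[2] + acc[3];
  // Scalar tail for the remaining 0 to 3 elements.
  for (let i = vecEnd; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

const x = new Float32Array([1, 2, 3, 4, 5, 6]);
const y = new Float32Array([6, 5, 4, 3, 2, 1]);
const reference = dotReference(x, y); // 56
```

With six elements, the first four go through the lane accumulators and the last two through the tail loop, so both paths are exercised.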

3. When Not To Vectorize

SIMD has overhead: lane setup, alignment shuffles, and increased register pressure on the surrounding scalar code. If your loop runs fewer than a few thousand iterations, or branches divergently per lane, the scalar path will usually win. Profile first with meridian bench, confirm the hotspot is arithmetic-bound, and only then reach for the lane helpers.
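
One way to act on the size guideline is to pick the kernel at the call site based on input length. The sketch below is an assumption-heavy illustration: pickKernel, Kernel, and MIN_SIMD_LENGTH are hypothetical names, and the cutoff of 4096 elements is a placeholder for whatever your own profiling suggests, not a measured value.

```typescript
type Kernel = (a: Float32Array, b: Float32Array) => number;

// Hypothetical cutoff; replace with a value taken from your own profiling.
const MIN_SIMD_LENGTH = 4096;

// Plain scalar dot product, used for short inputs.
function dotScalar(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Returns the vectorized kernel only when the input is long enough
// for the lane setup cost to be worth paying.
function pickKernel(
  length: number,
  simd: Kernel,
  scalar: Kernel,
  minSimdLength: number = MIN_SIMD_LENGTH,
): Kernel {
  return length >= minSimdLength ? simd : scalar;
}

// Stand-in for a vectorized kernel so the sketch runs on its own.
const simdStub: Kernel = (a, b) => dotScalar(a, b);

const short = new Float32Array([1, 2, 3]);
const kernel = pickKernel(short.length, simdStub, dotScalar);
const result = kernel(short, short); // 1 + 4 + 9 = 14, via the scalar path
```

In real use the simd argument would be the vectorized dot kernel, and the threshold would come from measurements rather than a constant chosen up front.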