Recipe

Tokenizer deep dive

Tokenization is the load-bearing step between raw text and a model's embedding table. This recipe walks through how Meridian normalizes, segments, and encodes inputs before they ever reach an LLM, and how to inspect the resulting token stream when something looks off.

1. Normalization

Meridian applies NFC Unicode normalization, strips zero-width joiners, and collapses runs of whitespace. This makes byte-pair merges deterministic across copy-pasted content from Word, Slack, and terminal sources that would otherwise hash to different ids.

2. Segmentation

A pre-tokenizer splits on whitespace and punctuation boundaries, then BPE merges run greedily over each pre-token. Numbers are digit-split so "2026" encodes to four ids; code blocks are preserved verbatim so indentation survives the round trip.

3. Encoding

Final ids are packed into a Uint32Array alongside an offset map so you can slice back to source spans. Inspect a stream with the built-in CLI:

$ meridian tokenize --model gpt-4o "hello world"
[ 24912, 2375 ]
offsets: [ [0,5], [5,11] ]
bytes:   11  tokens: 2  ratio: 5.5