Recipe

Byte-pair encoding primer

Byte-pair encoding (BPE) is the tokenization workhorse behind most modern LLMs. It starts from raw bytes and greedily merges the most frequent adjacent pair into a new symbol, repeating until the vocabulary hits a target size. The result: a compact, reversible mapping from text to integer ids that handles any Unicode string without an <UNK> token.

1. Initialize the vocabulary

Begin with the 256 single-byte tokens. Every input string is first encoded to UTF-8 bytes, guaranteeing full coverage. No pretokenization heuristic is required for correctness, though GPT-style models split on whitespace and punctuation first to keep merges interpretable.

2. Count and merge pairs

Scan the corpus, tally every adjacent token pair, and merge the most frequent pair into a new id. Append the merge rule to an ordered list. Repeat for N iterations where N is your vocab budget minus 256.

3. Encode and decode

At inference, replay the merge rules in order against the byte sequence. Decoding is just a lookup table from id back to bytes, then a UTF-8 decode. The whole pipeline is lossless and deterministic.

def train_bpe(corpus, vocab_size):
    ids = list(corpus.encode("utf-8"))
    merges = {}
    for new_id in range(256, vocab_size):
        pair = most_common_pair(ids)
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

Back to all recipes