SentencePiece primer

SentencePiece is a language-agnostic subword tokenizer that treats raw text as a stream of Unicode code points and learns a vocabulary directly from sentences. Unlike whitespace-split BPE, it requires no pre-tokenization: whitespace is handled as an ordinary symbol (rendered as "▁"), which makes it a common choice for multilingual and CJK models, where words are not separated by spaces. This primer walks through the three things you actually need to know before training your own model on Meridian.

1. Pick a model type

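SentencePiece supports four model types: BPE, unigram, char, and word. Unigram is the library default and a sensible choice for new projects. It produces a probabilistic segmentation, so the tokenizer can sample alternative segmentations of the same text (subword regularization) while the downstream model trains, which improves robustness on noisy inputs.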
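To make "probabilistic segmentation" concrete, here is a minimal pure-Python sketch of how a unigram model can pick the most likely split of a string with Viterbi search. The vocabulary, its probabilities, and the function name are hypothetical and chosen for illustration. This is not SentencePiece's own implementation.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (illustrative values only).
TOY_VOCAB = {ch: math.log(0.001) for ch in "▁tokenizr"}
TOY_VOCAB.update({
    "▁token": math.log(0.05),
    "▁to": math.log(0.04),
    "ken": math.log(0.02),
    "izer": math.log(0.03),
})

def viterbi_segment(text, vocab):
    """Return (pieces, log_prob) for the highest-scoring segmentation of text."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start] + vocab[piece] > best[end]:
                best[end] = best[start] + vocab[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

pieces, score = viterbi_segment("▁tokenizer", TOY_VOCAB)
print(pieces)  # ['▁token', 'izer']
```

Subword regularization replaces this single best path with a sample drawn from the distribution over segmentations, so the same sentence can reach the downstream model as different piece sequences.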

2. Size the vocabulary

Most modern LLMs land between 32k and 128k tokens. Smaller vocabularies keep the embedding table small but lengthen input sequences, modestly for English and sharply for non-Latin scripts. If your corpus is multilingual, push toward 64k and verify that byte fallback is enabled so unseen code points still round-trip cleanly.
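The idea behind byte fallback can be sketched in a few lines: a character missing from the vocabulary is split into its UTF-8 bytes, each written as a piece of the form <0xNN>, and decoding reassembles the bytes. The toy vocabulary and both function names below are hypothetical, and the sketch works character by character rather than on learned subwords, so treat it as an illustration of the mechanism only.

```python
# Hypothetical toy vocabulary; a real model learns tens of thousands of pieces.
TOY_PIECES = {"▁", "h", "i"}

def encode_with_byte_fallback(text, pieces):
    """Character-level encode: unknown characters become <0xNN> byte pieces."""
    out = []
    for ch in text.replace(" ", "▁"):
        if ch in pieces:
            out.append(ch)
        else:
            out.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return out

def decode_with_byte_fallback(tokens):
    """Reassemble byte pieces into UTF-8 text and restore spaces."""
    buf = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            buf.append(int(tok[3:5], 16))
        else:
            buf.extend(tok.encode("utf-8"))
    return buf.decode("utf-8").replace("▁", " ")

tokens = encode_with_byte_fallback("hi 世", TOY_PIECES)
print(tokens)  # ['h', 'i', '▁', '<0xE4>', '<0xB8>', '<0x96>']
print(decode_with_byte_fallback(tokens))  # hi 世
```

Without a fallback of this kind, the unseen character would collapse to a single unknown token and the original text could not be recovered on decode.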

3. Train and verify

Training is a single call. Always verify the encode/decode round-trip on a held-out sample before shipping the model into a tokenizer pipeline.

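import sentencepiece as spm

# Train a unigram model; this writes meridian.model and meridian.vocab.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='meridian',
    vocab_size=64000,
    model_type='unigram',
    character_coverage=0.9995,
    byte_fallback=True,  # unseen characters decompose into byte pieces
)

# Load the trained model and inspect the pieces for a sample sentence.
sp = spm.SentencePieceProcessor()
sp.load('meridian.model')
text = 'Hello, Meridian!'
print(sp.encode(text, out_type=str))

# Round-trip check: decoding the ids should reproduce the input.
assert sp.decode(sp.encode(text)) == text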
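To run the round-trip check over a whole held-out sample, a small helper along these lines may be useful. The helper names and the stand-in codec are hypothetical. The comparison normalizes each line first, on the assumption that the model uses SentencePiece's default NFKC-based normalization, which NFKC plus whitespace collapsing only approximates.

```python
import unicodedata

def default_normalize(text):
    """Approximate normalization (an assumption): NFKC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFKC", text).split())

def round_trip_failures(lines, encode, decode, normalize=default_normalize):
    """Return (line, decoded) pairs where decode(encode(line)) != normalize(line)."""
    failures = []
    for line in lines:
        decoded = decode(encode(line))
        if decoded != normalize(line):
            failures.append((line, decoded))
    return failures

# Stand-in codec so the sketch runs without a trained model;
# with SentencePiece, pass sp.encode and sp.decode instead.
def toy_encode(text):
    return [ord(ch) for ch in default_normalize(text)]

def toy_decode(ids):
    return "".join(chr(i) for i in ids)

sample = ["Hello, Meridian!", "こんにちは 世界", "double  space"]
print(round_trip_failures(sample, toy_encode, toy_decode))  # prints []
```

An empty list means every line survived. Any returned pair shows the original line next to what came back, which is usually enough to tell a normalization difference from a genuine coverage gap.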
