Token estimation

Understand how LLM tokenizers count text so you can budget API costs and size prompts accurately.

The rough rule

For English text, a useful back-of-the-envelope estimate is 1 token ≈ 4 characters. This holds across most GPT-family tokenizers because common English words, usually together with their leading space, map to one to three tokens each, which averages out to roughly four characters per token over a large corpus.

100 words ~500 chars ~125 tokens

1,000 words ~5,000 chars ~1,250 tokens

Caveat: code, JSON, non-English languages, and repetitive patterns deviate significantly. Always measure with a real tokenizer for production budgets.
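The rule of thumb above can be sketched as a small helper for first-pass budgeting. This is a minimal sketch, not a tokenizer: estimate_tokens and estimate_cost are illustrative names, and the price passed to estimate_cost is a placeholder you would replace with your provider's current rate.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English prose only)."""
    if not text:
        return 0
    # Round up so the estimate errs on the side of a larger budget.
    return math.ceil(len(text) / chars_per_token)


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into a cost, given a price per million tokens.

    The price is an input, not a real rate: look it up for your model.
    """
    return tokens / 1_000_000 * usd_per_million_tokens


print(estimate_tokens("x" * 500))      # 500 chars / 4 -> 125
print(estimate_tokens("Hello, world!"))  # 13 chars / 4, rounded up -> 4
```

Treat the result as a planning number only; for code, JSON, or non-English text the real count can be far higher.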

tiktoken — Python

OpenAI's official tokenizer. Install it, pick an encoding, and call .encode().

pip install tiktoken

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4

Supported encodings: cl100k_base (GPT-4/3.5-turbo), o200k_base (GPT-4o/o1), p50k_base (text-davinci-003), r50k_base (GPT-3).
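Once you can encode text, one common use is trimming input to a fixed token budget: encode, slice the token list, and decode the slice. The sketch below assumes tiktoken is installed and that o200k_base matches your model; truncate_to_budget and load_tiktoken_codec are hypothetical helper names, not part of tiktoken. The encoder is passed in as a pair of callables so the truncation logic has no hard dependency on the library.

```python
from typing import Callable, List, Tuple


def truncate_to_budget(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_tokens: int,
) -> str:
    """Return text cut down to at most max_tokens tokens (hypothetical helper)."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget, return unchanged
    # Keep the first max_tokens tokens and turn them back into text.
    return decode(tokens[:max_tokens])


def load_tiktoken_codec(
    encoding_name: str = "o200k_base",
) -> Tuple[Callable[[str], List[int]], Callable[[List[int]], str]]:
    """Build (encode, decode) from tiktoken; imported lazily on first call."""
    import tiktoken  # pip install tiktoken

    enc = tiktoken.get_encoding(encoding_name)
    return enc.encode, enc.decode
```

With tiktoken installed, usage would look like encode, decode = load_tiktoken_codec() followed by truncate_to_budget(long_text, encode, decode, max_tokens=2000). Because the cut happens at a token boundary rather than a character boundary, it can land inside a multi-byte character, in which case the decoded tail may end with a replacement character.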

js-tiktoken — TypeScript

A pure-JavaScript port of tiktoken. Works in Node.js, Edge runtimes, and the browser. Uses the same BPE ranks as the Python library.

npm install js-tiktoken

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4o");
const tokens = enc.encode("Hello, world!");
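console.log(tokens.length); // 4

js-tiktoken is pure JavaScript, so there is no .free() call and nothing to release. Only the separate WASM-based tiktoken npm package requires enc.free() when you are done, because its WASM memory is not garbage-collected automatically.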

When the 4-char rule breaks

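  • Code & whitespace: indentation-heavy code can hit 1 token per 2–3 characters. Older encodings such as r50k_base tokenize most leading spaces individually; newer ones (cl100k_base, o200k_base) merge runs of whitespace, but brackets, operators, and unusual identifiers still split into many short tokens.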
  • Non-English: languages like Japanese, Korean, or Arabic often consume 1–2 tokens per character since each glyph is rare in the BPE vocabulary.
  • Repetition: “aaaaaaaaaa…” can cost more tokens than its length suggests because the vocabulary only has entries for short runs of a single letter, so a long run is split into many pieces.
  • Special tokens: chat-template markers like <|im_start|> count as tokens too — budget ~4–8 tokens per message for framing overhead.
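The framing overhead in the last bullet can be folded into a per-conversation estimate. This is a rough sketch under stated assumptions: estimate_chat_tokens and heuristic_count are illustrative names, the default overhead of 6 tokens per message is simply the midpoint of the 4–8 range above, and the default counter is the 4-characters-per-token heuristic. Pass a real tokenizer's counting function for anything beyond planning.

```python
import math
from typing import Callable, Dict, List


def heuristic_count(text: str) -> int:
    """Fallback counter: roughly 4 characters per token, rounded up."""
    return math.ceil(len(text) / 4)


def estimate_chat_tokens(
    messages: List[Dict[str, str]],
    count: Callable[[str], int] = heuristic_count,
    overhead_per_message: int = 6,  # assumption: midpoint of the 4-8 range
) -> int:
    """Estimate total tokens for a chat, including per-message framing."""
    total = 0
    for message in messages:
        total += overhead_per_message       # role markers and separators
        total += count(message["content"])  # the message text itself
    return total


chat = [
    {"role": "system", "content": "x" * 8},   # 2 content tokens + 6 overhead
    {"role": "user", "content": "x" * 40},    # 10 content tokens + 6 overhead
]
print(estimate_chat_tokens(chat))  # 24
```

The exact overhead depends on the model's chat template, so measure it against real usage figures before relying on it.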

Quick reference

Input                       Approx. tokens
1 English word              1.3
1 sentence (~15 words)      ~20
1 paragraph (~100 words)    ~130
1 page (~500 words)         ~650
1,000 tokens                ~750 words

Meridian · Token estimation reference · Updated 2026