The rough rule
For English text, a safe back-of-the-envelope estimate is 1 token ≈ 4 characters. This holds across most GPT-family tokenizers because common English words usually encode as a single token together with their leading space, while rarer words split into 2–3 tokens, which averages out to roughly four characters per token over a large corpus.
100 words → ~500 chars → ~125 tokens
1,000 words → ~5,000 chars → ~1,250 tokens
Caveat: code, JSON, non-English languages, and repetitive patterns deviate significantly. Always measure with a real tokenizer for production budgets.
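The rule of thumb above can be sketched as a tiny estimator. This is a heuristic only, not a tokenizer; the function name `estimate_tokens` and the constant `CHARS_PER_TOKEN` are illustrative names, not part of any library.

```python
import math

# Rule-of-thumb ratio from the text above; only reasonable for English prose.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate from character count alone."""
    if not text:
        return 0
    # Round up so a short string never estimates to zero tokens.
    return math.ceil(len(text) / chars_per_token)


sample = "word " * 100          # 100 five-character "words" -> 500 characters
print(estimate_tokens(sample))  # 500 / 4 = 125
```

Rounding up keeps the estimate slightly conservative, which is usually the safer direction when checking text against a context limit.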
tiktoken — Python
OpenAI's official tokenizer. Install it, pick an encoding, and call .encode().
pip install tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))  # 4
Supported encodings: cl100k_base (GPT-4/3.5-turbo), o200k_base (GPT-4o/o1), p50k_base (text-davinci-003), r50k_base (GPT-3).
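To see how far the 4-character estimate drifts from a measured count, one option is a small comparison helper. The sketch below takes the encoder as a plain callable so it does not depend on any particular library; `chars_per_token` and `estimate_error` are hypothetical helper names introduced here for illustration.

```python
from typing import Callable, Sequence

# Any callable that maps a string to a sequence of tokens works here,
# for example the .encode method of a tiktoken encoding.
Encode = Callable[[str], Sequence]


def chars_per_token(text: str, encode: Encode) -> float:
    """Measured characters per token for `text` under the given encoder."""
    n_tokens = len(encode(text))
    if n_tokens == 0:
        return 0.0
    return len(text) / n_tokens


def estimate_error(text: str, encode: Encode) -> int:
    """Signed gap: the 4-char estimate minus the measured token count."""
    estimated = -(-len(text) // 4)  # ceiling division without importing math
    return estimated - len(encode(text))
```

With tiktoken installed, passing `tiktoken.get_encoding("o200k_base").encode` as `encode` gives measured numbers for that encoding. A negative `estimate_error` means the rule of thumb undercounted.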
js-tiktoken — TypeScript
The JS port of tiktoken. Works in Node.js, Edge runtimes, and the browser. Uses the same BPE ranks as the Python library.
npm install js-tiktoken
import { encodingForModel } from "js-tiktoken";
const enc = encodingForModel("gpt-4o");
const tokens = enc.encode("Hello, world!");
console.log(tokens.length); // 4
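js-tiktoken is a pure-JavaScript port, so there is no WASM memory to release and no .free() call is needed. The separate WASM-based tiktoken npm package is different: there, call .free() on the encoder when done, because the WASM backing store is not garbage-collected automatically.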
When the 4-char rule breaks
- Code & whitespace: indentation-heavy code can hit 1 token per 2–3 characters. In older encodings such as r50k_base each leading space is its own token; newer encodings merge runs of spaces, but symbols and short identifiers still tend to pack fewer characters per token than prose.
- Non-English: languages like Japanese, Korean, or Arabic often consume 1–2 tokens per character since many glyphs are rare in the BPE vocabulary.
- Repetition: “aaaaaaaaaa…” can cost more tokens than its length suggests because the vocabulary has no entry for long runs of a single letter, so the run splits into many short tokens.
- Special tokens: chat-template markers like <|im_start|> count as tokens too; budget ~4–8 tokens per message for framing overhead.
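The per-message framing overhead from the last bullet can be folded into a budget estimate along these lines. This is a rough sketch under stated assumptions: the `role`/`content` message shape, the name `estimate_chat_tokens`, and the default of 4 framing tokens are all illustrative, and the real overhead depends on the model's chat template.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: flat per-message framing cost. The text above suggests ~4-8
# tokens; the true value depends on the model's chat template.
FRAMING_TOKENS_PER_MESSAGE = 4


def estimate_chat_tokens(
    messages: List[Dict[str, str]],
    encode: Callable[[str], Sequence],
    framing_tokens: int = FRAMING_TOKENS_PER_MESSAGE,
) -> int:
    """Sum content tokens plus a flat framing allowance for each message."""
    total = 0
    for message in messages:
        # Content is measured with the supplied encoder; the role and the
        # template markers are covered by the flat framing allowance.
        total += framing_tokens + len(encode(message["content"]))
    return total
```

Passing the upper bound (`framing_tokens=8`) gives a more pessimistic budget, which may be preferable when a request must not exceed the context window.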
Quick reference
| Input | Approx. equivalent |
|---|---|
| 1 English word | ~1.3 tokens |
| 1 sentence (~15 words) | ~20 tokens |
| 1 paragraph (~100 words) | ~130 tokens |
| 1 page (~500 words) | ~650 tokens |
| 1,000 tokens | ~750 words |