A tokenizer maps text ↔ integer IDs using a vocabulary learned from a training corpus before model pretraining. The model never sees raw characters, only token IDs. The tokenizer is bundled with every model, and the exact same tokenizer must be used at inference time: IDs from one vocabulary are meaningless in another.
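A quick round trip makes the mapping concrete. A minimal sketch using OpenAI's tiktoken library (assumes `tiktoken` is installed; the sample string is arbitrary):

```python
import tiktoken

# Load the byte-level BPE encoding used by GPT-4 (cl100k_base).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers map text to integer IDs."
ids = enc.encode(text)          # text -> list of token IDs
print(ids)                      # exact IDs depend on the vocabulary
print(enc.decode(ids) == text)  # True: decoding inverts encoding
```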
Common Tokenization Algorithms
BPE (Byte-Pair Encoding)
Iteratively merges the most frequent pair of adjacent symbols, starting from individual characters (or raw bytes in byte-level BPE), to build the vocabulary. Used by GPT-2, GPT-4, and the Llama models (tiktoken / SentencePiece BPE implementations). Vocabulary size: roughly 32K (Llama 1/2) to 128K (Llama 3) tokens.
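The training loop itself is small. A toy sketch of the merge step from Sennrich et al.'s original BPE formulation (the corpus, frequencies, and function names are illustrative, not any library's API):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with word frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```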
WordPiece
Similar to BPE, but merges are chosen by the gain in training-data likelihood rather than raw pair frequency. The original BERT tokenizer, and still common in embedding models; non-initial subwords are marked with a "##" prefix.
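The prefix convention is easy to see in practice. A sketch with Hugging Face transformers (assumes `transformers` is installed and the `bert-base-uncased` checkpoint can be downloaded):

```python
from transformers import AutoTokenizer

# BERT's WordPiece tokenizer; continuation subwords carry the "##" prefix.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']
```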
Unigram (SentencePiece)
Starts from a large candidate vocabulary and prunes it using an EM-trained unigram language model, removing the tokens whose removal least hurts corpus likelihood. Used by XLNet and mT5. SentencePiece treats text as a raw character stream and encodes spaces as an ordinary symbol ("▁"), so it handles multiple languages and whitespace uniformly.
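The uniform space handling shows up directly in the token strings. A sketch using mT5's SentencePiece tokenizer (assumes `transformers` and `sentencepiece` are installed; the exact split depends on the vocabulary):

```python
from transformers import AutoTokenizer

# SentencePiece marks a preceding space with "▁" instead of treating
# whitespace as a special delimiter.
tok = AutoTokenizer.from_pretrained("google/mt5-small")
print(tok.tokenize("Hello world"))  # e.g. ['▁Hello', '▁world']
```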
Token Counting Rules of Thumb
| Content type | Approx. tokens |
|---|---|
| 1 English word (common) | ~1 token |
| 1 English word (average) | ~1.3 tokens |
| 1 page of English text (~500 words) | ~650 tokens |
| 1 Italian word (average) | ~1.5–2 tokens (inflected forms split into more pieces) |
| 1 line of Python code | ~10–15 tokens |
| 1 JSON key-value pair | ~5–10 tokens |
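These ratios are rules of thumb and vary by tokenizer; the cheapest way to check is to count with the tokenizer you actually deploy. A sketch using tiktoken's cl100k_base (the sample strings are illustrative):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "common English word": "the",
    "English sentence": "The quick brown fox jumps over the lazy dog.",
    "Italian sentence": "La volpe marrone salta sopra il cane pigro.",
    "line of Python": "def add(a: int, b: int) -> int: return a + b",
    "JSON key-value pair": '"user_id": 12345',
}

for label, text in samples.items():
    print(f"{label}: {len(enc.encode(text))} tokens")
```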
Why Tokenizer Choice Matters
A model with a multilingual vocabulary (e.g., Llama 3 versus the original Llama 1) tokenizes Italian, French, or German text into far fewer tokens, which means more content fits in the same context window and each request costs fewer tokens to process. Always check the tokenizer's vocabulary composition before deploying in non-English languages.
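A direct way to measure the gap, using two ungated checkpoints as stand-ins (any English-centric vs. multilingual pair shows the same effect; expect the multilingual tokenizer to need noticeably fewer tokens):

```python
from transformers import AutoTokenizer

ITALIAN = "La disponibilità dell'applicazione dipende dalla configurazione del server."

# gpt2 has an English-centric BPE vocab; xlm-roberta-base has a multilingual
# SentencePiece vocab. tokenize() avoids special tokens so counts compare cleanly.
for name in ["gpt2", "xlm-roberta-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tok.tokenize(ITALIAN))} tokens")
```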