Tokenizer

Core

The component that converts raw text into token IDs (numbers) that the model processes — and converts token IDs back to text. A critical factor in multilingual performance and context efficiency.

A tokenizer maps text ↔ integer IDs using a vocabulary learned during pretraining. The model never sees raw characters — only token IDs. The tokenizer is bundled with every model and must be used consistently at inference time.
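A minimal sketch of that round trip, assuming the Hugging Face transformers library and the gpt2 tokenizer purely as an example (any model's bundled tokenizer exposes the same encode/decode interface):

```python
# Text -> token IDs -> text, assuming Hugging Face `transformers` and the
# "gpt2" tokenizer as an illustrative choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenizers map text to integer IDs."
ids = tokenizer.encode(text)      # the only representation the model ever sees
decoded = tokenizer.decode(ids)   # back to a string for the user

print(ids)      # a list of integers; exact IDs depend on the vocabulary
print(decoded)  # "Tokenizers map text to integer IDs."
```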

Common Tokenization Algorithms

BPE (Byte-Pair Encoding)

Iteratively merges the most frequent adjacent symbol pairs (starting from characters or bytes) to build the vocabulary. Used by GPT-2, GPT-4, and the Llama models (tiktoken / SentencePiece BPE implementations). Typical vocabulary size: 32K–128K tokens.
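The core merge loop is small enough to sketch directly. The toy trainer below is not a production tokenizer, just an illustration of repeatedly picking and merging the most frequent adjacent pair:

```python
# Toy BPE training loop: merge the most frequent adjacent symbol pair,
# repeat. Illustrative only -- real tokenizers work on bytes and add
# many practical details (pre-tokenization, special tokens, etc.).
from collections import Counter

def bpe_train(words, num_merges):
    corpus = Counter(tuple(w) for w in words)   # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)        # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])   # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
# e.g. [('l', 'o'), ('lo', 'w'), ('e', 'r'), ...] -- frequent pairs become new symbols
```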

WordPiece

Similar to BPE but uses likelihood instead of frequency as the merge criterion. Original BERT tokenizer. Common in embedding models.
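As a quick illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint, WordPiece splits out-of-vocabulary words into pieces marked with the "##" continuation prefix:

```python
# WordPiece subword splitting, assuming Hugging Face `transformers` and the
# "bert-base-uncased" checkpoint as an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Continuation pieces carry the "##" prefix; exact splits depend on the vocabulary.
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']
print(tokenizer.tokenize("unbelievably"))   # split into several '##' pieces
```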

Unigram (SentencePiece)

Starts with a large candidate vocabulary and prunes it using an EM-based likelihood criterion. Used by XLNet and mT5. Because SentencePiece operates directly on raw text, treating whitespace as an ordinary symbol, it handles many languages uniformly.
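A short sketch, assuming the Hugging Face transformers library (with the sentencepiece package installed) and the google/mt5-small checkpoint as an example of a SentencePiece Unigram model; note the "▁" marker that SentencePiece uses in place of spaces:

```python
# Unigram / SentencePiece tokenization, assuming Hugging Face `transformers`
# plus the `sentencepiece` package, with "google/mt5-small" as an example model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

# The same model segments any language; spaces are encoded as the "▁" symbol.
print(tokenizer.tokenize("Hello world"))
print(tokenizer.tokenize("Ciao mondo"))
print(tokenizer.tokenize("こんにちは世界"))
```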

Token Counting Rules of Thumb

Content type                            Approx. tokens
1 English word (common)                 ~1 token
1 English word (average)                ~1.3 tokens
1 page of English text (~500 words)     ~650 tokens
1 Italian word (average)                ~1.5–2 tokens (morphology)
1 line of Python code                   ~10–15 tokens
1 JSON key-value pair                   ~5–10 tokens
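These figures are rough averages and vary by tokenizer and domain; a quick way to check them against your own data is to count tokens directly. The sketch below assumes the tiktoken library and its cl100k_base encoding (used by GPT-4-era models); other tokenizers will give somewhat different counts:

```python
# Sanity-checking the rules of thumb above with real token counts.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English sentence": "The quick brown fox jumps over the lazy dog.",
    "Italian sentence": "La volpe marrone salta agilmente sopra il cane pigro.",
    "Line of Python":   "result = [x**2 for x in range(10) if x % 2 == 0]",
    "JSON key-value":   '"temperature": 0.7,',
}

for label, text in samples.items():
    n_tokens = len(enc.encode(text))
    print(f"{label:18s} {n_tokens:3d} tokens for {len(text.split()):2d} words")
```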

Why Tokenizer Choice Matters

A model with a multilingual vocabulary (e.g., Llama 3 vs original Llama 1) tokenizes Italian, French, or German text into far fewer tokens — meaning more content fits in the same context window and inference is faster. Always check the tokenizer's vocabulary composition if you deploy in non-English languages.
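One way to verify this before deployment is to tokenize the same non-English sentence with an English-centric tokenizer and a multilingual one and compare the counts. The sketch below assumes the Hugging Face transformers library and uses gpt2 and xlm-roberta-base purely as illustrative endpoints; substitute the models you are actually evaluating:

```python
# Hedged sketch: compare how many tokens two tokenizers need for the same
# Italian sentence. Model names are illustrative, not a recommendation.
from transformers import AutoTokenizer

sentence = "L'intelligenza artificiale sta trasformando il modo in cui lavoriamo."

for name in ["gpt2", "xlm-roberta-base"]:   # English-centric BPE vs multilingual SentencePiece
    tok = AutoTokenizer.from_pretrained(name)
    n = len(tok.encode(sentence, add_special_tokens=False))
    print(f"{name:20s} {n} tokens")

# Fewer tokens for the same sentence means more content fits per context
# window and fewer decoding steps per request.
```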