Tokens/s is the rate at which a model generates output tokens during the decode phase. It determines whether a chat interaction feels instant or painfully slow, and how many parallel requests a server can handle.
Two Different Throughput Numbers
Prefill Throughput (tokens/s)
How fast the model processes the input prompt. Prefill runs in parallel over the whole prompt and is compute-bound, so it typically reaches 5,000–50,000 tokens/s on a modern GPU; users mostly notice it only as the time-to-first-token pause. CPU prefill is much slower, around 100–500 tokens/s.
Decode Throughput (tokens/s)
How fast output tokens are generated. This is the bottleneck: each new token depends on the previous one, so decode runs serially, typically 20–80 tokens/s on a consumer GPU and 3–8 tokens/s on CPU. This is what users experience as "streaming speed".
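Both numbers are easy to measure on your own hardware by timing a streamed generation: time-to-first-token approximates the prefill cost, and the spacing of the remaining tokens gives decode throughput. A minimal sketch assuming the llama-cpp-python bindings; the GGUF path and prompt are placeholders:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any local GGUF model works.
llm = Llama(model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

prompt = "Explain memory bandwidth in one paragraph."
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))

start = time.perf_counter()
first_token_at = None
n_out = 0
# stream=True yields roughly one chunk per generated token.
for _chunk in llm(prompt, max_tokens=256, stream=True):
    now = time.perf_counter()
    if first_token_at is None:
        first_token_at = now  # prompt processing (prefill) ends here
    n_out += 1

if first_token_at is None or n_out < 2:
    raise SystemExit("generation too short to measure")

ttft = first_token_at - start
decode_time = time.perf_counter() - first_token_at
print(f"prefill: {n_prompt / ttft:8.0f} tokens/s  (time to first token {ttft:.2f} s)")
print(f"decode:  {(n_out - 1) / decode_time:8.1f} tokens/s")
```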
Reference Speeds by Hardware
| Hardware | Model | Quant | Decode t/s |
|---|---|---|---|
| RTX 4090 (24 GB) | Llama 3.1 8B | Q4_K_M | ~80 |
| RTX 4090 (24 GB) | Llama 3.1 70B | Q4_K_M (CPU offload) | ~15 |
| 2× A100 80 GB | Llama 3.1 70B | BF16 | ~45 |
| 2× H100 80 GB | Llama 3.1 70B | BF16 | ~80 |
| M2 Ultra (192 GB) | Llama 3.1 70B | Q4_K_M | ~30 |
| CPU (Ryzen 9, 64 GB RAM) | Llama 3.1 8B | Q4_K_M | ~8 |
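Whether a model fits in VRAM at the chosen quantization is what separates the full-speed rows from the CPU-offload row. A rough sizing sketch; the bits-per-weight figures are approximations, and real deployments also need headroom for the KV cache and activations:

```python
# Rough weight-memory estimate: params × bits-per-weight / 8.
# Bits-per-weight values are approximate; Q4_K_M averages ~4.8 bits
# because some tensors are kept at higher precision.
BITS_PER_WEIGHT = {"BF16": 16, "Q8_0": 8.5, "Q4_K_M": 4.8}

def weight_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for params, quant in [(8, "Q4_K_M"), (70, "Q4_K_M"), (70, "BF16")]:
    print(f"{params}B {quant:7s} ≈ {weight_gb(params, quant):6.1f} GB of weights")

# 8B  Q4_K_M ≈ ~5 GB   -> fits easily in 24 GB of VRAM
# 70B Q4_K_M ≈ ~42 GB  -> needs ~48 GB, or CPU offload on a 24 GB card
# 70B BF16   ≈ ~140 GB -> needs multiple 80 GB GPUs
```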
Human Perception Thresholds
- <10 tokens/s: Noticeably slow — words appear one at a time. Acceptable for batch/background tasks.
- 10–30 tokens/s: Readable streaming speed. Fine for most chat interfaces.
- >30 tokens/s: Faster than comfortable reading speed. Ideal for production.
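To put these thresholds in more familiar units, a rough conversion to words per minute; the 0.75 words-per-token figure is a common rule of thumb for English text, not an exact value:

```python
WORDS_PER_TOKEN = 0.75  # rough average for English; varies by tokenizer and text

def words_per_minute(tokens_per_sec: float) -> float:
    return tokens_per_sec * WORDS_PER_TOKEN * 60

for tps in (5, 10, 30, 80):
    print(f"{tps:3d} tokens/s ≈ {words_per_minute(tps):5.0f} words/min")
# 5 tokens/s ≈ 225 wpm, roughly the pace of careful reading; 30 tokens/s ≈ 1350 wpm.
```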
Improving Throughput
Options, roughly in order of ease:
1. Use a smaller or more heavily quantized model.
2. Enable Flash Attention.
3. Try speculative decoding.
4. Use vLLM for continuous batching (raises aggregate throughput across many requests, not single-user latency).
5. Upgrade the GPU, prioritizing memory bandwidth.
Memory bandwidth, not FLOPS, is the limiting factor for decode speed: every output token requires streaming the full set of active weights from memory, which is why the H100 (≈3.4 TB/s) decodes noticeably faster than the A100 (≈2 TB/s) on the same model. A rough ceiling is sketched below.
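A back-of-envelope ceiling on decode speed follows directly from this: memory bandwidth divided by the bytes of weights read per token. A sketch with approximate figures (RTX 4090 ≈ 1008 GB/s, an 8B Q4_K_M model ≈ 5 GB); real decoders land well below the ceiling once KV-cache reads, dequantization, and per-token overhead are included:

```python
def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-user decode tokens/s: each token must
    stream the full weight set from memory at least once."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 memory bandwidth ≈ 1008 GB/s; Llama 3.1 8B at Q4_K_M ≈ 5 GB.
print(f"ceiling: {decode_ceiling(1008, 5):.0f} tokens/s")  # ≈ 200
# Observed ~80 tokens/s in the table above: the gap is KV-cache traffic,
# dequantization, and per-token kernel launch and sampling overhead.
```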