Tokens/s is the rate at which a model generates output tokens during the decode phase. It determines whether a chat interaction feels instant or painfully slow, and how many parallel requests a server can handle.
Two Different Throughput Numbers
Prefill Throughput (tokens/s)
How fast the model processes the input prompt. Prefill runs in parallel over the whole prompt and is compute-bound, so it typically reaches 5,000–50,000 tokens/s on a modern GPU; users mostly notice it only as the time-to-first-token pause. CPU prefill is much slower, around 100–500 tokens/s.
Decode Throughput (tokens/s)
How fast output tokens are generated. This is the bottleneck: each new token depends on the previous one, so decode runs serially, typically 20–80 tokens/s on a consumer GPU and 3–8 tokens/s on CPU. This is what users experience as "streaming speed".
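Both numbers are easy to measure on your own hardware by timing a streamed generation: time-to-first-token approximates the prefill cost, and the spacing of the remaining tokens gives decode throughput. A minimal sketch assuming the llama-cpp-python bindings; the GGUF path and prompt are placeholders:

```python
import time
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: any local GGUF model works.
llm = Llama(model_path="./llama-3.1-8b-instruct.Q4_K_M.gguf", n_gpu_layers=-1)

prompt = "Explain memory bandwidth in one paragraph."
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))

start = time.perf_counter()
first_token_at = None
n_out = 0
# stream=True yields roughly one chunk per generated token.
for _chunk in llm(prompt, max_tokens=256, stream=True):
    now = time.perf_counter()
    if first_token_at is None:
        first_token_at = now  # prompt processing (prefill) ends here
    n_out += 1

if first_token_at is None or n_out < 2:
    raise SystemExit("generation too short to measure")

ttft = first_token_at - start
decode_time = time.perf_counter() - first_token_at
print(f"prefill: {n_prompt / ttft:8.0f} tokens/s  (time to first token {ttft:.2f} s)")
print(f"decode:  {(n_out - 1) / decode_time:8.1f} tokens/s")
```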
Reference Speeds by Hardware
| Hardware | Model | Quant | Decode t/s |
|---|---|---|---|
| RTX 4090 (24 GB) | Llama 3.1 8B | Q4_K_M | ~80 |
| RTX 4090 (24 GB) | Llama 3.1 70B | Q4_K_M (CPU offload) | ~15 |
| 2× A100 80 GB | Llama 3.1 70B | BF16 | ~45 |
| 2× H100 80 GB | Llama 3.1 70B | BF16 | ~80 |
| M2 Ultra (192 GB) | Llama 3.1 70B | Q4_K_M | ~30 |
| CPU (Ryzen 9, 64 GB RAM) | Llama 3.1 8B | Q4_K_M | ~8 |
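Whether a model fits in VRAM at the chosen quantization is what separates the full-speed rows from the CPU-offload row. A rough sizing sketch; the bits-per-weight figures are approximations, and real deployments also need headroom for the KV cache and activations:

```python
# Rough weight-memory estimate: params × bits-per-weight / 8.
# Bits-per-weight values are approximate; Q4_K_M averages ~4.8 bits
# because some tensors are kept at higher precision.
BITS_PER_WEIGHT = {"BF16": 16, "Q8_0": 8.5, "Q4_K_M": 4.8}

def weight_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

for params, quant in [(8, "Q4_K_M"), (70, "Q4_K_M"), (70, "BF16")]:
    print(f"{params}B {quant:7s} ≈ {weight_gb(params, quant):6.1f} GB of weights")

# 8B  Q4_K_M ≈ ~5 GB   -> fits easily in 24 GB of VRAM
# 70B Q4_K_M ≈ ~42 GB  -> needs ~48 GB, or CPU offload on a 24 GB card
# 70B BF16   ≈ ~140 GB -> needs multiple 80 GB GPUs
```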
Human Perception Thresholds
- <10 tokens/s: Noticeably slow — words appear one at a time. Acceptable for batch/background tasks.
- 10–30 tokens/s: Readable streaming speed. Fine for most chat interfaces.
- >30 tokens/s: Faster than comfortable reading speed. Ideal for production.
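To put these thresholds in more familiar units, a rough conversion to words per minute; the 0.75 words-per-token figure is a common rule of thumb for English text, not an exact value:

```python
WORDS_PER_TOKEN = 0.75  # rough average for English; varies by tokenizer and text

def words_per_minute(tokens_per_sec: float) -> float:
    return tokens_per_sec * WORDS_PER_TOKEN * 60

for tps in (5, 10, 30, 80):
    print(f"{tps:3d} tokens/s ≈ {words_per_minute(tps):5.0f} words/min")
# 5 tokens/s ≈ 225 wpm, roughly the pace of careful reading; 30 tokens/s ≈ 1350 wpm.
```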
Improving Throughput
Options, roughly in order of ease:
1. Use a smaller or more heavily quantized model.
2. Enable Flash Attention.
3. Try speculative decoding.
4. Use vLLM for continuous batching (raises aggregate throughput across many requests, not single-user latency).
5. Upgrade the GPU, prioritizing memory bandwidth.
Memory bandwidth, not FLOPS, is the limiting factor for decode speed: every output token requires streaming the full set of active weights from memory, which is why the H100 (≈3.4 TB/s) decodes noticeably faster than the A100 (≈2 TB/s) on the same model. A rough ceiling is sketched below.
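A back-of-envelope ceiling on decode speed follows directly from this: memory bandwidth divided by the bytes of weights read per token. A sketch with approximate figures (RTX 4090 ≈ 1008 GB/s, an 8B Q4_K_M model ≈ 5 GB); real decoders land well below the ceiling once KV-cache reads, dequantization, and per-token overhead are included:

```python
def decode_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-user decode tokens/s: each token must
    stream the full weight set from memory at least once."""
    return bandwidth_gb_s / weights_gb

# RTX 4090 memory bandwidth ≈ 1008 GB/s; Llama 3.1 8B at Q4_K_M ≈ 5 GB.
print(f"ceiling: {decode_ceiling(1008, 5):.0f} tokens/s")  # ≈ 200
# Observed ~80 tokens/s in the table above: the gap is KV-cache traffic,
# dequantization, and per-token kernel launch and sampling overhead.
```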