Tokens/s (Throughput)

Metric

The primary performance metric for LLM inference — how many output tokens are generated per second. Determines user experience and how many concurrent users a deployment can serve.

Tokens/s is the rate at which a model generates output tokens during the decode phase. It determines whether a chat interaction feels instant or painfully slow, and how many parallel requests a server can handle.

Two Different Throughput Numbers

Prefill Throughput (tokens/s)

How fast the model processes the input prompt. Prefill handles every prompt token in parallel, so it is compute-bound and fast: typically 5,000–50,000 tokens/s on a modern GPU, quick enough that users rarely notice it. CPU prefill is much slower, around 100–500 tokens/s.

Decode Throughput (tokens/s)

How fast output tokens are generated. This is the bottleneck: decode is inherently serial, and each new token requires a full forward pass that re-reads the model weights from memory. Expect 20–80 tokens/s on a consumer GPU and 3–8 tokens/s on CPU. This is what users experience as "streaming speed".
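
Both numbers fall out of timing a single streaming request: the time to first token (TTFT) marks the end of prefill, and everything after it is decode. A minimal measurement sketch in Python follows; `stream_tokens` is a hypothetical stand-in for whatever streaming API your runtime exposes (a llama.cpp server, vLLM, or an OpenAI-compatible client).

```python
import time

def measure_throughput(stream_tokens, prompt):
    """Time one streaming generation and split prefill from decode.

    stream_tokens(prompt) is a hypothetical generator that yields one
    output token at a time; swap in your runtime's streaming call.
    """
    start = time.perf_counter()
    first = None
    n_tokens = 0
    for _ in stream_tokens(prompt):
        if first is None:
            first = time.perf_counter()  # first token: prefill just ended
        n_tokens += 1

    end = time.perf_counter()
    ttft = first - start                  # dominated by prefill
    decode_s = end - first                # pure decode phase
    decode_tps = (n_tokens - 1) / decode_s if decode_s > 0 else float("inf")
    return ttft, decode_tps
```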

Reference Speeds by Hardware

| Hardware | Model | Quant | Decode t/s |
| --- | --- | --- | --- |
| RTX 4090 (24 GB) | Llama 3.1 8B | Q4_K_M | ~80 |
| RTX 4090 (24 GB) | Llama 3.1 70B | Q4_K_M (CPU offload) | ~15 |
| 2× A100 80 GB | Llama 3.1 70B | BF16 | ~45 |
| H100 80 GB | Llama 3.1 70B | FP8 | ~80 |
| M2 Ultra (192 GB) | Llama 3.1 70B | Q4_K_M | ~30 |
| CPU (Ryzen 9, 64 GB RAM) | Llama 3.1 8B | Q4_K_M | ~8 |
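
These figures can be sanity-checked with a roofline estimate: every decode step streams the full weight set from memory once, so tokens/s is bounded above by memory bandwidth divided by model size. A back-of-envelope sketch, using approximate spec-sheet bandwidth and GGUF file sizes (both are assumptions, not measurements):

```python
def decode_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
    """Roofline upper bound on decode tokens/s.

    Each decode step must read (roughly) all model weights from
    memory once, so bandwidth / model size caps the token rate.
    """
    return bandwidth_gb_s / model_gb

# RTX 4090: ~1008 GB/s memory bandwidth.
# Llama 3.1 8B at Q4_K_M: ~4.9 GB of weights.
print(decode_ceiling(4.9, 1008))  # ~205 t/s ceiling vs ~80 t/s observed
```

The gap between the ~205 t/s ceiling and the ~80 t/s in the table comes from KV-cache reads, kernel launch overhead, and sampling, all of which add work beyond pure weight streaming.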

Human Perception Thresholds

  • <10 tokens/s: Noticeably slow — words appear one at a time. Acceptable for batch/background tasks.
  • 10–30 tokens/s: Readable streaming speed. Fine for most chat interfaces.
  • >30 tokens/s: Faster than comfortable reading speed. Ideal for production.
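
To connect these buckets to the measurement sketch above, a trivial helper (thresholds taken directly from the list; the function name is illustrative):

```python
def ux_bucket(decode_tps: float) -> str:
    """Map a measured decode rate onto the perception thresholds above."""
    if decode_tps < 10:
        return "slow: acceptable for batch/background tasks only"
    if decode_tps <= 30:
        return "readable streaming: fine for most chat interfaces"
    return "faster than reading speed: ideal for production"
```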

Improving Throughput

Options in order of ease:

  1. Use a smaller or more heavily quantized model.
  2. Enable Flash Attention.
  3. Try speculative decoding (see the sketch below).
  4. Serve with vLLM for continuous batching (this raises aggregate batch throughput, not single-user latency).
  5. Upgrade the GPU, prioritizing memory bandwidth.

Memory bandwidth, not FLOPS, is the limiting factor for decode speed: every token re-reads the full weight set from memory. This is why the H100's decode advantage over the A100 tracks its roughly 1.7× bandwidth advantage (~3.35 TB/s vs ~2 TB/s) rather than its far larger FLOPS advantage.
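
Speculative decoding (option 3) attacks the serial bottleneck directly: a small draft model proposes several tokens cheaply, and the large target model verifies them all in one forward pass, so each expensive pass can yield more than one accepted token. Below is a sketch of the greedy variant; `target` and `draft` are hypothetical callables that return the greedy next-token prediction for every position in the input, and a real implementation would reuse KV caches instead of recomputing from scratch.

```python
def speculative_decode(target, draft, prompt_ids, k=4, max_new=128):
    """Greedy speculative decoding sketch.

    target(ids) / draft(ids): hypothetical model calls returning, for
    each position i in ids, the greedy prediction of the token that
    follows position i.
    """
    ids = list(prompt_ids)
    produced = 0
    while produced < max_new:
        # 1. Draft model proposes k tokens, one cheap step at a time.
        proposal, ctx = [], list(ids)
        for _ in range(k):
            t = draft(ctx)[-1]
            proposal.append(t)
            ctx.append(t)

        # 2. Target model scores context + proposal in ONE forward pass.
        preds = target(ids + proposal)

        # 3. Accept the longest prefix on which the target agrees.
        base = len(ids) - 1
        n_accept = 0
        for i, t in enumerate(proposal):
            if preds[base + i] == t:
                n_accept += 1
            else:
                break
        ids += proposal[:n_accept]

        # 4. The target's own token at the first mismatch (or after a
        #    full accept) comes free from the same forward pass.
        ids.append(preds[base + n_accept])
        produced += n_accept + 1
    return ids
```

Because greedy verification only accepts tokens the target would have produced anyway, the output is identical to plain greedy decoding from the target model; the speedup depends entirely on how often the draft guesses right.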