The amount of VRAM a model needs is driven by two things: how many parameters it has, and how many bytes each parameter takes — which depends on quantization. Quantization trades a little quality for a large memory saving, and is how 70B models fit on a single card.
Llama 70B by quantization
| Precision | VRAM needed | Fits on |
|---|---|---|
| 4-bit (Q4) | ~40-48GB | 48GB card, or 2×24GB |
| 8-bit (Q8) | ~70GB | 80GB card |
| FP16 | ~140GB+ | 2× 80GB (multi-GPU) |
Add ~10-20% on top for the KV cache, which grows with context length — long contexts need noticeably more memory.
The sizing formula for any model
VRAM (GB) ≈ params(B) × bytes/weight × 1.15
where bytes/weight ≈ 0.5 (4-bit), 1 (8-bit), 2 (FP16). Example: 13B at 4-bit ≈ 13 × 0.5 × 1.15 ≈ 7.5GB.