How much VRAM does Llama 70B need?

About 40-48GB at 4-bit quantization, ~70GB at 8-bit, and ~140GB+ at full FP16. So a single 48GB card (A6000) or 80GB (A100/H100) runs it at 4-bit; FP16 needs multi-GPU.

What is the VRAM formula?

VRAM ≈ parameters (billions) × bytes per weight + context overhead. Bytes per weight: ~0.5 (4-bit), ~1 (8-bit), ~2 (FP16). Add 10-20% for KV cache and activations.

AI-RADAR.IT · AI-RADAR.NET · AI-RADAR.TECH

News & analysis on local LLMs, stack & on-prem hardware.

GUIDESIZING

How much VRAM do you need to run Llama 70B?

Evergreen guide · updated 2026

Key takeaway

A 70B model needs roughly 40-48GB of VRAM at 4-bit, ~70GB at 8-bit, and ~140GB+ at FP16. In practice: one 48GB A6000 or 80GB A100/H100 runs it quantized; full precision needs multiple GPUs. The general rule: VRAM ≈ params(B) × bytes-per-weight + ~15% overhead.

The amount of VRAM a model needs is driven by two things: how many parameters it has, and how many bytes each parameter takes — which depends on quantization. Quantization trades a little quality for a large memory saving, and is how 70B models fit on a single card.

Llama 70B by quantization

Precision	VRAM needed	Fits on
4-bit (Q4)	~40-48GB	48GB card, or 2×24GB
8-bit (Q8)	~70GB	80GB card
FP16	~140GB+	2× 80GB (multi-GPU)

Add ~10-20% on top for the KV cache, which grows with context length — long contexts need noticeably more memory.

The sizing formula for any model

VRAM (GB) ≈ params(B) × bytes/weight × 1.15
where bytes/weight ≈ 0.5 (4-bit), 1 (8-bit), 2 (FP16). Example: 13B at 4-bit ≈ 13 × 0.5 × 1.15 ≈ 7.5GB.

Continue exploring