VRAM

Hardware

Video RAM — the GPU's onboard memory. The single most critical hardware resource for on-premise LLM inference. All model weights and KV cache must fit inside VRAM for full GPU speed.

VRAM is to LLMs what RAM is to databases, but roughly 10–50× faster and far more constrained in size. Running a 70B model at full FP16 precision requires about 140 GB of VRAM; quantized to Q4_K_M, it drops to roughly 42 GB. Hardware sizing for on-premise LLM deployment is almost entirely a VRAM planning exercise.
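As a rule of thumb, weight memory is simply parameter count times bits per weight. The short Python sketch below is illustrative only; the effective bits-per-weight figures are assumptions based on common llama.cpp GGUF averages, and exact sizes vary by quantization recipe.

```python
# Illustrative weight-memory estimate: parameters x bits-per-weight / 8.
# Bits-per-weight values are assumptions (typical GGUF averages), not exact.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Approximate GB needed for the model weights alone."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: {weight_vram_gb(70, quant):.0f} GB")
# e.g. FP16 ~140 GB and Q4_K_M ~42 GB for a 70B model.
```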

VRAM Requirements by Model Size

| Model size | FP16 | Q8_0 | Q4_K_M | Q3_K_M |
|---|---|---|---|---|
| 1B | 2 GB | 1 GB | 0.7 GB | 0.5 GB |
| 3.8B (Phi-3 Mini) | 7.6 GB | 3.8 GB | 2.5 GB | 1.8 GB |
| 7B | 14 GB | 7 GB | 4.5 GB | 3.3 GB |
| 13B | 26 GB | 13 GB | 8 GB | 6 GB |
| 32B | 64 GB | 32 GB | 20 GB | 14 GB |
| 70B | 140 GB | 70 GB | 42 GB | 30 GB |
| 405B | 810 GB | 405 GB | 245 GB | 180 GB |

Add 20–30% for the KV cache and activations in production serving.
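The KV cache itself can be estimated from the model's attention configuration. A minimal sketch, assuming a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128 are assumed values; read them from the real model config), shows why long contexts and large batches consume that headroom quickly:

```python
# Minimal KV-cache size sketch. Layer/head/dim values are assumptions
# typical of a 70B-class model with grouped-query attention.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """GB for the K and V tensors across every layer, token and sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

# One 8K-token sequence: ~2.7 GB. Thirty-two concurrent 8K sequences: ~86 GB.
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=1))
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=32))
```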

GPU VRAM Reference

| GPU | VRAM | Bandwidth | Class |
|---|---|---|---|
| RTX 3090 / 4090 | 24 GB | 936 / 1008 GB/s | Consumer |
| RTX 4000 Ada | 20 GB | 432 GB/s | Workstation |
| A6000 Ada | 48 GB | 864 GB/s | Workstation |
| A100 80 GB | 80 GB | 2,000 GB/s | Data centre |
| H100 SXM | 80 GB | 3,350 GB/s | Data centre |
| H200 | 141 GB | 4,800 GB/s | Data centre |
| M2 Ultra (unified) | 192 GB | 800 GB/s | Apple Silicon |

What Happens When the Model Spills to System RAM

If the model doesn't fit in VRAM, weights are offloaded to system RAM and transferred over the PCIe bus during inference. PCIe 4.0 x16 bandwidth is ~32 GB/s, versus ~2,000 GB/s for an A100's HBM, so token generation becomes 50–100× slower. For production, the model must fit in VRAM; for development and testing, partial CPU offload (e.g. llama.cpp's --n-gpu-layers) is acceptable.
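A back-of-envelope way to see the penalty: batch-1 decoding is memory-bandwidth bound, because every generated token must stream all weights once. The sketch below uses illustrative numbers from the tables above to compare HBM against PCIe feeding the same quantized 70B model:

```python
# Back-of-envelope decode speed for batch size 1: each generated token
# streams all weights once, so tokens/s is capped by bandwidth / model size.
# Figures are illustrative, taken from the tables above.

def max_tokens_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode tokens/s when memory-bandwidth bound."""
    return bandwidth_gb_per_s / weight_gb

model_gb = 42  # 70B at Q4_K_M

print(max_tokens_per_s(model_gb, 2000))  # ~48 tok/s from A100 HBM
print(max_tokens_per_s(model_gb, 32))    # ~0.8 tok/s streamed over PCIe 4.0 x16
```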

Multi-GPU Strategies

  • Tensor Parallelism: each GPU holds a shard of every layer. Lowest latency for individual requests, but requires NVLink or another high-bandwidth interconnect (a minimal launch sketch follows this list).
  • Pipeline Parallelism: Each GPU holds a set of complete layers. Better throughput for many short requests.
  • NVLink vs PCIe: NVLink offers 600 GB/s GPU-GPU bandwidth vs ~32 GB/s for PCIe. Critical for tensor parallelism at 70B+ scale.
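For reference, a minimal tensor-parallel launch with vLLM, one possible serving stack; the model id and GPU count below are assumptions for illustration:

```python
# Minimal tensor-parallel launch sketch with vLLM. Assumes 4 visible GPUs
# with a fast interconnect and access to the (illustrative) model id below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
    tensor_parallel_size=4,  # each GPU holds a shard of every layer
)

outputs = llm.generate(
    ["Summarise what the KV cache stores."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```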