VRAM is to LLMs what RAM is to databases, except 10–50× faster and far more constrained in size. Running a 70B model at full FP16 precision requires 140 GB of VRAM; quantized to Q4_K_M, it drops to ~42 GB. Hardware sizing for on-premises LLM deployment is almost entirely a VRAM planning exercise.
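The underlying arithmetic is just parameter count times bytes per weight. A minimal sketch of that calculation, assuming rough average bits-per-weight figures for the common GGUF quantization formats (the function name and the exact per-format figures are illustrative approximations, not taken from any particular library):

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8.
# Bits-per-weight values are approximate averages for llama.cpp formats;
# real GGUF files vary slightly because different tensors use different types.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def weight_vram_gb(params_billion: float, fmt: str = "Q4_K_M") -> float:
    """Estimated GB needed just to hold the weights (no KV cache)."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[fmt] / 8
    return bytes_total / 1e9  # decimal GB, matching the table below

# 70B at Q4_K_M -> roughly 42 GB, in line with the table below
print(f"{weight_vram_gb(70, 'Q4_K_M'):.0f} GB")
```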
VRAM Requirements by Model Size
| Model size | FP16 | Q8_0 | Q4_K_M | Q3_K_M |
|---|---|---|---|---|
| 1B | 2 GB | 1 GB | 0.7 GB | 0.5 GB |
| 3.8B (Phi-4 Mini) | 7.6 GB | 3.8 GB | 2.5 GB | 1.8 GB |
| 7B | 14 GB | 7 GB | 4.5 GB | 3.3 GB |
| 13B | 26 GB | 13 GB | 8 GB | 6 GB |
| 32B | 64 GB | 32 GB | 20 GB | 14 GB |
| 70B | 140 GB | 70 GB | 42 GB | 30 GB |
| 405B | 810 GB | 405 GB | 245 GB | 180 GB |
Add 20-30% for KV cache and activations in production serving.
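To sanity-check that headroom figure, the KV cache itself is easy to bound: every token in context stores one key and one value vector per layer. A minimal sketch, assuming an FP16 cache and Llama-3-70B-style geometry (80 layers, 8 grouped-query KV heads, head dimension 128; those numbers are assumptions read from a model config, not universal constants):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    total = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total / 1e9  # decimal GB

# Llama-3-70B-style geometry, 8k context, 4 concurrent requests -> ~10.7 GB
print(f"{kv_cache_gb(80, 8, 128, 8192, 4):.1f} GB")
```

Long contexts or large batch sizes can push the overhead well past the 20–30% rule of thumb, so size against your worst-case concurrency.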
GPU VRAM Reference
| GPU | VRAM | Bandwidth | Class |
|---|---|---|---|
| RTX 3090 / 4090 | 24 GB | 936 / 1008 GB/s | Consumer |
| RTX 4000 Ada | 20 GB | 360 GB/s | Workstation |
| RTX 6000 Ada | 48 GB | 960 GB/s | Workstation |
| A100 80 GB | 80 GB | 2,000 GB/s | Data centre |
| H100 SXM | 80 GB | 3,350 GB/s | Data centre |
| H200 | 141 GB | 4,800 GB/s | Data centre |
| M2 Ultra (unified) | 192 GB | 800 GB/s | Apple Silicon |
What Happens When the Model Spills to System RAM
If the model doesn't fit in VRAM, the overflow weights are offloaded to system RAM and streamed over the PCIe bus during inference. PCIe 4.0 x16 bandwidth is ~32 GB/s, versus ~2,000 GB/s for an A100's HBM, so throughput typically drops by 50–100× in tokens/s. For production, the model must fit entirely in VRAM. For development and testing, partial CPU offload (llama.cpp's --n-gpu-layers flag) is acceptable.
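For that development case, a minimal sketch using the llama-cpp-python bindings; the model path, layer count, and context size below are placeholder assumptions:

```python
from llama_cpp import Llama

# Partial offload: the first 40 transformer layers go to VRAM,
# the remainder stays in system RAM and runs on the CPU.
# Fine for development; expect a large tokens/s penalty versus full offload.
llm = Llama(
    model_path="models/llama-70b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=40,   # -1 offloads all layers (model must then fit in VRAM)
    n_ctx=4096,        # context window to reserve KV cache for
)

out = llm("Explain VRAM vs system RAM in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```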
Multi-GPU Strategies
- Tensor Parallelism: Each GPU holds a shard of every layer, so all GPUs cooperate on every token. Lowest latency per request, but it requires NVLink or another high-bandwidth interconnect (see the sketch after this list).
- Pipeline Parallelism: Each GPU holds a contiguous block of complete layers, and requests flow through the GPUs in stages. Less interconnect-sensitive; throughput is best when many concurrent requests keep every stage busy.
- NVLink vs PCIe: NVLink offers 600 GB/s (A100) to 900 GB/s (H100) of GPU-to-GPU bandwidth vs ~32 GB/s for PCIe 4.0 x16. Critical for tensor parallelism at 70B+ scale.
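As a concrete example of tensor parallelism in serving code, a minimal sketch using vLLM's offline API; the model id and GPU count are placeholder assumptions:

```python
from vllm import LLM, SamplingParams

# Tensor parallelism: each of the 4 GPUs holds a shard of every layer.
# e.g. a 70B model at FP16 (~140 GB of weights) needs at least
# 2 x 80 GB or 4 x 48 GB cards before KV cache is even counted.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=4,                     # number of GPUs to shard across
)

params = SamplingParams(max_tokens=64, temperature=0.2)
outputs = llm.generate(["Summarise why VRAM sizing matters."], params)
print(outputs[0].outputs[0].text)
```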