VRAM

Hardware

Video RAM — the GPU's onboard memory. The single most critical hardware resource for on-premise LLM inference. All model weights and KV cache must fit inside VRAM for full GPU speed.

VRAM is to LLMs what RAM is to databases, but roughly 10–50× faster and far more constrained in size. Running a 70B model at full FP16 precision requires about 140 GB of VRAM; quantized to Q4_K_M, it drops to roughly 42 GB. Hardware sizing for on-premise LLM deployment is almost entirely a VRAM planning exercise.
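As a rule of thumb, weight memory is simply parameter count times bits per weight. The short Python sketch below is illustrative only; the effective bits-per-weight figures are assumptions based on common llama.cpp GGUF averages, and exact sizes vary by quantization recipe.

```python
# Illustrative weight-memory estimate: parameters x bits-per-weight / 8.
# Bits-per-weight values are assumptions (typical GGUF averages), not exact.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q3_K_M": 3.9,
}

def weight_vram_gb(params_billion: float, quant: str) -> float:
    """Approximate GB needed for the model weights alone."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: {weight_vram_gb(70, quant):.0f} GB")
# e.g. FP16 ~140 GB and Q4_K_M ~42 GB for a 70B model.
```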

VRAM Requirements by Model Size

| Model size | FP16 | Q8_0 | Q4_K_M | Q3_K_M |
|---|---|---|---|---|
| 1B | 2 GB | 1 GB | 0.7 GB | 0.5 GB |
| 3.8B (Phi-3 Mini) | 7.6 GB | 3.8 GB | 2.5 GB | 1.8 GB |
| 7B | 14 GB | 7 GB | 4.5 GB | 3.3 GB |
| 13B | 26 GB | 13 GB | 8 GB | 6 GB |
| 32B | 64 GB | 32 GB | 20 GB | 14 GB |
| 70B | 140 GB | 70 GB | 42 GB | 30 GB |
| 405B | 810 GB | 405 GB | 245 GB | 180 GB |

Add 20–30% for the KV cache and activations in production serving.
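The KV cache itself can be estimated from the model's attention configuration. A minimal sketch, assuming a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128 are assumed values; read them from the real model config), shows why long contexts and large batches consume that headroom quickly:

```python
# Minimal KV-cache size sketch. Layer/head/dim values are assumptions
# typical of a 70B-class model with grouped-query attention.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> float:
    """GB for the K and V tensors across every layer, token and sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size / 1e9

# One 8K-token sequence: ~2.7 GB. Thirty-two concurrent 8K sequences: ~86 GB.
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=1))
print(kv_cache_gb(80, 8, 128, seq_len=8192, batch_size=32))
```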

GPU VRAM Reference

| GPU | VRAM | Bandwidth | Class |
|---|---|---|---|
| RTX 3090 / 4090 | 24 GB | 936 / 1008 GB/s | Consumer |
| RTX 4000 Ada | 20 GB | 432 GB/s | Workstation |
| A6000 Ada | 48 GB | 864 GB/s | Workstation |
| A100 80 GB | 80 GB | 2,000 GB/s | Data centre |
| H100 SXM | 80 GB | 3,350 GB/s | Data centre |
| H200 | 141 GB | 4,800 GB/s | Data centre |
| M2 Ultra (unified) | 192 GB | 800 GB/s | Apple Silicon |

What Happens When the Model Spills to System RAM

If the model doesn't fit in VRAM, weights are offloaded to system RAM and transferred over the PCIe bus during inference. PCIe 4.0 x16 bandwidth is ~32 GB/s, versus ~2,000 GB/s for an A100's HBM, so token generation becomes 50–100× slower. For production, the model must fit in VRAM; for development and testing, partial CPU offload (e.g. llama.cpp's --n-gpu-layers) is acceptable.
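A back-of-envelope way to see the penalty: batch-1 decoding is memory-bandwidth bound, because every generated token must stream all weights once. The sketch below uses illustrative numbers from the tables above to compare HBM against PCIe feeding the same quantized 70B model:

```python
# Back-of-envelope decode speed for batch size 1: each generated token
# streams all weights once, so tokens/s is capped by bandwidth / model size.
# Figures are illustrative, taken from the tables above.

def max_tokens_per_s(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode tokens/s when memory-bandwidth bound."""
    return bandwidth_gb_per_s / weight_gb

model_gb = 42  # 70B at Q4_K_M

print(max_tokens_per_s(model_gb, 2000))  # ~48 tok/s from A100 HBM
print(max_tokens_per_s(model_gb, 32))    # ~0.8 tok/s streamed over PCIe 4.0 x16
```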

Multi-GPU Strategies

  • Tensor Parallelism: each GPU holds a shard of every layer. Lowest latency for individual requests, but requires NVLink or another high-bandwidth interconnect (a minimal launch sketch follows this list).
  • Pipeline Parallelism: Each GPU holds a set of complete layers. Better throughput for many short requests.
  • NVLink vs PCIe: NVLink offers 600 GB/s GPU-GPU bandwidth vs ~32 GB/s for PCIe. Critical for tensor parallelism at 70B+ scale.
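For reference, a minimal tensor-parallel launch with vLLM, one possible serving stack; the model id and GPU count below are assumptions for illustration:

```python
# Minimal tensor-parallel launch sketch with vLLM. Assumes 4 visible GPUs
# with a fast interconnect and access to the (illustrative) model id below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model id
    tensor_parallel_size=4,  # each GPU holds a shard of every layer
)

outputs = llm.generate(
    ["Summarise what the KV cache stores."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```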