Floating point precision defines how many bits are used to represent each model weight. Lower precision = smaller VRAM footprint = less memory traffic per token — with a small trade-off in numerical accuracy.
## Precision Formats
| Format | Bits | 1B params | 7B params | 70B params |
|---|---|---|---|---|
| FP64 | 64 | 8 GB | 56 GB | 560 GB |
| FP32 | 32 | 4 GB | 28 GB | 280 GB |
| BF16 | 16 | 2 GB | 14 GB | 140 GB |
| FP16 | 16 | 2 GB | 14 GB | 140 GB |
| INT8 (Q8) | 8 | 1 GB | 7 GB | 70 GB |
| INT4 (Q4) | 4 | 0.5 GB | 3.5 GB | 35 GB |
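The table is just bytes per weight multiplied by parameter count. As a sanity check against a real model, here is a minimal PyTorch sketch (the small `nn.Sequential` stack is a hypothetical stand-in; pass your own model in practice) that measures the weight footprint at a given dtype:

```python
import torch
import torch.nn as nn

def weight_memory_gb(model: nn.Module) -> float:
    """Total parameter storage in GB (weights only; no activations or KV cache)."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / 1e9

# Hypothetical stand-in model for illustration.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(4)])

print(weight_memory_gb(model))                      # FP32 (default): 4 bytes per weight
print(weight_memory_gb(model.to(torch.bfloat16)))   # BF16: half the FP32 footprint
```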
## Rule of Thumb
A rough formula for VRAM requirements (inference only, not including KV cache): VRAM (GB) ≈ (parameters in billions) × (bits per weight) / 8. Add 20–30% for activations and KV cache overhead.
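A small helper encoding that rule, with the overhead as a knob (the 1.25 default is the midpoint of the 20–30% range; this is a rough estimate, not a substitute for measuring):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int, overhead: float = 1.25) -> float:
    """Rule-of-thumb inference VRAM: weights plus a flat factor for activations and KV cache."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * overhead

# Example: a 7B model quantized to 4 bits
print(estimate_vram_gb(7, 4))   # 3.5 GB of weights -> ~4.4 GB with overhead
```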
## When to Use Each
- FP32: Training on CPU, quantization calibration steps
- BF16: Training and inference on A100/H100 — numerically stable (keeps FP32's exponent range), same memory footprint as FP16
- FP16: Inference on consumer and server GPUs (RTX 20xx–40xx, V100)
- INT8/INT4: See Quantization — use GGUF or GPTQ formats (a loading sketch follows this list)
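A minimal loading sketch with Hugging Face transformers showing where the precision choice actually enters. The model name is a placeholder, and the 4-bit variant uses bitsandbytes simply because it loads in one call; GGUF and GPTQ checkpoints go through their own loaders instead:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM checkpoint works

# BF16 on A100/H100 and other Ampere+ GPUs; use torch.float16 on V100 / RTX-class
# consumer GPUs as noted above. device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 4-bit quantized load (requires bitsandbytes) — roughly the INT4 row in the table above.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
```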