FP16 / FP32

Hardware

Floating point precision formats that determine how much memory each model weight occupies. FP32 is the standard for training, FP16 for GPU inference; halving the bit width halves the VRAM needed for weights.

Floating point precision defines how many bits are used to represent each model weight. Lower precision means a smaller VRAM footprint and fewer bytes moved per weight, so the available memory bandwidth goes further, at a small cost in numerical accuracy.
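The bytes-per-weight figures behind each format are easy to check directly; here is a minimal sketch, assuming PyTorch is installed (the dtype names are PyTorch's own):

```python
import torch

# Bytes occupied by a single weight in each precision format.
# element_size() reports bytes per element for a tensor of that dtype.
for dtype in (torch.float64, torch.float32, torch.bfloat16, torch.float16, torch.int8):
    bytes_per_weight = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {bytes_per_weight} bytes ({bytes_per_weight * 8} bits) per weight")
```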

Precision Formats

Format      Bits   1B params   7B params   70B params
FP64        64     8 GB        56 GB       560 GB
FP32        32     4 GB        28 GB       280 GB
BF16        16     2 GB        14 GB       140 GB
FP16        16     2 GB        14 GB       140 GB
INT8 (Q8)   8      1 GB        7 GB        70 GB
INT4 (Q4)   4      0.5 GB      3.5 GB      35 GB

Rule of Thumb

A rough formula for VRAM requirements (inference only, not including KV cache): VRAM (GB) ≈ (parameters in billions) × (bits per weight) / 8. Add 20–30% for activations and KV cache overhead.
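The rule of thumb is straightforward to encode as a quick estimator; a minimal sketch in Python (the function name and the 25% default overhead are illustrative, not from any library):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 0.25) -> float:
    """Rough inference VRAM estimate: weight storage plus a flat allowance
    (20-30% is typical) for activations and KV cache."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead)

# Example: a 7B model in FP16 -> 14 GB of weights, ~17.5 GB with 25% overhead.
print(estimate_vram_gb(7, 16))   # 17.5
# The same model quantized to INT4 -> 3.5 GB of weights, ~4.4 GB total.
print(estimate_vram_gb(7, 4))    # 4.375
```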

When to Use Each

  • FP32: Training on CPU, quantization calibration steps
  • BF16: Training and inference on A100/H100 — numerically stable, same size as FP16 (see the dtype-selection sketch after this list)
  • FP16: Inference on consumer and server GPUs (RTX 20xx–40xx, V100)
  • INT8/INT4: See Quantization — use GGUF or GPTQ formats
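
To illustrate the FP16/BF16 choice in practice, a hedged sketch assuming PyTorch and Hugging Face transformers are installed (the model ID is a placeholder, not a real checkpoint; check your library versions for the exact loading arguments):

```python
import torch
from transformers import AutoModelForCausalLM

# Prefer BF16 where the GPU supports it (e.g. A100/H100);
# fall back to FP16 on older consumer/server GPUs (RTX 20xx-40xx, V100).
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

# "my-org/my-7b-model" is a placeholder model ID.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-7b-model",
    torch_dtype=dtype,
)
```

Since FP16 and BF16 both occupy 2 bytes per weight, this choice affects numerical stability rather than VRAM usage.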