FP16 / FP32

Hardware

Floating point precision formats that determine how much memory each model weight occupies. FP32 is the standard for training, FP16 for GPU inference; halving the bit width halves the VRAM needed for weights.

Floating point precision defines how many bits are used to represent each model weight. Lower precision means a smaller VRAM footprint and fewer bytes moved per weight, so the available memory bandwidth goes further, at a small cost in numerical accuracy.
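The bytes-per-weight figures behind each format are easy to check directly; here is a minimal sketch, assuming PyTorch is installed (the dtype names are PyTorch's own):

```python
import torch

# Bytes occupied by a single weight in each precision format.
# element_size() reports bytes per element for a tensor of that dtype.
for dtype in (torch.float64, torch.float32, torch.bfloat16, torch.float16, torch.int8):
    bytes_per_weight = torch.tensor([], dtype=dtype).element_size()
    print(f"{dtype}: {bytes_per_weight} bytes ({bytes_per_weight * 8} bits) per weight")
```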

Precision Formats

Format      Bits   1B params   7B params   70B params
FP64        64     8 GB        56 GB       560 GB
FP32        32     4 GB        28 GB       280 GB
BF16        16     2 GB        14 GB       140 GB
FP16        16     2 GB        14 GB       140 GB
INT8 (Q8)   8      1 GB        7 GB        70 GB
INT4 (Q4)   4      0.5 GB      3.5 GB      35 GB

Rule of Thumb

A rough formula for VRAM requirements (inference only, not including KV cache): VRAM (GB) ≈ (parameters in billions) × (bits per weight) / 8. Add 20–30% for activations and KV cache overhead.
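The rule of thumb is straightforward to encode as a quick estimator; a minimal sketch in Python (the function name and the 25% default overhead are illustrative, not from any library):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: int, overhead: float = 0.25) -> float:
    """Rough inference VRAM estimate: weight storage plus a flat allowance
    (20-30% is typical) for activations and KV cache."""
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * (1 + overhead)

# Example: a 7B model in FP16 -> 14 GB of weights, ~17.5 GB with 25% overhead.
print(estimate_vram_gb(7, 16))   # 17.5
# The same model quantized to INT4 -> 3.5 GB of weights, ~4.4 GB total.
print(estimate_vram_gb(7, 4))    # 4.375
```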

When to Use Each

  • FP32: Training on CPU, quantization calibration steps
  • BF16: Training and inference on A100/H100 — numerically stable, same size as FP16 (see the dtype-selection sketch after this list)
  • FP16: Inference on consumer and server GPUs (RTX 20xx–40xx, V100)
  • INT8/INT4: See Quantization — use GGUF or GPTQ formats
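
To illustrate the FP16/BF16 choice in practice, a hedged sketch assuming PyTorch and Hugging Face transformers are installed (the model ID is a placeholder, not a real checkpoint; check your library versions for the exact loading arguments):

```python
import torch
from transformers import AutoModelForCausalLM

# Prefer BF16 where the GPU supports it (e.g. A100/H100);
# fall back to FP16 on older consumer/server GPUs (RTX 20xx-40xx, V100).
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float16

# "my-org/my-7b-model" is a placeholder model ID.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-7b-model",
    torch_dtype=dtype,
)
```

Since FP16 and BF16 both occupy 2 bytes per weight, this choice affects numerical stability rather than VRAM usage.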