BF16 (Brain Float 16)


A 16-bit floating-point format developed by Google Brain with the same exponent range as FP32, making it more numerically stable for training than FP16.

BF16 (Brain Float 16) keeps the 8-bit exponent of FP32 but truncates the mantissa from 23 bits to 7 (plus one sign bit, for 16 bits total). This preserves the full dynamic range of FP32, avoiding the overflow/underflow issues of FP16, at half the memory footprint of FP32.
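A minimal sketch of the layout relationship, using nothing beyond the standard library: an FP32 value becomes BF16 by keeping only its top 16 bits (real hardware rounds to nearest-even; plain truncation keeps the example short).

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """BF16 is the top 16 bits of FP32: 1 sign + 8 exponent + 7 mantissa.
    Hardware rounds to nearest-even; truncation is used here for brevity."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16  # drop the low 16 mantissa bits

def bf16_bits_to_fp32(bits: int) -> float:
    """Expand BF16 back to FP32 by zero-padding the mantissa."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

x = 3.14159
bf16 = fp32_to_bf16_bits(x)
print(f"{bf16:016b}")           # 0100000001001001: sign | exponent | mantissa
print(bf16_bits_to_fp32(bf16))  # 3.140625: same range, only ~2-3 decimal digits
```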

Comparison: FP32 vs FP16 vs BF16

| Format | Bits | Exponent bits | Mantissa bits | Dynamic range | Use case |
|--------|------|---------------|---------------|---------------|----------|
| FP32 | 32 | 8 | 23 | ~1.2×10⁻³⁸ – 3.4×10³⁸ | Training (CPU/GPU) |
| FP16 | 16 | 5 | 10 | ~6×10⁻⁸ – 6.5×10⁴ | GPU inference |
| BF16 | 16 | 8 | 7 | Same as FP32 | Training + inference on A100/H100 |
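If PyTorch is available, the same limits can be read programmatically, which is a quick sanity check on the table above:

```python
import torch

# Numeric limits per format; BF16's range matches FP32, FP16's is far narrower.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15} bits={info.bits:2} "
          f"smallest normal={info.tiny:.2e} max={info.max:.2e} eps={info.eps:.2e}")
```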

Hardware Support

BF16 is natively accelerated on NVIDIA A100, H100, H200, all Google TPUs, and AMD MI300X. Consumer cards (RTX 30xx, 40xx) support BF16 compute but at lower throughput than FP16 — check your card's specs before assuming BF16 is faster.
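Assuming a PyTorch CUDA setup, a quick runtime check saves guessing from spec sheets:

```python
import torch

# torch.cuda.is_bf16_supported() reports whether the current CUDA device
# can run BF16 kernels; compute capability >= 8.0 (Ampere and newer) also
# implies native tensor-core acceleration rather than emulation.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name()
    major, minor = torch.cuda.get_device_capability()
    print(f"{name}: compute capability {major}.{minor}, "
          f"BF16 supported: {torch.cuda.is_bf16_supported()}")
```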

Why It Matters for On-Premise

If you own enterprise-grade hardware (A100, H100), loading models in BF16 gives you near-FP32 quality without the doubled memory cost. For most consumer-grade on-premise setups, FP16 is the better default, since RTX 40xx cards are highly optimised for it. Running a 7B model at BF16 requires ~14 GB of VRAM for the weights alone (2 bytes per parameter × 7B parameters), the same as FP16, but the numerical stability improvements make BF16 the preferred format for fine-tuning runs.
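As an illustrative sketch using Hugging Face transformers (the model ID below is a placeholder; any causal LM loads the same way), picking BF16 comes down to one argument:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder 7B model

# torch_dtype controls the precision the weights are loaded in; bfloat16
# halves memory vs float32 while keeping the same exponent range.
# device_map="auto" (requires the accelerate package) places layers on GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```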