FREE RESOURCE
The Local LLM Hardware Cheat-Sheet
One page with everything you need to size hardware for local LLMs: the VRAM formula, GPU tiers, model-to-VRAM table, quantization and cost rules of thumb. Enter your email and it appears below — plus you'll get our analysis when there's real signal.
✓ Here it is — bookmark or print this page.
1. The VRAM formula
VRAM (GB) ≈ params(B) × bytes/weight × 1.15
bytes/weight: 0.5 (4-bit) · 1 (8-bit) · 2 (FP16). The ×1.15 covers the KV cache; long contexts need more.
2. Model → VRAM (4-bit)
| Model | VRAM (4-bit) | Card |
| 7B | ~4GB | 8GB |
| 13B | ~8GB | 12–16GB |
| 34B | ~20GB | 24GB (3090/4090) |
| 70B | ~40–48GB | 48GB (A6000) / 2×24GB |
| 70B (8-bit) | ~70GB | 80GB (A100/H100) |
3. GPU tiers
- 24GB (RTX 3090/4090) — sweet spot: 7B–34B. Used 3090 = best value/GB.
- 48GB (A6000 / 2×24GB) — 70B at 4-bit.
- 80GB (A100/H100) — 70B+ and training; often cheaper to rent.
4. Quantization
4-bit = the default (≈4× smaller, small quality cost). 8-bit ≈ near-lossless. GGUF for CPU/Mac (Ollama/LM Studio); AWQ/GPTQ for GPU serving (vLLM/TGI).
5. Cost rule of thumb
Local beats cloud only above ~30–40% sustained GPU utilization. Below that, rent per-second. Compare €/million-tokens (amortized hardware + power + maintenance ÷ tokens), not GPU price vs hourly rate.
6. Software
LM Studio (GUI, no-code) · Ollama (dev, one command, local API) · vLLM/TGI (production, high concurrency). All speak an OpenAI-compatible API.
Want the detail behind each line? See the full guides →