How much VRAM do I need for local LLMs?

Rule of thumb: VRAM (GB) must exceed parameters in billions times bytes-per-weight (about 0.5-0.6 for 4-bit), plus ~15% for context. A 4-bit 13B needs ~8-10GB; a 4-bit 70B needs ~40-48GB.

Is a used RTX 3090 still worth it in 2026?

For local LLMs, yes — its 24GB VRAM and 936 GB/s bandwidth are what matter, and it remains the best price-per-GB for individuals, often at a large discount to a 4090 while running the same 4-bit models at similar chat speeds.

Can two GPUs combine their VRAM?

Effectively yes for inference: frameworks shard a model across GPUs, so 2x24GB can hold a model needing ~48GB. NVLink helps bandwidth but is not required for many inference setups.

Does memory bandwidth matter more than compute?

For single-stream LLM inference, yes — token generation is largely memory-bandwidth bound. Compute matters more for prompt processing, training and high-batch serving. This is why a 3090 feels close to a 4090 for chat.

Are AMD GPUs viable for local LLMs?

For inference via llama.cpp (Vulkan/ROCm), yes — high-VRAM Radeon cards are usable and cheap per GB. The gap remains in the broader tooling: CUDA-first projects, fine-tuning stacks and production servers still favor NVIDIA.

Is a Mac a good alternative to a GPU for local LLMs?

A Mac with lots of unified memory runs very large models a discrete GPU cannot fit, at lower power — but generates tokens slower than a high-end GPU and is weak for fine-tuning. Great for quiet desktop use of big models; not for serving.

Best GPUs for Running Local LLMs (2026): Full Buyer Guide

The two numbers that decide everything

Local LLM performance is gated almost entirely by VRAM: if the model plus its context fits in GPU memory, it runs fast; if it doesn't, layers spill to system RAM and speed collapses by an order of magnitude. Choosing a GPU is really choosing a VRAM tier.

Within a tier, the differentiator is memory bandwidth — and this is the spec almost every buyer ignores. Generating a token requires streaming essentially the whole model through the GPU once, so single-user tokens/sec ≈ bandwidth ÷ model size. A 3090 (936 GB/s) generating from a 20 GB model tops out around 40–45 tokens/sec; a card with 288 GB/s manages ~13 with the same model. This has an uncomfortable consequence: some "budget AI" cards with generous VRAM but narrow buses fit big models and then run them at unusable speeds. Always check both numbers.

Compute (TFLOPS, tensor cores) matters third: it governs prompt processing (how fast long inputs are ingested — noticeable in RAG with big contexts), fine-tuning, and high-batch serving. For chat-style use, bandwidth dominates.

VRAM tiers — what each actually runs

VRAM	Example cards	Bandwidth	Runs (4-bit)
8–12GB	RTX 3060 12GB, 4060, A2000	~270–360 GB/s	up to 7–8B; tight context
16GB	RTX 4060 Ti 16GB, 4070 Ti S, A4000	~288–672 GB/s	up to ~13–14B; watch the bus on the 4060 Ti
24GB	RTX 3090 / 4090, A5000	~770–1010 GB/s	7B–34B comfortably
32GB	RTX 5090	~1790 GB/s	up to ~40B; fastest consumer inference
48GB	RTX A6000 / 6000 Ada, 2×3090/4090	~770–960 GB/s	up to ~70B (4-bit)
80–96GB	A100 / H100 / RTX 6000 Blackwell	~2000–3350 GB/s	70B+ at higher quants, training

The sizing formula

VRAM (GB) ≈ params(B) × bytes/weight × 1.15

Bytes per weight ≈ 0.5–0.6 (4-bit), 1 (8-bit), 2 (FP16). The ×1.15 covers the KV-cache at moderate context — long contexts (32k+) can need far more, and quantizing the cache to Q8 claws much of it back (see the quantization guide). Example: a 4-bit 34B ≈ 34 × 0.55 × 1.15 ≈ 21GB, so it fits a 24GB card with context to spare. A 4-bit 70B ≈ 40GB weights — hence the 48GB tier.

The used market: where the value is

The single best deal in local AI remains the used RTX 3090: 24GB of 936 GB/s VRAM at a fraction of new-24GB prices. Because chat inference is bandwidth-bound, it delivers 70–85% of a 4090's token speed on the same models. What to check when buying used:

Mining history — not automatically disqualifying (undervolted mining cards often lived gentle lives), but price it in. Ask for stress-test screenshots (e.g. a FurMark/occt run showing temps).
VRAM thermals — the 3090's rear memory chips run hot; many used cards benefit from a €20 thermal-pad refresh. Memory junction temps above ~100°C under load mean the pads are done.
Warranty reality — assume none. The discount vs a new card is your insurance budget.
3090 Ti / used 4090 — the Ti fixes the memory-thermal design; used 4090s rarely fall to compelling prices because of AI demand. If the gap to a used 3090 is large, take the 3090.

Ex-datacenter/workstation cards (A5000 24GB, A6000 48GB) appear on the used market too: blower coolers (louder but great for stacking), lower power, 2-slot width — often the smartest path to 48GB in a workstation.

NVIDIA, AMD, Intel — the honest state of play

NVIDIA is the default for a reason: every tool works, day one, no caveats — CUDA remains the moat. If your time has value or you want fine-tuning, image models, and serving stacks to just work, buy NVIDIA.

AMD is genuinely usable for inference: llama.cpp runs well via Vulkan or ROCm, and high-VRAM Radeons undercut NVIDIA per GB. The friction appears beyond basic inference — ROCm officially supports a narrow card list, fine-tuning stacks are CUDA-first, and many projects' AMD paths are community-maintained. Good for a budget dedicated inference box; frustrating as an experimentation platform.

Intel Arc offers aggressive VRAM pricing and improving llama.cpp support (SYCL/Vulkan), but the ecosystem is younger still. Viable for tinkerers; not yet a set-and-forget choice.

The Apple Silicon alternative

A Mac with unified memory plays a different game: an M-series Max/Ultra with 64–192GB can hold models no consumer GPU can — a 70B at Q5, even 100B+ MoE models — silently, at a fraction of the power draw. The trade-offs: memory bandwidth (~400–800 GB/s) is below high-end discrete GPUs, so big-model token speeds are usable-but-not-fast; prompt processing on very long contexts is notably slower than CUDA cards; and fine-tuning support (MLX aside) is limited.

Choose a Mac if you want the largest possible models on a quiet desktop and you already live in that ecosystem. Choose a GPU box for speed, serving, fine-tuning, or price at the 24GB tier. Many practitioners end up with both: Mac for daily driving big models, GPU rig for heavy lifting.

Multi-GPU: combining VRAM

For inference, frameworks shard a model across GPUs, so two 24GB cards hold a ~48GB model. Know the two modes: layer split (llama.cpp default — GPUs take turns, VRAM adds up but speed roughly equals one card) and tensor parallel (vLLM/ExLlama — GPUs work simultaneously; adds real speed but wants matched cards and fast interconnect). NVLink (3090s) helps tensor parallel and training; plain PCIe is fine for layer-split inference.

Budget the platform, not just the cards: two GPUs want ≥8 PCIe lanes each (HEDT/server boards do this properly; consumer boards often drop to x8/x4), a ~1200W PSU for dual high-end cards, case airflow that doesn't cook the top card, and ideally 2-slot blower or hybrid coolers. A dual-3090 build is the classic budget 70B rig; an RTX A6000-class single card does the same job with none of the complexity, at a price.

Power: the invisible line item

A 350–450W GPU that runs often is a recurring cost — at European electricity prices, a heavily-used dual-GPU rig can cost hundreds of €/year. Two mitigations: power-limit or undervolt (a 3090 capped at ~250–280W typically loses only ~5–10% inference speed — set it and forget it with nvidia-smi -pl), and match the card to the duty cycle — if the box sits idle most of the day, a smaller card or renting bursts of cloud GPU may beat owning big iron (see the cost guide and RunPod-vs-Vast comparison below).

Best picks by buyer profile

First local LLM box, tight budget — used RTX 3090 24GB. Nothing else comes close per euro.
Enthusiast / developer daily driver — RTX 4090, or 5090 (32GB) if budget allows: the 5090's ~1.8 TB/s bandwidth is the fastest consumer inference available.
Small team needing 70B on-prem — RTX A6000/6000 Ada 48GB (single-slot simplicity, pro drivers), or dual used 3090s for half the price and more assembly.
Silent desktop, biggest possible models — Mac Studio with maxed unified memory.
Budget pure-inference server — high-VRAM AMD via llama.cpp/Vulkan, eyes open about tooling.
Production / training — A100/H100/Blackwell class — but rent-by-the-hour usually beats buying until utilization is consistently high.

Common buyer mistakes

Buying VRAM on a narrow bus. 16GB on a 128-bit bus fits a 13B and then crawls. Check GB/s, not just GB.
Buying for the model you'll "eventually" run. An 80GB card to occasionally chat with a 7B is money on fire; rent the big runs instead.
Ignoring the rest of the box. The classic failure: a 4090 ordered, then the PSU, case clearance (3.5+ slots!) and airflow bill arrives.
Assuming two cards = double speed. Layer-split doubles capacity, not speed. Tensor parallel adds speed but wants matched cards.
Dismissing system RAM. 64GB of DDR lets llama.cpp offload overflow layers gracefully and cache models between runs — cheap insurance.