Local LLM Setup
Running a large language model on your own hardware eliminates API costs, protects data privacy, and gives you full control over model selection and inference parameters. This is AI-Radar's flagship guide to local inference.
Hardware Requirements
| Model Size | Min VRAM (quantized) | Recommended GPU | Models |
|---|---|---|---|
| 1–3B | 2–4 GB | GTX 1660 / integrated | Phi-3 Mini, Qwen2 1.5B |
| 7B | 4–6 GB (Q4_K_M) | RTX 3060 12GB / RX 6600 XT | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B |
| 13B | 8–10 GB (Q4_K_M) | RTX 3080 10GB / RTX 4070 | Llama 2 13B, CodeLlama 13B |
| 30–34B | 16–20 GB (Q4_K_M) | RTX 3090 / RTX 4090 | Yi-34B, CodeLlama 34B |
| 70B | 40 GB (Q4_K_M) / 2×24GB | 2× RTX 3090 / A100 40GB | Llama 3.1 70B, Qwen2.5 72B |
For CPU-only inference: expect 5–15× slower than GPU, but feasible for 7B–13B models on modern CPUs with 32+ GB RAM. Apple Silicon (M2/M3/M4) is the best consumer CPU option — unified memory allows 64+ GB effective VRAM at high bandwidth.
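The table's VRAM figures can be approximated with simple arithmetic: Q4_K_M stores roughly 4.5 bits per parameter, and the KV cache and activations add overhead on top of the weights. A rough sketch (the 4.5-bit figure and the 20% overhead factor are ballpark assumptions, not measured values):

```python
def estimate_vram_gb(params_billion: float, bits_per_param: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weights plus ~20% for
    KV cache and activations (both figures are approximations)."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return round(weight_bytes * overhead / 1e9, 1)

# An 8B model at Q4_K_M lands near the table's 4-6 GB row:
print(estimate_vram_gb(8))  # ~5.4 GB
```

This is only a sizing heuristic; actual usage varies with context length and runtime.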
Inference Runtimes
Ollama
The simplest way to run LLMs locally. One-command install, model library, OpenAI-compatible API. Best for development and single-user deployments.
ollama run llama3.2:3b
llama.cpp
High-performance inference, best GGUF support, multi-GPU tensor parallelism, CPU fallback. Ideal for production server-mode deployments.
./llama-server -m model.gguf -ngl 35   # -ngl: number of layers offloaded to the GPU
LM Studio
Desktop GUI for exploring and running models. Best for non-technical users and rapid model evaluation. Includes local API server mode.
GUI: lmstudio.ai
vLLM
Continuous batching, PagedAttention — highest throughput for multi-user server deployments. OpenAI-compatible. Requires NVIDIA GPU, best on A100/H100.
vllm serve llama3-8b-instruct
Quantization
Quantization reduces model precision (from 16-bit to 4–8 bit) to fit larger models in less VRAM with minimal quality loss. GGUF quantization formats for llama.cpp / Ollama:
| Format | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~28% | Very Low | Best all-round (recommended) |
| Q5_K_M | ~35% | Minimal | When you have RAM to spare |
| Q3_K_M | ~22% | Medium | Very constrained VRAM |
| Q2_K | ~16% | High | Experimentation only |
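The "Size vs FP16" column converts directly to file size: an FP16 model is 2 bytes per parameter, so multiply by the table's ratio. A small sketch using those approximate ratios:

```python
# Approximate size ratios vs. FP16, from the table above
QUANT_RATIOS = {"Q4_K_M": 0.28, "Q5_K_M": 0.35, "Q3_K_M": 0.22, "Q2_K": 0.16}

def quantized_size_gb(params_billion: float, fmt: str) -> float:
    fp16_gb = params_billion * 2  # FP16 = 2 bytes per parameter
    return round(fp16_gb * QUANT_RATIOS[fmt], 1)

# An 8B model: ~16 GB at FP16, ~4.5 GB at Q4_K_M
print(quantized_size_gb(8, "Q4_K_M"))  # 4.5
```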
Step-by-Step Setup (Ollama)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:3b
ollama run llama3.2:3b
ollama serve # default: http://localhost:11434
curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hello"}'
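With `ollama serve` running, the API is plain JSON over HTTP, so any language's standard library is enough. A minimal Python sketch of the same call as the `curl` example above (`stream: false` is added here so the server returns one JSON object instead of a line-delimited stream):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.2:3b", "Hello")  # requires a running Ollama server
```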
Production Deployment
For production, Ollama or llama.cpp run inside Docker alongside your application. Key production considerations:
- Docker Compose: Run Ollama, your API (FastAPI), and database in the same Compose stack on a shared network
- GPU passthrough: Add `deploy.resources.reservations.devices` with NVIDIA capability `"gpu"` in Docker Compose
- Context window: Set `OLLAMA_NUM_CTX=8192` for longer conversations; more context = more VRAM
- Concurrency: Ollama handles one request at a time by default; use vLLM for multi-user production workloads
- Rate limiting: Protect your endpoint — a single 70B model inference can saturate a GPU for 30+ seconds
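The GPU-passthrough and context-window items above combine into a Compose file along these lines (a sketch: the service names and the application image are placeholders, the environment variable name is as given above, and the `deploy` block assumes the NVIDIA Container Toolkit is installed on the host):

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_NUM_CTX=8192        # larger context window; costs more VRAM
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]  # requires NVIDIA Container Toolkit
  api:
    image: my-fastapi-app            # placeholder for your application image
    depends_on:
      - ollama
```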
Model Selection Guide
- General Purpose: Llama 3.1/3.2, Mistral 7B, Qwen2.5 (best quality/size ratio)
- Code Generation: CodeLlama 13B/34B, DeepSeek Coder, Qwen2.5-Coder (fine-tuned on code datasets)
- Reasoning: DeepSeek-R1, Qwen3 (thinking), Llama 3.3 70B (extended chain-of-thought)
- Embedding / RAG: nomic-embed-text, all-MiniLM-L6-v2, mxbai-embed-large (for vector search, not generation)