Local LLM Setup

Running a large language model (LLM) on your own hardware eliminates API costs, keeps your data private, and gives you full control over model selection and inference parameters. This is AI-Radar's flagship guide and the site's most differentiating topic, built on first-hand infrastructure experience.

Hardware Requirements

| Model Size | Min VRAM (quantized) | Recommended GPU | Models |
|---|---|---|---|
| 1–3B | 2–4 GB | GTX 1660 / integrated | Phi-3 Mini, Qwen2 1.5B |
| 7B | 4–6 GB (Q4_K_M) | RTX 3060 12GB / RX 6600 XT | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B |
| 13B | 8–10 GB (Q4_K_M) | RTX 3080 10GB / RTX 4070 | Llama 2 13B, CodeLlama 13B |
| 30–34B | 16–20 GB (Q4_K_M) | RTX 3090 / RTX 4090 | Yi-34B, CodeLlama 34B |
| 70B | 40 GB (Q4_K_M) / 2×24GB | 2× RTX 3090 / A100 40GB | Llama 3.1 70B, Qwen2.5 72B |

For CPU-only inference: expect 5–15× slower than GPU, but feasible for 7B–13B models on modern CPUs with 32+ GB RAM. Apple Silicon (M2/M3/M4) is the best consumer CPU option — unified memory allows 64+ GB effective VRAM at high bandwidth.
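The VRAM figures in the table follow roughly from parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A back-of-the-envelope estimator (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight size plus a fractional overhead
    for KV cache and runtime buffers (the overhead factor is a guess)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB

# An 8B model at ~4.5 bits/weight (roughly Q4_K_M) lands near the
# table's 7B row; a 70B model at the same precision needs ~47 GB.
print(round(estimate_vram_gb(8, 4.5), 1))
print(round(estimate_vram_gb(70, 4.5), 1))
```

Longer context windows and larger batch sizes grow the KV cache, so treat the overhead factor as a floor rather than a ceiling.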

Inference Runtimes

Ollama

The simplest way to run LLMs locally. One-command install, model library, OpenAI-compatible API. Best for development and single-user deployments.

ollama run llama3.2:3b

llama.cpp

High-performance inference, best GGUF support, multi-GPU tensor parallelism, CPU fallback. Ideal for production server-mode deployments.

./llama-server -m model.gguf -ngl 35

LM Studio

Desktop GUI for exploring and running models. Best for non-technical users and rapid model evaluation. Includes local API server mode.

GUI: lmstudio.ai

vLLM

Continuous batching, PagedAttention — highest throughput for multi-user server deployments. OpenAI-compatible. Requires NVIDIA GPU, best on A100/H100.

vllm serve meta-llama/Llama-3.1-8B-Instruct

Quantization

Quantization reduces model precision (from 16-bit to 4–8 bit) to fit larger models in less VRAM with minimal quality loss. GGUF quantization formats for llama.cpp / Ollama:

| Format | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~28% | Very Low | Best all-round (recommended) |
| Q5_K_M | ~35% | Minimal | When you have RAM to spare |
| Q3_K_M | ~22% | Medium | Very constrained VRAM |
| Q2_K | ~16% | High | Experimentation only |
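The "Size vs FP16" column translates directly into file sizes, since FP16 weights occupy about 2 bytes per parameter. A quick sketch applying the table's ratios to a 7B model (the ratios are from the table above; treat the results as estimates, as real GGUF files vary slightly):

```python
# Approximate on-disk size of a quantized GGUF relative to FP16 weights.
# Ratios come from the "Size vs FP16" column above.
RATIOS = {"Q4_K_M": 0.28, "Q5_K_M": 0.35, "Q3_K_M": 0.22, "Q2_K": 0.16}

def quantized_size_gb(params_billion: float, fmt: str) -> float:
    fp16_gb = params_billion * 2  # ~2 bytes per parameter at FP16
    return fp16_gb * RATIOS[fmt]

for fmt in RATIOS:
    print(f"7B {fmt}: ~{quantized_size_gb(7, fmt):.1f} GB")
```

This is why a 7B Q4_K_M fits comfortably on an 8 GB card while the FP16 original (~14 GB) does not.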

Step-by-Step Setup (Ollama)

1. Install Ollama
   curl -fsSL https://ollama.com/install.sh | sh
2. Pull a model
   ollama pull llama3.2:3b
3. Run inference
   ollama run llama3.2:3b
4. Start the API server
   ollama serve   # default: http://localhost:11434
5. Query via curl
   curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hello"}'
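The final curl call can also be issued from code. A minimal sketch using only the standard library to build the same POST request against Ollama's /api/generate endpoint (the commented-out line at the end assumes a server is running on the default port):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build the same POST request as the curl example above.
    stream=False asks Ollama for one complete JSON reply instead of chunks."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()
    return urllib.request.Request(OLLAMA_URL, data=body,
                                  headers={"Content-Type": "application/json"})

req = build_request("llama3.2:3b", "Hello")
# With Ollama running:
# resp = json.load(urllib.request.urlopen(req)); print(resp["response"])
```

Since Ollama also exposes an OpenAI-compatible endpoint, any OpenAI client library pointed at the local base URL works as well.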

Production Deployment

In production, run Ollama or llama.cpp inside Docker alongside your application. Key considerations:

  • Docker Compose: Run Ollama, your API (FastAPI), and database in the same Compose stack on a shared network
  • GPU passthrough: Add deploy.resources.reservations.devices with NVIDIA capability = "gpu" in Docker Compose
  • Context window: raise the context length (the num_ctx option in Ollama requests) for longer conversations; more context = more VRAM
  • Concurrency: Ollama handles one request at a time by default; use vLLM for multi-user production workloads
  • Rate limiting: Protect your endpoint — a single 70B model inference can saturate a GPU for 30+ seconds
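The Compose and GPU-passthrough bullets above combine into a fragment like the following sketch. The service names, volume name, and the ./api build path are placeholders for your own stack; the device reservation uses the standard Compose syntax for NVIDIA GPUs:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama            # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]     # GPU passthrough
  api:
    build: ./api                        # your FastAPI app (placeholder path)
    environment:
      - OLLAMA_HOST=http://ollama:11434 # reach Ollama over the shared network
    depends_on:
      - ollama
volumes:
  ollama:
```

On the shared Compose network, the API container addresses Ollama by service name (ollama:11434) rather than localhost.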

Model Selection Guide

  • General Purpose: Llama 3.1/3.2, Mistral 7B, Qwen2.5 (best quality/size ratio)
  • Code Generation: CodeLlama 13B/34B, DeepSeek Coder, Qwen2.5-Coder (fine-tuned on code datasets)
  • Reasoning: DeepSeek-R1, Qwen3 (thinking), Llama 3.3 70B (extended chain-of-thought)
  • Embedding / RAG: nomic-embed-text, all-MiniLM-L6-v2, mxbai-embed-large (for vector search, not generation)
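The embedding models in the last category return vectors rather than text; a RAG pipeline then ranks documents by vector similarity to the query. A toy sketch of the cosine-similarity step (the three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]                                  # made-up query embedding
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # doc_a points in nearly the same direction as the query
```

In practice a vector database (or even a NumPy matrix product) performs this ranking over thousands of stored embeddings.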
