Local LLM Setup

Running a large language model (LLM) on your own hardware eliminates API costs, keeps your data private, and gives you full control over model selection and inference parameters. This is AI-Radar's flagship guide and the site's most differentiating topic, built on first-hand infrastructure experience.

Hardware Requirements

| Model Size | Min VRAM (quantized) | Recommended GPU | Models |
|---|---|---|---|
| 1–3B | 2–4 GB | GTX 1660 / integrated | Phi-3 Mini, Qwen2 1.5B |
| 7B | 4–6 GB (Q4_K_M) | RTX 3060 12GB / RX 6600 XT | Llama 3.1 8B, Mistral 7B, Qwen2.5 7B |
| 13B | 8–10 GB (Q4_K_M) | RTX 3080 10GB / RTX 4070 | Llama 2 13B, CodeLlama 13B |
| 30–34B | 16–20 GB (Q4_K_M) | RTX 3090 / RTX 4090 | Yi-34B, CodeLlama 34B |
| 70B | 40 GB (Q4_K_M) / 2×24GB | 2× RTX 3090 / A100 40GB | Llama 3.1 70B, Qwen2.5 72B |

For CPU-only inference: expect 5–15× slower than GPU, but feasible for 7B–13B models on modern CPUs with 32+ GB RAM. Apple Silicon (M2/M3/M4) is the best consumer CPU option — unified memory allows 64+ GB effective VRAM at high bandwidth.
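The VRAM figures in the table follow roughly from parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. A back-of-the-envelope estimator (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight size plus a fractional overhead
    for KV cache and runtime buffers (the overhead factor is a guess)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB

# An 8B model at ~4.5 bits/weight (roughly Q4_K_M) lands near the
# table's 7B row; a 70B model at the same precision needs ~47 GB.
print(round(estimate_vram_gb(8, 4.5), 1))
print(round(estimate_vram_gb(70, 4.5), 1))
```

Longer context windows and larger batch sizes grow the KV cache, so treat the overhead factor as a floor rather than a ceiling.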

Inference Runtimes

Ollama

The simplest way to run LLMs locally. One-command install, model library, OpenAI-compatible API. Best for development and single-user deployments.

ollama run llama3.2:3b

llama.cpp

High-performance inference, best GGUF support, multi-GPU tensor parallelism, CPU fallback. Ideal for production server-mode deployments.

./llama-server -m model.gguf -ngl 35

LM Studio

Desktop GUI for exploring and running models. Best for non-technical users and rapid model evaluation. Includes local API server mode.

GUI: lmstudio.ai

vLLM

Continuous batching, PagedAttention — highest throughput for multi-user server deployments. OpenAI-compatible. Requires NVIDIA GPU, best on A100/H100.

vllm serve meta-llama/Llama-3.1-8B-Instruct

Quantization

Quantization reduces model precision (from 16-bit to 4–8 bit) to fit larger models in less VRAM with minimal quality loss. GGUF quantization formats for llama.cpp / Ollama:

| Format | Size vs FP16 | Quality Loss | Use Case |
|---|---|---|---|
| Q4_K_M | ~28% | Very Low | Best all-round (recommended) |
| Q5_K_M | ~35% | Minimal | When you have RAM to spare |
| Q3_K_M | ~22% | Medium | Very constrained VRAM |
| Q2_K | ~16% | High | Experimentation only |
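The "Size vs FP16" column translates directly into file sizes, since FP16 weights occupy about 2 bytes per parameter. A quick sketch applying the table's ratios to a 7B model (the ratios are from the table above; treat the results as estimates, as real GGUF files vary slightly):

```python
# Approximate on-disk size of a quantized GGUF relative to FP16 weights.
# Ratios come from the "Size vs FP16" column above.
RATIOS = {"Q4_K_M": 0.28, "Q5_K_M": 0.35, "Q3_K_M": 0.22, "Q2_K": 0.16}

def quantized_size_gb(params_billion: float, fmt: str) -> float:
    fp16_gb = params_billion * 2  # ~2 bytes per parameter at FP16
    return fp16_gb * RATIOS[fmt]

for fmt in RATIOS:
    print(f"7B {fmt}: ~{quantized_size_gb(7, fmt):.1f} GB")
```

This is why a 7B Q4_K_M fits comfortably on an 8 GB card while the FP16 original (~14 GB) does not.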

Step-by-Step Setup (Ollama)

1. Install Ollama
   curl -fsSL https://ollama.com/install.sh | sh
2. Pull a model
   ollama pull llama3.2:3b
3. Run inference
   ollama run llama3.2:3b
4. Start the API server
   ollama serve   # default: http://localhost:11434
5. Query via curl
   curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"Hello"}'
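The final curl call can also be issued from code. A minimal sketch using only the standard library to build the same POST request against Ollama's /api/generate endpoint (the commented-out line at the end assumes a server is running on the default port):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build the same POST request as the curl example above.
    stream=False asks Ollama for one complete JSON reply instead of chunks."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode()
    return urllib.request.Request(OLLAMA_URL, data=body,
                                  headers={"Content-Type": "application/json"})

req = build_request("llama3.2:3b", "Hello")
# With Ollama running:
# resp = json.load(urllib.request.urlopen(req)); print(resp["response"])
```

Since Ollama also exposes an OpenAI-compatible endpoint, any OpenAI client library pointed at the local base URL works as well.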

Production Deployment

In production, run Ollama or llama.cpp inside Docker alongside your application. Key considerations:

  • Docker Compose: Run Ollama, your API (FastAPI), and database in the same Compose stack on a shared network
  • GPU passthrough: Add deploy.resources.reservations.devices with NVIDIA capability = "gpu" in Docker Compose
  • Context window: raise the context length (the num_ctx option in Ollama requests) for longer conversations; more context = more VRAM
  • Concurrency: Ollama handles one request at a time by default; use vLLM for multi-user production workloads
  • Rate limiting: Protect your endpoint — a single 70B model inference can saturate a GPU for 30+ seconds
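The Compose and GPU-passthrough bullets above combine into a fragment like the following sketch. The service names, volume name, and the ./api build path are placeholders for your own stack; the device reservation uses the standard Compose syntax for NVIDIA GPUs:

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama            # persist pulled models across restarts
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]     # GPU passthrough
  api:
    build: ./api                        # your FastAPI app (placeholder path)
    environment:
      - OLLAMA_HOST=http://ollama:11434 # reach Ollama over the shared network
    depends_on:
      - ollama
volumes:
  ollama:
```

On the shared Compose network, the API container addresses Ollama by service name (ollama:11434) rather than localhost.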

Model Selection Guide

  • General Purpose: Llama 3.1/3.2, Mistral 7B, Qwen2.5 (best quality/size ratio)
  • Code Generation: CodeLlama 13B/34B, DeepSeek Coder, Qwen2.5-Coder (fine-tuned on code datasets)
  • Reasoning: DeepSeek-R1, Qwen3 (thinking), Llama 3.3 70B (extended chain-of-thought)
  • Embedding / RAG: nomic-embed-text, all-MiniLM-L6-v2, mxbai-embed-large (for vector search, not generation)
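The embedding models in the last category return vectors rather than text; a RAG pipeline then ranks documents by vector similarity to the query. A toy sketch of the cosine-similarity step (the three-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]                                  # made-up query embedding
docs = {"doc_a": [0.1, 0.8, 0.3], "doc_b": [0.9, 0.1, 0.0]}
best = max(docs, key=lambda d: cosine_similarity(query, docs[d]))
print(best)  # doc_a points in nearly the same direction as the query
```

In practice a vector database (or even a NumPy matrix product) performs this ranking over thousands of stored embeddings.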
