Should I use RAG or fine-tuning?

Use RAG to give a model knowledge (facts, documents, fresh data); use fine-tuning to change behavior, tone or format. Most production systems start with RAG and fine-tune only if behavior still needs adjusting.

Is RAG cheaper than fine-tuning?

Usually, at least to start: RAG needs no training run, and its knowledge updates as soon as you edit the documents and re-index them. Fine-tuning has an upfront training cost and must be redone when requirements change, though it can lower per-query cost by needing fewer prompt tokens.

Does fine-tuning add knowledge?

Poorly. Fine-tuning teaches patterns and style, not reliable facts — trying to inject knowledge this way is expensive and prone to hallucination. Use RAG for knowledge instead.

Can I use both together?

Yes, and for production assistants it is often the strongest setup: a fine-tuned model for consistent behavior plus RAG for current, citable knowledge.

Do huge context windows make RAG obsolete?

No. Stuffing everything into a 1M-token context costs far more per query, is slower, and suffers from degraded recall in very long contexts. Retrieval stays the efficient way to select what matters; long context helps RAG by allowing richer retrieved sets.

How many examples do I need to fine-tune?

For LoRA on a narrow behavior, roughly 500-5,000 high-quality examples often suffice; quality and consistency matter far more than volume. Broad behavior changes need more examples and careful evaluation.

RAG vs Fine-Tuning (2026): Which Should You Use?

The most common and most expensive mistake is fine-tuning to add knowledge — a slow, costly way to do what RAG does cheaply and instantly, and one that often makes hallucination worse. The right question is never "which is better" but "what am I trying to change: what the model knows, or how it behaves?" Answer that and the choice is obvious.

Side by side

Criterion	RAG	Fine-tuning
Changes	Knowledge	Behavior/style
Update data	Instant (edit docs)	Retrain needed
Upfront cost	Low	Higher (training + dataset)
Per-query cost	Higher (longer prompts)	Lower (fewer tokens)
Hallucination	Lower (cites sources)	Unchanged (worse if used to inject facts)
Data freshness	As current as the index	Frozen at training
Auditability	High (visible sources)	Low (behavior is implicit)
Best for	Docs, FAQs, fresh facts	Tone, format, narrow tasks
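
The two cost rows pull in opposite directions, so the useful number is the break-even volume. A rough sketch follows; every figure in it (token price, prompt sizes, training cost) is a hypothetical placeholder, and it only applies when the tuned model can really do the job with the shorter prompt.

```python
# Hypothetical figures for illustration only; substitute your own.
PRICE_PER_1K_INPUT_TOKENS = 0.001   # assumed price in dollars
LONG_PROMPT_TOKENS = 4_000          # assumed: instructions + retrieved chunks
TUNED_PROMPT_TOKENS = 500           # assumed: short prompt, behavior baked in
FINE_TUNE_UPFRONT_COST = 2_000.0    # assumed: dataset work + training run


def per_query_cost(prompt_tokens: int) -> float:
    """Input-token cost of one query in dollars, ignoring output tokens."""
    return prompt_tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS


def break_even_queries() -> float:
    """Queries needed before per-query savings repay the upfront cost."""
    saving = per_query_cost(LONG_PROMPT_TOKENS) - per_query_cost(TUNED_PROMPT_TOKENS)
    return FINE_TUNE_UPFRONT_COST / saving


print(f"Break-even after about {break_even_queries():,.0f} queries")
```

With these made-up numbers the training run only pays for itself after several hundred thousand queries, which is why the upfront row usually decides the matter for low-volume systems.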

RAG, beyond the one-paragraph version

The basic loop: documents → chunks → embeddings → vector DB; at query time retrieve the most relevant chunks, inject them into the prompt, answer with citations. Knowledge updates by editing documents — no retraining. That's the brochure. Production RAG quality is decided by the unglamorous parts:

Hybrid search + reranking. Pure vector similarity often misses exact identifiers (codes, names, article numbers); combining it with keyword search (BM25) and reranking the merged candidates is often the single biggest quality jump in a RAG system.
Structure-aware chunking. Splitting along the document's structure (headings, paragraphs, tables kept whole) beats fixed-size splitting; each chunk must make sense on its own.
Query transformation. Users ask vague questions; rewriting or expanding the query before retrieval (or retrieving for several reformulations and merging the results) can lift recall substantially.
Agentic / multi-hop RAG. For questions that span documents ("compare policy A with contract B"), a single retrieval isn't enough — the model retrieves, reads, then retrieves again. Costlier, sometimes necessary.
Where it fails: retrieval misses (the answer wasn't in the top-k), stale indexes, contradictory documents, and questions that need synthesis across everything ("what are our top risks?") rather than lookup — RAG is a lookup mechanism, not an analyst.
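
The hybrid-search step above can be sketched with reciprocal rank fusion (RRF), one common way to merge a keyword ranking with a vector ranking. The article does not name a fusion method, so RRF is an assumption here, and the chunk IDs and the two hit lists are hypothetical stand-ins for real BM25 and vector-search output.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so a chunk ranked well by several retrievers rises to the top.
    k=60 is a conventional default; treat it as a tunable assumption.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by score, descending; break ties by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical outputs of two retrievers for the same query.
bm25_hits = ["doc-art-4711", "doc-faq-2", "doc-policy-9"]   # keyword ranking
vector_hits = ["doc-faq-2", "doc-guide-5", "doc-art-4711"]  # embedding ranking

candidates = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(candidates)
```

Chunks found by both retrievers end up ahead of chunks found by only one. In a full pipeline the merged candidates would then go to a reranker (for example a cross-encoder), which is not shown here.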

Fine-tuning, beyond the one-paragraph version

Fine-tuning continues training on your examples so the model internalizes a behavior. In practice almost nobody retrains all weights: LoRA trains small low-rank adapter matrices alongside frozen weights (~0.1–1% of parameters), and QLoRA does the same on top of a 4-bit-quantized base, which is why a 70B model can be fine-tuned on a single 48GB GPU. The adapter is a small file you can load, swap or stack at serving time.
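
To see where the sub-1% figure comes from, here is the arithmetic for a single weight matrix. The layer size and rank are assumed example values; for a whole model the fraction depends on which matrices get adapters.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable LoRA parameters relative to one frozen weight matrix.

    The frozen weight W has d_out * d_in entries. LoRA learns the update
    as B @ A with B of shape (d_out, rank) and A of shape (rank, d_in),
    i.e. rank * (d_out + d_in) trainable entries.
    """
    full = d_out * d_in
    adapter = rank * (d_out + d_in)
    return adapter / full


# Assumed example: a 4096 x 4096 projection with rank 8.
fraction = lora_trainable_fraction(4096, 4096, rank=8)
print(f"{fraction:.2%}")  # 0.39%
```

Doubling the rank doubles the adapter size, which is one reason modest ranks are the usual starting point.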

Where it shines: rigid output formats ("always this JSON schema" — far more reliable than prompting), brand voice, classification/extraction at scale, domain jargon fluency, and distillation — training a small model on a big model's outputs so the cheap model does the job in production. That last one is the most underused cost lever in local AI.
The dataset is the product. ~500–5,000 high-quality, consistent examples typically beat 50,000 noisy ones for a narrow behavior. Every inconsistency in your training data becomes a behavior the model learns.
The risks: catastrophic forgetting (over-aggressive tuning degrades general ability — keep rank/epochs modest and eval broadly), staleness (a new, better base model ships next quarter and your adapter doesn't transfer — budget to redo it), and false confidence (a tuned model sounds more on-brand even when wrong).
Beyond supervised: preference tuning (DPO) shapes style/judgment from chosen-vs-rejected pairs — useful once you have user feedback, overkill before.
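
Since the dataset is the product, a cheap consistency check before training pays off. The sketch below assumes a prompt/completion record format and a two-key target schema; both are hypothetical, as are the example tickets.

```python
import json

# Assumed target schema for a "reply in our JSON schema" behavior.
EXPECTED_KEYS = {"category", "priority"}

# Hypothetical training records.
examples = [
    {"prompt": "Printer is on fire", "completion": '{"category": "hardware", "priority": "high"}'},
    {"prompt": "Forgot my password", "completion": '{"category": "account", "priority": "low"}'},
    {"prompt": "VPN drops hourly", "completion": '{"category": "network", "urgency": "medium"}'},
    {"prompt": "Laptop is slow", "completion": "Category: hardware, priority: low"},
]


def find_inconsistent(records: list[dict]) -> list[tuple[int, str]]:
    """Return (index, reason) for every completion that breaks the schema."""
    problems = []
    for i, record in enumerate(records):
        try:
            parsed = json.loads(record["completion"])
        except json.JSONDecodeError:
            problems.append((i, "not valid JSON"))
            continue
        if not isinstance(parsed, dict) or set(parsed) != EXPECTED_KEYS:
            problems.append((i, "unexpected keys"))
    return problems


for index, reason in find_inconsistent(examples):
    print(index, reason)
```

Records 2 and 3 are exactly the kind of noise a model would otherwise learn as policy: one renames a key, the other drops the format entirely.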

"But long context kills RAG" — no, it doesn't

Models with giant context windows tempt a shortcut: skip retrieval, paste everything in. Three reasons this loses in production: cost — every query pays for hundreds of thousands of tokens the answer didn't need (and prompt processing time to match); recall — models demonstrably lose precision in very long contexts (the "needle in a haystack" gets harder as the haystack grows, especially mid-context); freshness/permissions — you still need a system that knows which documents exist and who may see them, which is… a retrieval system. Long context is a gift to RAG: it lets you retrieve richer, longer chunks without triage anxiety. It doesn't replace selecting what matters.
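
The permissions point is easy to underestimate, so here is a minimal sketch of what "who may see them" means in code: filter by access rights first, then rank. The `Chunk` fields, group names and scores are all hypothetical; a real system would usually push this filter into the vector store query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset   # hypothetical ACL stored with the chunk
    score: float                # hypothetical relevance score for the query


def retrieve_for_user(chunks: list, user_groups: set, top_k: int = 2) -> list:
    """Keep only chunks the user may see, then take the top_k by score."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


index = [
    Chunk("hr-salaries", "Salary bands by level", frozenset({"hr"}), 0.91),
    Chunk("handbook", "Vacation policy", frozenset({"hr", "staff"}), 0.84),
    Chunk("it-faq", "Resetting your password", frozenset({"staff"}), 0.40),
]

prompt_chunks = retrieve_for_user(index, user_groups={"staff"})
print([c.doc_id for c in prompt_chunks])  # ['handbook', 'it-faq']
```

The highest-scoring chunk never reaches the prompt because the user is not in the `hr` group. Pasting the whole corpus into a long context has no equivalent of this step.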

Combining them (the production pattern)

Fine-tune for behavior + RAG for knowledge — the standard strong setup: the adapter enforces tone, format and citation discipline; retrieval supplies current facts.
Fine-tune the model to be better at RAG (sometimes called RAFT, retrieval-augmented fine-tuning): train on examples of answering from provided context, quoting accurately and refusing when the context lacks the answer. This fixes the most annoying RAG behaviors directly.
Distill + RAG: small fine-tuned model with good retrieval ≈ big generic model, at a fraction of the serving cost — the sweet spot for on-prem deployments with limited VRAM (see the private-ChatGPT guide).
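
For the second pattern, training data might look like the sketch below: each example pairs a context and question with either a grounded answer or a refusal. The prompt layout, the refusal wording and the sample texts are assumptions for illustration, not a prescribed format.

```python
from typing import Optional

REFUSAL = "The provided context does not contain the answer."  # assumed wording


def build_example(question: str, context: list, answer: Optional[str]) -> dict:
    """Format one supervised example for answering from provided context.

    answer=None marks a case where the context lacks the answer, so the
    training target is a refusal rather than a guess.
    """
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(context, start=1))
    prompt = f"Context:\n{numbered}\n\nQuestion: {question}"
    completion = answer if answer is not None else REFUSAL
    return {"prompt": prompt, "completion": completion}


# Hypothetical answerable case: the target quotes the chunk and cites it.
answerable = build_example(
    "How long is the return window?",
    ["Items may be returned within 30 days of delivery.", "Shipping is free above 50 EUR."],
    'The return window is 30 days: "Items may be returned within 30 days of delivery." [1]',
)

# Hypothetical unanswerable case: the context is silent, so the target refuses.
unanswerable = build_example(
    "Who approves refunds above 500 EUR?",
    ["Items may be returned within 30 days of delivery."],
    None,
)

print(unanswerable["completion"])
```

Mixing both kinds of example is what teaches the refusal behavior; a dataset made only of answerable cases would teach the model that the context always contains an answer.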

Decision matrix by scenario

Support bot over your docs → RAG (+ later, light LoRA for tone).
"Answer strictly in our JSON schema" → fine-tuning (structured output is its home turf).
Classifier / extractor at high volume → fine-tune a small model; serve it cheap.
Brand-voice writer → fine-tuning (style is behavior).
Domain expert assistant (law, pharma, engineering) → RAG for the corpus + fine-tune for domain style and citation discipline.
Freshness-critical (prices, policies, tickets) → RAG only; retrain cycles can't chase reality.

Common mistakes

Fine-tuning to add facts. The classic. Expensive, unreliable, hallucination-prone — that's RAG's job.
Blaming the model for retrieval failures. Before upgrading to a 70B, log what was actually retrieved: most "wrong answers" are wrong chunks.
Skipping the eval set. Keep ~50–100 known-answer questions and measure them before and after every change; otherwise you're steering by anecdote.
Training on inconsistent data. The model learns your dataset's noise as policy.
Jumping to fine-tuning before exhausting prompting. A good system prompt plus few-shot examples solves more than teams expect, at zero training cost.
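
Two of these mistakes (blaming the model for retrieval failures, skipping the eval set) share one fix: an eval loop that scores retrieval and answers separately. A minimal sketch, assuming a hypothetical eval-set format and stand-in `fake_retrieve` / `fake_answer` functions where your real retriever and model call would go:

```python
from typing import Callable

# Hypothetical eval set: question, the chunk that holds the answer, expected answer.
EVAL_SET = [
    {"question": "What is the return window?", "gold_chunk": "policy-3", "answer": "30 days"},
    {"question": "Who approves refunds?", "gold_chunk": "policy-7", "answer": "team lead"},
]


def evaluate(retrieve: Callable[[str], list], answer: Callable[[str, list], str]) -> dict:
    """Score retrieval and answers separately so failures are attributable."""
    retrieval_hits = 0
    correct = 0
    for case in EVAL_SET:
        chunk_ids = retrieve(case["question"])
        retrieval_hits += case["gold_chunk"] in chunk_ids
        reply = answer(case["question"], chunk_ids)
        # Crude substring match; a real harness may need a stricter grader.
        correct += case["answer"].lower() in reply.lower()
    n = len(EVAL_SET)
    return {"retrieval_hit_rate": retrieval_hits / n, "answer_accuracy": correct / n}


# Stand-in pipeline pieces; replace with your real retriever and model call.
def fake_retrieve(question: str) -> list:
    return ["policy-3", "faq-1"]   # always returns the same two chunks


def fake_answer(question: str, chunk_ids: list) -> str:
    return "Returns are accepted within 30 days."


print(evaluate(fake_retrieve, fake_answer))
```

When answer accuracy is low and the retrieval hit rate is equally low, the problem is the chunks, not the model, and a bigger model won't fix it.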