Fine-tuning adapts a general-purpose pre-trained model (like Llama 3) to a specific domain, task, or communication style by training it further on curated data. Unlike retrieval-augmented generation (RAG), fine-tuning permanently changes the model's weights.
## Fine-Tuning Methods
### Full Fine-Tuning
All model weights are updated. Best quality results, but requires enormous VRAM: a 70B model needs ~140 GB just to hold fp16 weights, and several times that once gradients and optimizer states are added. Rarely practical on-premises.
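The VRAM figure above can be sketched with standard mixed-precision Adam accounting (fp16 weights and gradients, fp32 optimizer moments and master weights). The byte counts per parameter are the usual textbook assumption, not a measurement of any specific trainer:

```python
# Rough VRAM estimate for full fine-tuning with mixed-precision Adam.
# Byte counts per parameter are the conventional accounting (an assumption),
# and activations/KV cache are excluded entirely.
def full_ft_memory_gb(n_params: float) -> dict:
    bytes_per_param = {
        "weights_fp16": 2,
        "grads_fp16": 2,
        "adam_m_fp32": 4,
        "adam_v_fp32": 4,
        "master_fp32": 4,
    }
    return {k: n_params * b / 1e9 for k, b in bytes_per_param.items()}

est = full_ft_memory_gb(70e9)
total = sum(est.values())
# weights alone: 140 GB; with grads + optimizer state: 1120 GB, before activations
```

Under these assumptions the ~140 GB in the text is the fp16 weights alone; the full training footprint is roughly 8× that.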
### SFT (Supervised Fine-Tuning)
Training on paired instruction-response examples. Standard first step for chat models. Can be done as full fine-tuning or with PEFT methods.
### LoRA / QLoRA
Parameter-efficient: the base weights stay frozen and only small low-rank adapter matrices are trained. With QLoRA, a 7B model is fine-tunable on a single 24 GB consumer GPU. The industry default.
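The adapter idea reduces to a few lines of linear algebra. A minimal NumPy sketch (dimensions and init scheme are illustrative, not tied to any specific library): the frozen weight `W` gets a trainable low-rank update `B @ A`, scaled by `alpha / r` as in the LoRA paper.

```python
import numpy as np

# Minimal LoRA sketch: W is frozen; only A (r x d_in) and B (d_out x r)
# would be trained. Effective weight: W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank path; since B starts at zero,
    # the adapted model initially matches the base model exactly.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # adapters are a no-op at init
```

The efficiency comes from the parameter count: here `A` and `B` hold `r * (d_in + d_out) = 1024` trainable values versus 4096 in `W`, and the gap widens rapidly at real model dimensions.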
### Continued Pre-training
Run next-token prediction on raw domain text (manuals, code, scientific papers) before SFT. Injects domain knowledge at the weight level.
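Next-token prediction needs no labels beyond the text itself: every token is the training target for the tokens preceding it. A toy sketch with a whitespace "tokenizer" (real pipelines use subword tokenizers; this only illustrates how raw text becomes training pairs):

```python
# Continued pre-training objective in miniature: turn raw domain text
# into (context, next_token) pairs. Whitespace split stands in for a
# real tokenizer; the sentence is an invented example.
text = "the pump manual describes the pump impeller"
tokens = text.split()

# Each position supplies one training example: predict tokens[i]
# from everything before it.
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. first pair: (["the"], "pump")
```

This is why raw manuals, code, and papers suffice here, whereas SFT requires hand-curated instruction-response pairs.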
## Fine-Tuning vs RAG
| | Fine-Tuning | RAG |
|---|---|---|
| Updates weights | Yes | No |
| Update cost | High (retrain) | Low (re-embed) |
| Knowledge freshness | Static post-training | Real-time |
| Hallucination risk | Baked-in errors | Grounded in source |
| Best for | Style, tone, task logic | Factual, changing data |
## Dataset Requirements
For SFT, 500–2,000 high-quality examples are often sufficient for task adaptation. For continued pre-training, millions of tokens of domain text are needed for measurable gains. Curate carefully — garbage data degrades the base model. Tools: Axolotl, LLaMA-Factory, Unsloth (advertised as roughly 2× faster than standard Hugging Face Transformers training).
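A concrete SFT record and a minimal curation gate make the "high-quality examples" requirement tangible. The Alpaca-style field names (`instruction`/`input`/`output`) and the record contents are illustrative assumptions — check the schema your trainer (Axolotl, LLaMA-Factory, etc.) actually expects:

```python
import json

# Hypothetical SFT record in the common Alpaca-style schema.
# Field names and content are illustrative, not tool-specific.
record = {
    "instruction": "Summarize the maintenance steps from the pump manual.",
    "input": "",
    "output": "1. Isolate power. 2. Drain the casing. 3. Inspect the impeller.",
}

def validate(rec: dict) -> bool:
    # Minimal curation gate: required keys present and a non-empty response.
    # Real pipelines add dedup, length limits, and manual review on top.
    required = {"instruction", "output"}
    return required <= rec.keys() and bool(rec["output"].strip())

line = json.dumps(record)        # one record per line in a .jsonl file
assert validate(json.loads(line))
```

Even a trivial gate like this catches empty or malformed records before they reach training, which matters because a few hundred bad examples is a large fraction of a 500–2,000-example dataset.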