LoRA (Low-Rank Adaptation, Hu et al. 2021) avoids updating all model weights by adding small trainable matrices alongside the frozen pretrained weights. QLoRA (Dettmers et al. 2023) extends this by quantizing the frozen base model to 4-bit, making large-model fine-tuning accessible on consumer hardware.
How LoRA Works
For a weight matrix W (d×k), LoRA adds a low-rank update: W' = W + (α/r) · A × B, where A is (d×r), B is (r×k), and the rank r ≪ min(d, k); α is the scaling factor listed in the table below. Only A and B (typically 0.1–1% of the original parameters) are trained, and one of the two matrices is initialised to zero so training starts from the pretrained model's behaviour. After training, the adapter can be merged into the base weights (zero inference overhead) or kept separate for hot-swapping.
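A minimal sketch of this mechanism in PyTorch (the `LoRALinear` class name and the initialisation choices are illustrative, not taken from any library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained Linear plus a trainable low-rank update scaled by alpha/r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(d, r))         # (d × r), zero init so ΔW = 0 at the start
        self.B = nn.Parameter(torch.randn(r, k) * 0.01)  # (r × k), small random init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank path; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # fold the adapter into the base weight for zero-overhead inference
        self.base.weight.data += self.scaling * (self.A @ self.B)
        return self.base
```

Wrapping only the attention projections (e.g. q_proj and v_proj) of a transformer with a layer like this reproduces the minimal setup described in the table below.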
Key Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| rank (r) | 4–64 | Higher = more capacity, more VRAM |
| alpha | = r or 2×r | Update scaled by α/r (acts like a learning-rate multiplier for the adapter) |
| target modules | q_proj, v_proj (min), all linear (max) | Which layers get adapters |
| dropout | 0.05–0.1 | Regularisation for small datasets |
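These knobs map directly onto Hugging Face PEFT's `LoraConfig`; the example below is a hedged sketch using placeholder values and a placeholder model name:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder base model
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # alpha = 2×r
    target_modules=["q_proj", "v_proj"],   # minimal attention-only setup
    lora_dropout=0.05,                     # regularisation for small datasets
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```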
LoRA vs QLoRA
LoRA
Base model kept in FP16/BF16. Best quality. Training needs roughly 2× the inference VRAM (activations plus the adapter's gradients and optimiser states). 7B: ~28 GB, 70B: ~140 GB.
QLoRA
Base model quantised to NF4 (4-bit), adapters kept in BF16; weights are dequantised on the fly during the forward pass. 7B: ~6 GB, 70B: ~35 GB. Near-LoRA quality on consumer hardware.
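A sketch of the QLoRA setup using transformers, bitsandbytes and PEFT (model name and hyperparameters are placeholders; assumes a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantisation of the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to BF16 on the fly for matmuls
    bnb_4bit_use_double_quant=True,         # also quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing + numeric casts

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)      # BF16 adapters on top of the 4-bit base
```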
VRAM Requirements for Fine-Tuning
| Model | Full FT | LoRA FP16 | QLoRA 4-bit |
|---|---|---|---|
| 7B | 112 GB | 28 GB | 6–8 GB |
| 13B | 208 GB | 52 GB | 10–12 GB |
| 70B | ~1 TB | 140 GB | 35–40 GB |
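The rough arithmetic behind these figures, as a back-of-envelope sketch that ignores activations, KV cache and framework overhead (so real usage sits a little higher):

```python
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM in GB: parameter count (billions) × bytes held per parameter."""
    return params_billion * bytes_per_param

# Full fine-tuning with mixed-precision Adam: ~16 bytes/param
# (2 B FP16 weights + 2 B FP16 grads + 4 B FP32 master weights + 8 B FP32 Adam moments)
print(vram_gb(7, 16))      # ≈ 112 GB

# LoRA: FP16 base weights (~2 bytes/param), roughly doubled by activations
# and the adapter's own gradients/optimiser states → ~2× the inference footprint
print(vram_gb(7, 2) * 2)   # ≈ 28 GB

# QLoRA: NF4 base (~0.5 bytes/param) plus a few GB for adapters, activations and overhead
print(vram_gb(7, 0.5))     # ≈ 3.5 GB of weights; the total lands around 6–8 GB
```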
Recommended toolkits: Unsloth (reports roughly 2× faster training than stock Hugging Face PEFT; supports Llama/Mixtral/Qwen), LLaMA-Factory (GUI + CLI), and Axolotl (config-driven, production-grade).