LoRA & QLoRA

Training

Parameter-efficient fine-tuning adapters that inject trainable low-rank matrices into transformer layers, allowing even 65–70B models to be fine-tuned on a single 48 GB GPU.

LoRA (Low-Rank Adaptation, Hu et al. 2021) avoids updating all model weights by adding small trainable matrices alongside the frozen pretrained weights. QLoRA (Dettmers et al. 2023) extends this by quantizing the frozen base model to 4-bit NF4, making large-model fine-tuning accessible on consumer hardware.

How LoRA Works

For a weight matrix W (d×k), LoRA adds a scaled low-rank update: W' = W + (α/r)·A×B, where A is (d×r), B is (r×k), and the rank r ≪ min(d, k). Only A and B (typically 0.1–1% of the original parameter count) are trained; W stays frozen. After training, the adapter can be merged into the base weights (zero inference overhead) or kept separate for hot-swapping.
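To make the mechanics concrete, here is a minimal PyTorch sketch of wrapping a frozen linear layer with such an adapter; the LoRALinear class, its defaults and the merge() helper are illustrative, not taken from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # W stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(d, r))         # d x r, zero-init so the update starts at 0
        self.B = nn.Parameter(torch.randn(r, k) * 0.01)  # r x k, small random init
        self.scaling = alpha / r                         # the alpha/r factor from the formula above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W'x = Wx + (alpha/r) * (A x B) x ; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def merge(self) -> None:
        # Fold the adapter into W for zero-overhead inference
        self.base.weight.add_(self.scaling * (self.A @ self.B))
```

Wrapping each attention projection (e.g. q_proj, v_proj) in a module like this, while everything else stays frozen, is essentially what PEFT-style libraries automate.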

Key Hyperparameters

| Parameter | Typical value | Effect |
|---|---|---|
| rank (r) | 4–64 | Higher = more capacity, more VRAM |
| alpha | = r or 2×r | Scaling factor for the adapter update (acts like an effective learning rate) |
| target modules | q_proj, v_proj (minimum) up to all linear layers (maximum) | Which layers get adapters |
| dropout | 0.05–0.1 | Regularisation for small datasets |
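These hyperparameters map directly onto Hugging Face PEFT's LoraConfig; a sketch with a placeholder model id and illustrative values:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # any causal LM

config = LoraConfig(
    r=16,                                 # rank
    lora_alpha=32,                        # alpha = 2 x r
    target_modules=["q_proj", "v_proj"],  # minimal set; list all linear projections for maximum capacity
    lora_dropout=0.05,                    # regularisation for small datasets
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```

After training, calling merge_and_unload() on the PEFT model folds the adapters back into the base weights, matching the merge option described above.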

LoRA vs QLoRA

LoRA

Base model in FP16/BF16. Best quality. Training needs roughly 2× the FP16 inference footprint (activations plus adapter gradients and optimiser states). 7B: ~28 GB, 70B: ~140 GB.

QLoRA

Base model in NF4 (4-bit). Adapter in BF16. Dequantisation on the fly during forward pass. 7B: ~6GB, 70B: ~35GB. Near-LoRA quality with consumer hardware.
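A sketch of a typical QLoRA setup using transformers' BitsAndBytesConfig together with PEFT; the model id and adapter settings are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 for the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # dequantise to BF16 on the fly for matmuls
    bnb_4bit_use_double_quant=True,          # also quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms to FP32, enables gradient checkpointing
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```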

VRAM Requirements for Fine-Tuning

| Model | Full FT | LoRA FP16 | QLoRA 4-bit |
|---|---|---|---|
| 7B | 112 GB | 28 GB | 6–8 GB |
| 13B | 208 GB | 52 GB | 10–12 GB |
| 70B | ~1 TB | 140 GB | 35–40 GB |
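The table follows from weights-only arithmetic: roughly 16 bytes/parameter for full fine-tuning with Adam (FP16 weights + FP32 master copy + optimiser states + gradients), 2 bytes for an FP16 base, 0.5 bytes for NF4, plus training overhead on top. A back-of-the-envelope helper, purely illustrative:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weights-only estimate; real usage adds activations, KV cache and framework overhead."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

print(weight_memory_gb(7, 16))    # 112 GB -> full fine-tune with Adam
print(weight_memory_gb(7, 2))     # 14 GB  -> FP16 base; roughly doubles in practice for LoRA training
print(weight_memory_gb(70, 0.5))  # 35 GB  -> NF4 base; QLoRA's 35-40 GB adds adapters and activations
```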

Recommended toolkits: Unsloth (claims roughly 2× faster training than Hugging Face PEFT; supports Llama, Mixtral and Qwen), LLaMA-Factory (GUI + CLI), Axolotl (config-driven, production-grade).