LoRA (Low-Rank Adaptation, Hu et al. 2021) avoids updating all model weights by adding small trainable matrices alongside the frozen pretrained weights. QLoRA (Dettmers et al. 2023) extends this by quantizing the frozen base model to 4-bit, making large-model fine-tuning accessible on consumer hardware.
How LoRA Works
For a weight matrix W (d×k), LoRA adds a low-rank update: W' = W + (α/r) · A × B, where A is (d×r), B is (r×k), and the rank r ≪ min(d, k); α is the scaling factor listed in the table below. Only A and B (typically 0.1–1% of the original parameters) are trained, and one of the two matrices is initialised to zero so training starts from the pretrained model's behaviour. After training, the adapter can be merged into the base weights (zero inference overhead) or kept separate for hot-swapping.
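A minimal sketch of this mechanism in PyTorch (the `LoRALinear` class name and the initialisation choices are illustrative, not taken from any library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained Linear plus a trainable low-rank update scaled by alpha/r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze W (and bias)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.zeros(d, r))         # (d × r), zero init so ΔW = 0 at the start
        self.B = nn.Parameter(torch.randn(r, k) * 0.01)  # (r × k), small random init
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank path; only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.B.T @ self.A.T)

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        # fold the adapter into the base weight for zero-overhead inference
        self.base.weight.data += self.scaling * (self.A @ self.B)
        return self.base
```

Wrapping only the attention projections (e.g. q_proj and v_proj) of a transformer with a layer like this reproduces the minimal setup described in the table below.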
Key Hyperparameters
| Parameter | Typical Value | Effect |
|---|---|---|
| rank (r) | 4–64 | Higher = more capacity, more VRAM |
| alpha | = r or 2×r | Update scaled by α/r (acts like a learning-rate multiplier for the adapter) |
| target modules | q_proj, v_proj (min), all linear (max) | Which layers get adapters |
| dropout | 0.05–0.1 | Regularisation for small datasets |
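These knobs map directly onto Hugging Face PEFT's `LoraConfig`; the example below is a hedged sketch using placeholder values and a placeholder model name:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder base model
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # alpha = 2×r
    target_modules=["q_proj", "v_proj"],   # minimal attention-only setup
    lora_dropout=0.05,                     # regularisation for small datasets
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```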
LoRA vs QLoRA
LoRA
Base model kept in FP16/BF16. Best quality. Training needs roughly 2× the inference VRAM (activations plus the adapter's gradients and optimiser states). 7B: ~28 GB, 70B: ~140 GB.
QLoRA
Base model quantised to NF4 (4-bit), adapters kept in BF16; weights are dequantised on the fly during the forward pass. 7B: ~6 GB, 70B: ~35 GB. Near-LoRA quality on consumer hardware.
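A sketch of the QLoRA setup using transformers, bitsandbytes and PEFT (model name and hyperparameters are placeholders; assumes a CUDA GPU):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantisation of the frozen base weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantise to BF16 on the fly for matmuls
    bnb_4bit_use_double_quant=True,         # also quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # gradient checkpointing + numeric casts

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)      # BF16 adapters on top of the 4-bit base
```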
VRAM Requirements for Fine-Tuning
| Model | Full FT | LoRA FP16 | QLoRA 4-bit |
|---|---|---|---|
| 7B | 112 GB | 28 GB | 6–8 GB |
| 13B | 208 GB | 52 GB | 10–12 GB |
| 70B | ~1 TB | 140 GB | 35–40 GB |
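The rough arithmetic behind these figures, as a back-of-envelope sketch that ignores activations, KV cache and framework overhead (so real usage sits a little higher):

```python
def vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate VRAM in GB: parameter count (billions) × bytes held per parameter."""
    return params_billion * bytes_per_param

# Full fine-tuning with mixed-precision Adam: ~16 bytes/param
# (2 B FP16 weights + 2 B FP16 grads + 4 B FP32 master weights + 8 B FP32 Adam moments)
print(vram_gb(7, 16))      # ≈ 112 GB

# LoRA: FP16 base weights (~2 bytes/param), roughly doubled by activations
# and the adapter's own gradients/optimiser states → ~2× the inference footprint
print(vram_gb(7, 2) * 2)   # ≈ 28 GB

# QLoRA: NF4 base (~0.5 bytes/param) plus a few GB for adapters, activations and overhead
print(vram_gb(7, 0.5))     # ≈ 3.5 GB of weights; the total lands around 6–8 GB
```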
Recommended toolkits: Unsloth (reports roughly 2× faster training than stock Hugging Face PEFT; supports Llama/Mixtral/Qwen), LLaMA-Factory (GUI + CLI), and Axolotl (config-driven, production-grade).