Perplexity (PPL)

Metric

A measure of how well a language model predicts a sample of text; lower perplexity means better language modelling. It is commonly used to quantify quality degradation from quantization.

Perplexity is the exponentiated average negative log-likelihood of a text sequence under the model. Intuitively, it measures how "surprised" the model is by the text — a perfect model that always assigns probability 1 to the correct next token has PPL = 1.

Formula

PPL = exp( -(1/N) × Σ_{i=1}^{N} log P(token_i | token_1, …, token_{i-1}) )

where N is the number of tokens in the sequence.

A model with PPL = 5 on a dataset is, on average, as uncertain as if it were choosing uniformly among 5 equally likely next tokens at each step. Lower is better.
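The formula and the "choosing among 5 tokens" intuition can be sketched in a few lines. The probabilities below are hypothetical stand-ins for whatever a real model would assign to each actual next token:

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood of a sequence.

    token_probs: the probability the model assigned to each actual
    next token in the text, one value per token.
    """
    n = len(token_probs)
    avg_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_nll)

# A perfect model that always assigns probability 1 to the correct token:
print(perplexity([1.0, 1.0, 1.0]))   # PPL = 1

# A model that spreads its mass over 5 equally likely tokens every step:
print(perplexity([0.2] * 10))        # PPL ≈ 5
```

This mirrors the definition directly: uniform probability 1/k over the correct tokens yields PPL = k, which is why PPL reads as an "effective branching factor."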

PPL as a Quantization Quality Metric

| Format | Llama 3 8B PPL (WikiText-2) | Degradation vs FP16 |
|--------|------------------------------|---------------------|
| FP16   | 6.14 | — (baseline) |
| Q8_0   | 6.17 | +0.5%  |
| Q6_K   | 6.20 | +1.0%  |
| Q5_K_M | 6.25 | +1.8%  |
| Q4_K_M | 6.35 | +3.4%  |
| Q3_K_M | 6.73 | +9.6%  |
| Q2_K   | 8.10 | +32%   |
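The degradation column follows directly from the PPL values. A quick sketch to reproduce it, with the numbers copied from the table above:

```python
# PPL values from the table above (Llama 3 8B, WikiText-2).
fp16_ppl = 6.14
quant_ppl = {
    "Q8_0":   6.17,
    "Q6_K":   6.20,
    "Q5_K_M": 6.25,
    "Q4_K_M": 6.35,
    "Q3_K_M": 6.73,
    "Q2_K":   8.10,
}

# Relative degradation: (quantized PPL - FP16 PPL) / FP16 PPL.
for fmt, ppl in quant_ppl.items():
    degradation = (ppl - fp16_ppl) / fp16_ppl * 100
    print(f"{fmt}: +{degradation:.1f}%")
```

Note the non-linear pattern: each step down from Q8_0 to Q4_K_M costs only fractions of a percent to a few percent, while Q3_K_M and especially Q2_K fall off much more sharply.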

Limitations of PPL as a Quality Proxy

PPL measures general language fluency, not task competence. A ~3% perplexity increase from Q4_K_M quantization may cause no perceptible degradation on most chat or instruction-following tasks, but a ~32% increase (Q2_K) degrades coherence noticeably. For task-specific evaluation, always measure on your actual use case (benchmark accuracy, ROUGE, human preference) rather than relying solely on PPL.