Perplexity is the exponentiated average negative log-likelihood of a text sequence under the model. Intuitively, it measures how "surprised" the model is by the text — a perfect model that always assigns probability 1 to the correct next token has PPL = 1.
## Formula
PPL = exp( -1/N × Σᵢ log P(token_i | context) )

where N is the number of tokens in the sequence and the sum runs over i = 1…N.
A model with PPL = 5 on a dataset is, on average, as uncertain as if it were choosing uniformly among 5 equally likely next tokens. Lower is better.
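The formula above is straightforward to compute if you have the model's log-probability for each actual next token. A minimal sketch (the `perplexity` helper is illustrative, not from any particular library):

```python
import math

def perplexity(log_probs: list[float]) -> float:
    """PPL = exp(-1/N * sum of log-probs), where log_probs[i] is the
    natural-log probability the model assigned to the actual token i."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)

# Sanity check of the intuition above: a model that assigns P = 0.2
# to every correct next token is "choosing among 5 options" -> PPL ≈ 5.
print(perplexity([math.log(0.2)] * 10))
```

Note the exponentiation at the end: the average negative log-likelihood itself (in nats, or bits if you use log base 2) is the quantity models are actually trained to minimize; PPL is just its exponentiated, more interpretable form.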
## PPL as a Quantization Quality Metric
| Format | Llama 3 8B PPL (WikiText-2) | Degradation vs FP16 |
|---|---|---|
| FP16 | 6.14 | — |
| Q8_0 | 6.17 | +0.5% |
| Q6_K | 6.20 | +1.0% |
| Q5_K_M | 6.25 | +1.8% |
| Q4_K_M | 6.35 | +3.4% |
| Q3_K_M | 6.73 | +9.6% |
| Q2_K | 8.10 | +32% |
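The degradation column is just the relative PPL increase over the FP16 baseline. A quick sketch reproducing it from the table's values (the table rounds Q2_K to +32%):

```python
# PPL values copied from the table above (Llama 3 8B, WikiText-2).
fp16 = 6.14
quantized = {"Q8_0": 6.17, "Q6_K": 6.20, "Q5_K_M": 6.25,
             "Q4_K_M": 6.35, "Q3_K_M": 6.73, "Q2_K": 8.10}

for fmt, ppl in quantized.items():
    degradation = (ppl / fp16 - 1) * 100  # percent increase vs FP16
    print(f"{fmt}: +{degradation:.1f}%")
```

Note the non-linearity: each step down from Q8 to Q4 costs only a percent or two, but below 4 bits the degradation accelerates sharply.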
## Limitations of PPL as a Quality Proxy
PPL measures general language fluency, not task competence. A 3% perplexity increase from Q4 quantization may cause no perceptible degradation on most chat or instruction-following tasks, whereas a 30% increase (Q2) visibly hurts coherence. For task-specific evaluation, always measure on your actual use case (benchmark accuracy, ROUGE, human preference) rather than relying solely on PPL.
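As one concrete example of a task-specific metric, ROUGE-1 scores summarization output by unigram overlap with a reference. A minimal, dependency-free sketch (real evaluations typically use an established package rather than hand-rolled scoring):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall
    between a candidate summary and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared unigrams, with counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Running such a metric on FP16 versus quantized outputs of your own workload tells you directly whether a given PPL delta matters in practice.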