TruthfulQA (Lin et al., 2022) probes whether language models generate truthful answers or confidently repeat common human misconceptions, pseudoscience, conspiracy theories, and urban legends. It operationalises truthfulness as "only asserting things the model has reason to believe are true."
## Structure
| Property | Detail |
|---|---|
| Questions | 817 (38 categories) |
| Task type | Open-ended generation (scored by a fine-tuned "GPT-judge" model in the original paper; GPT-4 is often used as the judge in modern harnesses) OR multiple choice (MC1/MC2; see the scoring sketch below) |
| Categories | Conspiracies, misconceptions, health, law, finance, fiction, politics |
| Metric | % Truthful and % Informative, with the headline figure being the fraction of answers that are both (True + Informative) |
The "Imitative Falsehood" Problem
Larger models trained on more human text can actually score lower on TruthfulQA, because they more faithfully reproduce popular but false human beliefs. GPT-3 175B scored worse than GPT-3 6.7B on this benchmark at release, a counter-intuitive result now commonly called inverse scaling.
## Scores (MC1, 0-shot)
| Model | MC1 accuracy (%) |
|---|---|
| Human | 94 |
| GPT-4o | 86.8 |
| Claude 3 Opus | 88.5 |
| Llama 3.1 70B | 82.1 |
| Llama 3.1 8B | 69.3 |
## Why It Matters for On-Premise
In enterprise or professional on-premise deployments, confidently stated false answers (hallucinations) carry real business risk. A low TruthfulQA score in your on-premise model should be mitigated with retrieval-augmented generation (RAG) that grounds answers in verified documents, or with output review workflows for high-stakes decisions, as sketched below.