TruthfulQA

Benchmark

817 questions designed to elicit false answers from LLMs — covering conspiracy theories, misconceptions, and myths. Measures a model's truthfulness rather than its knowledge breadth.

TruthfulQA (Lin et al., 2022) probes whether language models generate truthful answers or confidently repeat common human misconceptions, pseudoscience, conspiracy theories, and urban legends. It operationalises truthfulness as "only asserting things the model has reason to believe are true."

Structure

PropertyDetail
Questions817 (38 categories)
Task typeOpen-ended generation (evaluated by GPT-4 judge) OR multiple choice (MC1/MC2)
CategoriesConspiracies, misconceptions, health, law, finance, fiction, politics
Metric% Truthful × % Informative (jointly)

The "Imitative Falsehood" Problem

Larger models trained on more human text can actually score lower on TruthfulQA — because they more faithfully replicate popular but false human beliefs. GPT-3 175B scored worse than GPT-3 6.7B on this benchmark at release, a counter-intuitive scaling failure.

Scores (MC1, 0-shot)

ModelTruthful (%)
Human94%
GPT-4o86.8%
Claude 3 Opus88.5%
Llama 3.1 70B82.1%
Llama 3.1 8B69.3%

Why It Matters for On-Premise

In enterprise or professional on-premise deployments, false confident answers (hallucinations) carry real business risk. Low TruthfulQA scores in your on-premise model should be compensated with RAG that grounds answers in verified documents, or with output review workflows for high-stakes decisions.