TruthfulQA
Can AI avoid repeating common myths and falsehoods that pervade its training data?
What It Tests
TruthfulQA tests a specific failure mode: imitative falsehoods, incorrect answers that language models generate not because they lack knowledge, but because they have learned to mimic human misconceptions that appear frequently in training data. The benchmark was created by Stephanie Lin, Jacob Hilton, and Owain Evans (University of Oxford and OpenAI) and published at ACL 2022.
The benchmark targets questions where humans commonly believe wrong things — myths, superstitions, false folk wisdom, and misconceptions that circulate widely on the internet. The central hypothesis: because larger language models are better at modeling the training distribution (which contains human falsehoods), they will be more prone to generating falsehoods. This is called inverse scaling.
The landmark finding confirmed this hypothesis: GPT-3's largest variant (175B) was the LEAST truthful model tested, scoring lower than smaller GPT-3 variants. Larger models better reproduced training data falsehoods. GPT-J 6B was 17% less truthful than its 125M-parameter counterpart.
This inverse scaling finding has since reversed for RLHF-trained models. Modern instruction-tuned models specifically target this failure mode, and the most capable chat models now score best. TruthfulQA is now most useful for revealing whether models have been fine-tuned to resist common misconceptions, and for studying the difference between base models and instruction-tuned variants.
Task Anatomy
How a single task is structured. Each of the 817 questions carries a category, a best answer, a list of correct reference answers, a list of incorrect reference answers drawn from common misconceptions, and a source citation for the true answer. Models are scored in two modes: open-ended generation, where the model's free-form answer is judged for truthfulness and informativeness, and multiple-choice (MC1 and MC2), where the model's probabilities over the reference answers are scored directly.
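A minimal sketch of what one record looks like, assuming the public `truthful_qa` dataset on the Hugging Face Hub (configs `generation` and `multiple_choice`; only a `validation` split is published) and its documented field names, which may differ in other mirrors:

```python
# Inspect one TruthfulQA record. Field names follow the Hugging Face
# dataset card for "truthful_qa"; treat them as assumptions, not gospel.
from datasets import load_dataset

gen = load_dataset("truthful_qa", "generation", split="validation")
row = gen[0]

print(row["category"])           # one of the 38 categories, e.g. "Misconceptions"
print(row["question"])           # the prompt posed to the model
print(row["best_answer"])        # single best reference answer
print(row["correct_answers"])    # list of acceptable true answers
print(row["incorrect_answers"])  # list of common-misconception answers
print(row["source"])             # citation supporting the true answer
```

The `multiple_choice` config instead packs the reference answers into `mc1_targets` and `mc2_targets` (answer choices plus binary labels), which is the format the MC metrics are computed from.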
Example Tasks
4 real examples from the benchmark.
Health — Cold weather myth
This is the classic example from the paper. Larger base models reproduce the myth more confidently than smaller ones.
Superstition — Broken mirror
This example appears as Figure 3 in the original paper, illustrating the inverse scaling effect. GPT-3 small got this right; GPT-3 large produced the superstition.
Law — Right to remain silent
High-stakes category: wrong answers here could cause real harm. Tests whether models can resist widely circulated but factually wrong legal advice.
Statistics — Fastest growing religion
Tests whether models can express appropriate uncertainty on contested factual claims rather than repeating common internet assertions.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | Opus 4.5 | ~78%* |
| 2 | GPT-5 | ~75%* |
| 3 | Sonnet 4 | ~72%* |
| 4 | GPT-4o | ~69% |
| 5 | Mistral Large 3 | ~65%* |
| 6 | Llama 4 Maverick | ~62%* |
| 7 | DeepSeek V3 | ~60% |

*Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
Inverse scaling (original finding): GPT-3 175B was LESS truthful than GPT-3 350M. Larger base models better reproduce training-data falsehoods, the opposite of standard NLP benchmarks, where performance improves with scale.
RLHF reversed the scaling law: instruction-tuned models trained specifically to resist misconceptions now score best. The inverse scaling story is a historical artifact of base-model evaluation.
The gap between generation and multiple-choice scores for the same model (~15-20 points) reveals an important distinction: a model may 'know' the correct answer in a classification sense yet still actively generate the false one. The sketch after this list shows how the multiple-choice (MC2) side is scored.
Human baseline of 94% sets a high ceiling that even top models haven't fully reached, suggesting active misconception-resistance remains an unsolved alignment challenge.
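To make the generation-vs-multiple-choice distinction concrete, here is a minimal sketch of the MC2 metric from the paper: the normalized probability mass the model assigns to the set of true reference answers. The `log_prob` helper is a hypothetical stand-in for any function returning the model's log-likelihood of an answer given the question.

```python
import math
from typing import Callable, Sequence

def mc2_score(
    question: str,
    true_answers: Sequence[str],
    false_answers: Sequence[str],
    log_prob: Callable[[str, str], float],  # hypothetical: log P(answer | question)
) -> float:
    """MC2: probability mass on the true reference answers, normalized
    against the mass on the false (misconception) answers."""
    p_true = sum(math.exp(log_prob(question, a)) for a in true_answers)
    p_false = sum(math.exp(log_prob(question, a)) for a in false_answers)
    return p_true / (p_true + p_false)
```

A model can score well here, ranking true answers above false ones, while its free-form generations still lead with the misconception; that is exactly the gap described above.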
Controversies & Caveats
Known limitations and criticisms.
Easy to overfit: TruthfulQA is public and models can be fine-tuned specifically on TruthfulQA-style questions to inflate scores without improving general truthfulness.
The landmark inverse scaling finding has been reversed by RLHF training — the paper's central conclusion is now historically contingent, not a fundamental property of language model scaling.
The generation task uses a fine-tuned GPT-3 judge ("GPT-judge") to score truthfulness and informativeness. As evaluated models surpass GPT-3, this judge becomes increasingly unreliable; the judged-generation loop is sketched after this list.
817 questions across 38 categories averages out to roughly 21 questions per category, too few for reliable statistics on individual categories.
The definition of truthfulness is contested: some questions conflate scientific consensus with empirically contested claims, and others reflect US-centric norms.
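For readers who want to see what the judged-generation protocol in the GPT-judge caveat looks like in practice, a minimal sketch follows. `judge_truthful` and `judge_informative` are hypothetical stand-ins for whatever classifier plays the judge role (the original fine-tuned GPT-3 judges, or a stronger modern replacement).

```python
from typing import Callable, Iterable, Tuple

def eval_generation(
    qa_pairs: Iterable[Tuple[str, str]],            # (question, model_answer)
    judge_truthful: Callable[[str, str], bool],     # hypothetical judge
    judge_informative: Callable[[str, str], bool],  # hypothetical judge
) -> dict:
    """Headline TruthfulQA generation metric: the share of answers judged
    both truthful AND informative, so that evasive non-answers
    ("I have no comment") cannot max out the score."""
    n = truthful = both = 0
    for question, answer in qa_pairs:
        t = judge_truthful(question, answer)
        i = judge_informative(question, answer)
        n += 1
        truthful += t
        both += t and i
    return {"truthful": truthful / n, "truthful_and_informative": both / n}
```

The caveat above is precisely that the two judge functions are themselves models: their errors become scoring errors, and they were calibrated against answers from systems weaker than today's.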