HellaSwag
Commonsense reasoning — pick the most plausible continuation of an everyday activity.
What It Tests
HellaSwag tests commonsense inference: given a short description of an everyday activity, which of four candidate continuations is most plausible? Created by Rowan Zellers and colleagues at the University of Washington and published at ACL 2019, HellaSwag was notable for its adversarial construction method.
Wrong answer choices are machine-generated by a language model and then adversarially filtered: only endings that fool discriminator classifiers while remaining obviously wrong to humans are kept. (The name extends SWAG — Situations With Adversarial Generations — with "Harder endings, Longer contexts, and Low-shot activities.") This adversarial filtering made the benchmark substantially harder for the NLP systems of 2019 while staying easy for humans (95.6% accuracy).
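The filtering loop described above can be sketched in miniature. This is a toy illustration, not the paper's implementation: the real pipeline iteratively retrains an ensemble of discriminators on the surviving candidates, whereas here the discriminator is a stub heuristic and all names are hypothetical.

```python
# Toy sketch of Adversarial Filtering (AF): generate candidate distractors,
# score them with a discriminator, and keep only the ones the discriminator
# fails to flag as machine-generated (the "hard negatives").

def discriminator_score(ending: str) -> float:
    """Stub standing in for a trained classifier (e.g. a BERT-style model).
    Higher = 'looks machine-generated'. Hypothetical heuristic only."""
    return 1.0 if "zzz" in ending else 0.3

def adversarial_filter(candidates, threshold=0.5):
    """Keep only distractors whose discriminator score falls below the
    threshold, i.e. the endings that still fool the discriminator."""
    return [c for c in candidates if discriminator_score(c) < threshold]

cands = ["he plays a melody zzz", "he closes the lid and walks away"]
print(adversarial_filter(cands))  # only the ending the stub cannot flag survives
```

In the real pipeline this select-and-retrain cycle repeats until the discriminator's accuracy on the surviving distractors plateaus, which is what makes the final wrong answers stylistically indistinguishable from the gold ending.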
HellaSwag was historically important: BERT scored only 40.2% at launch while humans scored 95.6% — a dramatic 55-point gap that demonstrated language models lacked basic commonsense understanding. The benchmark became one of the core evaluations on the Hugging Face Open LLM Leaderboard.
HellaSwag is now fully saturated. Top models score 95-96%, essentially matching human performance. It provides no useful signal for frontier model comparison but remains a component of many evaluation suites for legacy reasons.
Task Anatomy
How a single task is structured.
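A minimal sketch of a single task and how harnesses typically score it for base models: pick the ending with the highest length-normalized log-likelihood given the context. The item's field names (`ctx`, `endings`, `label`) follow the public dataset release; the example text and log-likelihood values below are illustrative, not real model outputs.

```python
# One HellaSwag-style item: a context, four endings, and the gold index.
# The ending texts here are invented for illustration.
item = {
    "ctx": "A man is standing in front of a piano. He",
    "endings": [
        "sits down and begins to play a song.",        # gold
        "throws the piano into a swimming pool.",
        "is eaten by the piano's keys.",
        "turns into a flock of birds mid-sentence.",
    ],
    "label": 0,
}

def pick_ending(logliks, lengths):
    """Choose the ending with the highest length-normalized log-likelihood.
    A real harness would sum each ending's token log-probs under the model,
    conditioned on ctx; normalizing by length avoids penalizing long endings."""
    normalized = [ll / n for ll, n in zip(logliks, lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)

# Toy scores standing in for a real LM's log-likelihoods (assumed values).
logliks = [-12.3, -48.1, -51.7, -60.2]
lengths = [8, 7, 7, 7]
print(pick_ending(logliks, lengths) == item["label"])  # True for this toy item
```

Chat-tuned frontier models are instead usually prompted with the four options as a multiple-choice question, but the underlying item structure is the same.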
Example Tasks
2 real examples from the benchmark.
Piano playing
Adversarially generated distractors (B, C, D) are grammatically correct and vaguely related but physically/contextually implausible. Humans find this trivial; 2019 NLP systems struggled.
Cooking — Harder
Tests procedural knowledge about everyday activities — knowledge that is second nature for humans but was poorly represented in early NLP training data.
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 96.4%* |
| 2 | Opus 4.5 | ~96%* |
| 3 | DeepSeek V3 | 95.8%* |
| 4 | GPT-4o | 95.7%* |
| 5 | Llama 4 Maverick | ~95%* |
| 6 | Qwen 3 72B | ~95%* |

\* Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
BERT scoring 40.2% vs. human 95.6% at launch in 2019 was a landmark demonstration that NLP systems lacked basic commonsense reasoning — a benchmark result that influenced the field's research direction.
The adversarial filtering methodology showed that traditional benchmark construction (hand-crafted wrong answers) was too easy — systems could solve questions by matching surface style rather than by understanding meaning.
Within five years, the 55-point human-AI gap closed completely. That pace of progress makes HellaSwag a historical artifact rather than an active challenge.
Controversies & Caveats
Known limitations and criticisms.
GoldenSwag: researchers found that some HellaSwag "wrong" answers are actually plausible, and some "correct" answers are questionable — evidence that the adversarial construction methodology is imperfect.
ACL 2025 papers on HellaSwag validity showed that for frontier models, performance is essentially at human ceiling and differences between models at >95% are within annotation noise.
Still appears on most model cards and leaderboards for legacy/comparability reasons even though it provides no frontier discrimination.