Reasoning · Saturated

HellaSwag

Commonsense reasoning — pick the most plausible continuation of an everyday activity.

Tasks: 70,000
Year: 2019
Creator: Rowan Zellers, Ari Holtzman, Yonatan Bisk, et al.
Metric: Accuracy
Human Baseline: 95.6%
Random Chance: 25%

What It Tests

HellaSwag tests commonsense inference: given a short description of an everyday activity, which of 4 sentence continuations is most plausible? Created by Rowan Zellers and colleagues at the University of Washington and published at ACL 2019, HellaSwag was notable for its adversarial construction method.

Wrong answer choices are machine-generated and then adversarially filtered: discriminator models keep only the endings that fool classifiers yet are obviously wrong to humans (HellaSwag is the harder successor to the earlier SWAG benchmark). This adversarial filtering made the benchmark substantially harder for the NLP systems of 2019 while remaining easy for humans (95.6% accuracy).
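To make the construction idea concrete, here is a minimal sketch of adversarial filtering, assuming a pool of machine-generated candidate endings and some plausibility-scoring discriminator. The Item class, adversarial_filter function, and the toy word-overlap discriminator are illustrative stand-ins, not the paper's actual pipeline, which iterates generation and filtering with trained generator and discriminator models.

```python
# Illustrative sketch of adversarial filtering (hypothetical names, toy scorer).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Item:
    context: str
    gold_ending: str
    candidates: List[str]          # machine-generated wrong endings

def adversarial_filter(items: List[Item],
                       discriminator: Callable[[str, str], float],
                       keep: int = 3) -> List[Item]:
    """For each item, keep the distractors the discriminator finds most plausible,
    i.e. the ones most likely to fool a classifier. (In the real pipeline, humans
    then verify that the surviving distractors are still obviously wrong.)"""
    out = []
    for it in items:
        ranked = sorted(it.candidates,
                        key=lambda e: discriminator(it.context, e),
                        reverse=True)
        out.append(Item(it.context, it.gold_ending, ranked[:keep]))
    return out

def toy_discriminator(context: str, ending: str) -> float:
    # Toy plausibility score: fraction of ending words that also appear in the context.
    ctx_words = set(context.lower().split())
    end_words = ending.lower().rstrip(".").split()
    return sum(w in ctx_words for w in end_words) / max(len(end_words), 1)

demo = Item(
    context="A woman sits at a piano.",
    gold_ending="She sets her fingers on the keys.",
    candidates=[
        "She bends down to tie her shoes.",
        "She claps her hands together.",
        "She opens a book on the stand.",
        "The piano explodes into confetti.",
    ],
)
print(adversarial_filter([demo], toy_discriminator)[0].candidates)
```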

HellaSwag was historically important: at launch, BERT scored only 40.2% while humans scored 95.6%, a dramatic 55-point gap that demonstrated language models lacked basic commonsense understanding. The benchmark later became one of the core evaluations on the original Hugging Face Open LLM Leaderboard.

HellaSwag is now fully saturated. Top models score 95-96%, essentially matching human performance. It provides no useful signal for frontier model comparison but remains a component of many evaluation suites for legacy reasons.

Task Anatomy

How a single task is structured.

Input: A 1-2 sentence context describing an everyday activity (from WikiHow or ActivityNet Captions), followed by 4 possible sentence continuations. Wrong choices are adversarially generated to fool NLP classifiers.
Output: Model selects the most plausible continuation.
Evaluation: Accuracy. Random baseline = 25%. Human performance = 95.6%.
Metric: Accuracy
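As a sketch of how this accuracy is typically measured in practice, the snippet below scores each of the four endings by length-normalized log-likelihood under a language model and predicts the highest-scoring one, the standard likelihood-based protocol used by common evaluation harnesses. GPT-2 is used purely as a small stand-in model, the dataset id and field names ("ctx", "endings", "label") follow the Hugging Face "hellaswag" dataset, and the prefix-tokenization assumption noted in the comments is an approximation; treat this as an illustrative sketch, not an official scoring script.

```python
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def ending_logprob(context: str, ending: str) -> float:
    """Length-normalized log-probability of the ending tokens given the context.
    Assumes the context's tokens form a prefix of the full sequence's tokens,
    which typically holds for BPE tokenizers when the ending starts with a space."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # predicts tokens 1..T-1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    ending_lp = token_lp[:, ctx_len - 1:]                  # keep only the ending's tokens
    return (ending_lp.sum() / ending_lp.numel()).item()

ds = load_dataset("hellaswag", split="validation[:50]")    # small slice for a quick check
correct = 0
for ex in ds:
    scores = [ending_logprob(ex["ctx"], e) for e in ex["endings"]]
    pred = max(range(4), key=lambda i: scores[i])
    correct += int(pred == int(ex["label"]))
print(f"accuracy on 50 items: {correct / len(ds):.2%} (random chance = 25%)")
```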

Example Tasks

2 real examples from the benchmark.

#1

Piano playing

Easy for humans

Problem / Input

Context: A woman sits at a piano. Which continuation is most plausible?
A) She sets her fingers on the keys.
B) She bends down to tie her shoes.
C) She opens a book on the stand.
D) She claps her hands together.
Answer: A) She sets her fingers on the keys.

Adversarially generated distractors (B, C, D) are grammatically correct and vaguely related but physically/contextually implausible. Humans find this trivial; 2019 NLP systems struggled.

#2

Cooking — Harder

Medium (for 2019 systems)

Problem / Input

Context: A person is making pasta. They fill a large pot with water and put it on the stove. They turn the burner on high. Which continuation is most plausible?
A) They begin shredding cheese into the empty pot.
B) They wait for the water to boil before adding the pasta.
C) They add the pasta immediately while the water is cold.
D) They pour the water into the sink to drain.
Answer: B) They wait for the water to boil before adding the pasta.

Tests procedural knowledge about everyday activities — knowledge that is common for humans but wasn't well-represented in early NLP training data.
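For reference, the cooking example above maps onto the same record layout used in the evaluation sketch earlier. The field names follow the Hugging Face "hellaswag" schema; the record itself is transcribed from the example text above, not copied from the released data.

```python
# The cooking example, written as a HellaSwag-style record (illustrative transcription).
example = {
    "ctx": "A person is making pasta. They fill a large pot with water and put it "
           "on the stove. They turn the burner on high.",
    "endings": [
        "They begin shredding cheese into the empty pot.",
        "They wait for the water to boil before adding the pasta.",
        "They add the pasta immediately while the water is cold.",
        "They pour the water into the sink to drain.",
    ],
    "label": "1",  # zero-indexed: the second ending is the gold continuation
}
```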

Leaderboard Results

Model scores sorted by performance.

6 results

#   Model               Score
1   GPT-5               96.4% *
2   Opus 4.5            ~96% *
3   DeepSeek V3         95.8% *
4   GPT-4o              95.7% *
5   Llama 4 Maverick    ~95% *
6   Qwen 3 72B          ~95% *

* Self-reported by the model's creator, not independently verified

Score Over Time

Performance progression across model generations.

Key Findings

  • BERT scoring 40.2% vs. human 95.6% at launch in 2019 was a landmark demonstration that NLP systems lacked basic commonsense reasoning — a benchmark result that influenced the field's research direction.

  • Adversarial filtering methodology proved that traditional benchmark construction (hand-crafted wrong answers) was too easy — systems could solve questions via style matching rather than meaning.

  • Within 5 years, the 55-point human-AI gap completely closed. The speed of progress makes HellaSwag a historical artifact rather than an active challenge.

Controversies & Caveats

Known limitations and criticisms.

GoldenSwag: researchers found that some HellaSwag 'wrong' answers are actually plausible, and some 'correct' answers are questionable. The adversarial construction methodology is imperfect.

ACL 2025 papers on HellaSwag validity showed that for frontier models, performance is essentially at human ceiling and differences between models at >95% are within annotation noise.

Still appears on most model cards and leaderboards for legacy/comparability reasons even though it provides no frontier discrimination.

Links