classic
Pending curation

ARC-Challenge + HellaSwag

Two classic, saturated benchmarks: ARC tests grade-school science reasoning (2,590 questions), and HellaSwag tests commonsense sentence completion (10,000+ questions). Frontier models now score roughly 95–98% accuracy on both.

Why our crawl picked it up

Notes the discovery agent wrote when proposing this benchmark.

(no notes recorded)

This entry was added by an automated crawl and hasn't been curated yet. Once it has been reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.