reasoning · NEW · Pending curation
HLE (Humanity's Last Exam)
2,500 expert-crafted academic questions spanning math, natural sciences, and humanities, designed to be Google-proof and exceed frontier model capabilities. Published in Nature (Jan 2025).
Year: 2025
Why our crawl picked it up
Notes the discovery agent wrote when proposing this benchmark.
Directly addresses the saturation of MMLU, where AI accuracy now exceeds 90%. Questions are written by domain-expert contributors; they are unambiguous, yet their answers cannot be retrieved online. State-of-the-art models score below 45%, making it one of the most discriminative frontier benchmarks available. A multimodal variant is included.
Source
This entry was added by an automated crawl and hasn't been curated yet. Once it's reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.