reasoning · NEW · Pending curation

HLE (Humanity's Last Exam)

2,500 expert-crafted academic questions spanning math, natural sciences, and humanities, designed to be Google-proof and to exceed the capabilities of frontier models. Published in Nature (Jan 2025).

Year: 2025

Why our crawl picked it up

Notes the discovery agent wrote when proposing this benchmark.

Directly addresses the saturation of MMLU (AI accuracy above 90%). Questions are written by domain experts and are unambiguous, yet their answers cannot be retrieved online. State-of-the-art models score below 45%, making it one of the most discriminative frontier benchmarks available. A multi-modal variant is included.

Source

Primary source ↗

This entry was added by an automated crawl and hasn't been curated yet. Once it's reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.