Benchmarks
Every major LLM benchmark explained — what it tests, how tasks work, and where models stand.
AgentBench
LLM agents across 8 interactive environments: OS, databases, web, games, and more.
AIME
Annual olympiad-qualifier math competition whose newly released problems serve as a fresh, contamination-resistant AI benchmark.
ARC-Challenge
Grade-school science questions filtered so that both retrieval and word co-occurrence baselines get them wrong.
BIG-Bench Hard
23 hard BIG-Bench reasoning tasks where models beat average human raters only when prompted with chain-of-thought.
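As a rough illustration of what that means in practice, here is a minimal sketch contrasting a direct-answer prompt with a zero-shot chain-of-thought prompt on a BBH-style date question; the official harness uses few-shot prompts with handwritten reasoning chains, and the task text below is made up for illustration.

```python
# Sketch only: direct answering vs. chain-of-thought prompting on a
# BBH-style task. The real BBH setup is few-shot with handwritten
# reasoning chains; this zero-shot version just shows the idea.

TASK = "Q: Today is 12/31/2020. What is the date one week from today in MM/DD/YYYY?"

direct_prompt = TASK + "\nA:"
cot_prompt = TASK + "\nA: Let's think step by step."

def extract_answer(completion: str) -> str:
    # Grading compares the model's final answer string against the target.
    parts = completion.strip().split()
    return parts[-1].rstrip(".") if parts else ""
```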
GAIA
Multi-step real-world tasks that are conceptually simple for humans but require tool-using agents.
GPQA Diamond
PhD-level science questions designed to be "Google-proof": even skilled non-experts with unrestricted web access struggle.
GSM8K
Grade-school math word problems requiring 2-8 step arithmetic reasoning.
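To make the multi-step arithmetic concrete: GSM8K reference solutions end in a `#### <number>` line, so grading typically extracts the last number from the model's output and compares it exactly. A minimal sketch with a made-up item, not the official harness:

```python
import re

# Illustrative GSM8K-style item (not from the dataset). Grading is exact
# match on the final number extracted from the model's output.
problem = (
    "A box holds 12 pencils. Maya buys 3 boxes and gives away 7 pencils. "
    "How many pencils does she have left?"
)
reference_answer = 29  # 12 * 3 = 36; 36 - 7 = 29

def extract_final_number(text: str) -> float | None:
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

model_output = "3 boxes hold 36 pencils. 36 - 7 = 29. The answer is 29."
assert extract_final_number(model_output) == reference_answer
```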
HellaSwag
Commonsense reasoning — pick the most plausible continuation of an everyday activity.
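HellaSwag is usually scored by likelihood rather than generation: the model assigns a log-probability to each of four candidate endings given the context, and the highest-scoring (often length-normalized) ending counts as its answer. A minimal sketch, assuming a `sequence_logprob` helper from your model API; the example item is made up:

```python
# Sketch of likelihood-based multiple-choice scoring as used for HellaSwag.
# `sequence_logprob(context, continuation)` is an assumed helper returning the
# model's total log-probability of `continuation` given `context`.

def pick_ending(context: str, endings: list[str], sequence_logprob) -> int:
    scores = []
    for ending in endings:
        lp = sequence_logprob(context, ending)
        scores.append(lp / max(len(ending.split()), 1))  # crude length normalization
    return max(range(len(endings)), key=lambda i: scores[i])

# Illustrative item shape (not from the dataset):
context = "A man is standing on a ladder cleaning gutters. He"
endings = [
    "scoops leaves out and drops them into a bucket below.",
    "dives into the gutter and swims away.",
    "turns the ladder into a guitar and starts playing.",
    "eats the gutter piece by piece.",
]
```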
HumanEval / HumanEval+
Python function completion from docstrings, evaluated by test execution.
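A HumanEval-style problem is a function signature plus docstring; the model fills in the body, and the completion is scored by executing hidden unit tests, reported as pass@k over sampled completions. A sketch of that shape with an illustrative task rather than one from the dataset:

```python
# Illustrative HumanEval-style task (not from the dataset): the model sees the
# signature and docstring, generates the body, and the result is checked by
# running unit tests.

PROMPT = '''
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i + 1]."""
'''

def check(candidate) -> bool:
    return (
        candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
        and candidate([]) == []
        and candidate([2, 2, 1]) == [2, 2, 2]
    )

# A correct completion makes check(...) return True:
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i + 1]."""
    out, cur = [], None
    for x in xs:
        cur = x if cur is None else max(cur, x)
        out.append(cur)
    return out

assert check(running_max)
```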
LiveBench
Contamination-resistant benchmark refreshed monthly from recent sources with no LLM judge.
LiveCodeBench
Contamination-resistant coding benchmark using freshly released competition problems.
LMSYS Chatbot Arena
Crowdsourced human preference Elo ratings from millions of real user comparisons.
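The leaderboard fits ratings from these pairwise votes; the current pipeline uses a Bradley-Terry style fit, but the classic online Elo update below conveys the idea: each vote nudges the winner's rating up and the loser's down in proportion to how surprising the result was.

```python
# Sketch of an Elo-style update from pairwise human votes (the live leaderboard
# fits a Bradley-Terry model over all votes, but the intuition is the same).

def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    e_w = expected_score(r_winner, r_loser)
    return r_winner + k * (1 - e_w), r_loser - k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
```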
MATH Benchmark
Competition-level math problems across 7 subjects, from AMC to AIME difficulty.
MMLU / MMLU-Pro
Broad academic knowledge across 57 subjects, the standard knowledge benchmark; MMLU-Pro adds harder, reasoning-focused questions with ten answer choices.
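Evaluation typically formats each item as a lettered multiple-choice prompt and checks which letter the model picks. A minimal sketch with a made-up item:

```python
import re
import string

# Sketch of MMLU-style multiple-choice formatting and grading. The item below
# is made up; real MMLU items have four choices, MMLU-Pro items up to ten.

def format_item(question: str, choices: list[str]) -> str:
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    m = re.search(r"\b([A-J])\b", completion)
    return m.group(1) if m else None

prompt = format_item(
    "Which planet has the strongest surface gravity?",
    ["Mars", "Earth", "Jupiter", "Mercury"],
)
assert extract_choice(" C. Jupiter") == "C"
```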
SWE-bench
Can AI resolve real GitHub issues on production codebases?
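Each SWE-bench instance pairs a real GitHub issue with the repository at its base commit; the model proposes a patch, and grading applies the patch and runs the tests that must flip from failing to passing plus regression tests that must keep passing. A heavily simplified sketch of that loop, with placeholder paths and test IDs:

```python
import subprocess

# Heavily simplified sketch of SWE-bench-style grading: apply the model's
# patch to a checkout of the repo at the issue's base commit, then run the
# FAIL_TO_PASS tests (must start passing) and PASS_TO_PASS tests (must keep
# passing). Paths and test IDs are placeholders, not the official harness.

def resolved(repo_dir: str, patch_file: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    apply = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if apply.returncode != 0:
        return False
    for test_id in fail_to_pass + pass_to_pass:
        result = subprocess.run(["python", "-m", "pytest", test_id], cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True
```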
SWE-Lancer
Real Upwork freelance software tasks mapped to $1M in economic value.
TheAgentCompany
A simulated software company staffed by 16 AI colleagues, used to test agents on realistic office-work tasks.
TruthfulQA
Can AI avoid repeating common myths and falsehoods that pervade its training data?
WebArena / VisualWebArena
Autonomous browser agents completing realistic tasks on functional sandboxed websites.
τ-bench (tau-bench)
AI customer-service agents that must follow domain policy while helping simulated users in retail and airline scenarios.