Benchmarks
Every major LLM benchmark explained — what it tests, how tasks work, and where models stand.
AgentBench
LLM agents across 8 interactive environments: OS, databases, web, games, and more.
AIME
Annual olympiad-qualifier math competition whose newly released problems serve as a fresh, contamination-resistant AI benchmark.
ARC-Challenge
Grade-school science questions filtered so that both retrieval and word co-occurrence baselines get them wrong.
BIG-Bench Hard
23 hard BIG-Bench reasoning tasks where models beat average human raters only when prompted with chain-of-thought.
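As a rough illustration of what that means in practice, here is a minimal sketch contrasting a direct-answer prompt with a zero-shot chain-of-thought prompt on a BBH-style date question; the official harness uses few-shot prompts with handwritten reasoning chains, and the task text below is made up for illustration.

```python
# Sketch only: direct answering vs. chain-of-thought prompting on a
# BBH-style task. The real BBH setup is few-shot with handwritten
# reasoning chains; this zero-shot version just shows the idea.

TASK = "Q: Today is 12/31/2020. What is the date one week from today in MM/DD/YYYY?"

direct_prompt = TASK + "\nA:"
cot_prompt = TASK + "\nA: Let's think step by step."

def extract_answer(completion: str) -> str:
    # Grading compares the model's final answer string against the target.
    parts = completion.strip().split()
    return parts[-1].rstrip(".") if parts else ""
```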
GAIA
Multi-step real-world tasks that are conceptually simple for humans but require tool-using agents.
GPQA Diamond
PhD-level science questions designed to be "Google-proof": even skilled non-experts with unrestricted web access struggle.
GSM8K
Grade-school math word problems requiring 2-8 step arithmetic reasoning.
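To make the multi-step arithmetic concrete: GSM8K reference solutions end in a `#### <number>` line, so grading typically extracts the last number from the model's output and compares it exactly. A minimal sketch with a made-up item, not the official harness:

```python
import re

# Illustrative GSM8K-style item (not from the dataset). Grading is exact
# match on the final number extracted from the model's output.
problem = (
    "A box holds 12 pencils. Maya buys 3 boxes and gives away 7 pencils. "
    "How many pencils does she have left?"
)
reference_answer = 29  # 12 * 3 = 36; 36 - 7 = 29

def extract_final_number(text: str) -> float | None:
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

model_output = "3 boxes hold 36 pencils. 36 - 7 = 29. The answer is 29."
assert extract_final_number(model_output) == reference_answer
```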
HellaSwag
Commonsense reasoning — pick the most plausible continuation of an everyday activity.
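HellaSwag is usually scored by likelihood rather than generation: the model assigns a log-probability to each of four candidate endings given the context, and the highest-scoring (often length-normalized) ending counts as its answer. A minimal sketch, assuming a `sequence_logprob` helper from your model API; the example item is made up:

```python
# Sketch of likelihood-based multiple-choice scoring as used for HellaSwag.
# `sequence_logprob(context, continuation)` is an assumed helper returning the
# model's total log-probability of `continuation` given `context`.

def pick_ending(context: str, endings: list[str], sequence_logprob) -> int:
    scores = []
    for ending in endings:
        lp = sequence_logprob(context, ending)
        scores.append(lp / max(len(ending.split()), 1))  # crude length normalization
    return max(range(len(endings)), key=lambda i: scores[i])

# Illustrative item shape (not from the dataset):
context = "A man is standing on a ladder cleaning gutters. He"
endings = [
    "scoops leaves out and drops them into a bucket below.",
    "dives into the gutter and swims away.",
    "turns the ladder into a guitar and starts playing.",
    "eats the gutter piece by piece.",
]
```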
HumanEval / HumanEval+
Python function completion from docstrings, evaluated by test execution.
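A HumanEval-style problem is a function signature plus docstring; the model fills in the body, and the completion is scored by executing hidden unit tests, reported as pass@k over sampled completions. A sketch of that shape with an illustrative task rather than one from the dataset:

```python
# Illustrative HumanEval-style task (not from the dataset): the model sees the
# signature and docstring, generates the body, and the result is checked by
# running unit tests.

PROMPT = '''
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i + 1]."""
'''

def check(candidate) -> bool:
    return (
        candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
        and candidate([]) == []
        and candidate([2, 2, 1]) == [2, 2, 2]
    )

# A correct completion makes check(...) return True:
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i + 1]."""
    out, cur = [], None
    for x in xs:
        cur = x if cur is None else max(cur, x)
        out.append(cur)
    return out

assert check(running_max)
```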
LiveBench
Contamination-resistant benchmark refreshed monthly from recent sources with no LLM judge.
LiveCodeBench
Contamination-resistant coding benchmark using freshly released competition problems.
LMSYS Chatbot Arena
Crowdsourced human preference Elo ratings from millions of real user comparisons.
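The leaderboard fits ratings from these pairwise votes; the current pipeline uses a Bradley-Terry style fit, but the classic online Elo update below conveys the idea: each vote nudges the winner's rating up and the loser's down in proportion to how surprising the result was.

```python
# Sketch of an Elo-style update from pairwise human votes (the live leaderboard
# fits a Bradley-Terry model over all votes, but the intuition is the same).

def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    e_w = expected_score(r_winner, r_loser)
    return r_winner + k * (1 - e_w), r_loser - k * (1 - e_w)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"])
```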
MATH Benchmark
Competition-level math problems across 7 subjects, from AMC to AIME difficulty.
MMLU / MMLU-Pro
Broad academic knowledge across 57 subjects, the standard knowledge benchmark; MMLU-Pro adds harder, reasoning-focused questions with ten answer choices.
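Evaluation typically formats each item as a lettered multiple-choice prompt and checks which letter the model picks. A minimal sketch with a made-up item:

```python
import re
import string

# Sketch of MMLU-style multiple-choice formatting and grading. The item below
# is made up; real MMLU items have four choices, MMLU-Pro items up to ten.

def format_item(question: str, choices: list[str]) -> str:
    lines = [question]
    for letter, choice in zip(string.ascii_uppercase, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    m = re.search(r"\b([A-J])\b", completion)
    return m.group(1) if m else None

prompt = format_item(
    "Which planet has the strongest surface gravity?",
    ["Mars", "Earth", "Jupiter", "Mercury"],
)
assert extract_choice(" C. Jupiter") == "C"
```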
SWE-bench
Can AI resolve real GitHub issues on production codebases?
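Each SWE-bench instance pairs a real GitHub issue with the repository at its base commit; the model proposes a patch, and grading applies the patch and runs the tests that must flip from failing to passing plus regression tests that must keep passing. A heavily simplified sketch of that loop, with placeholder paths and test IDs:

```python
import subprocess

# Heavily simplified sketch of SWE-bench-style grading: apply the model's
# patch to a checkout of the repo at the issue's base commit, then run the
# FAIL_TO_PASS tests (must start passing) and PASS_TO_PASS tests (must keep
# passing). Paths and test IDs are placeholders, not the official harness.

def resolved(repo_dir: str, patch_file: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    apply = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if apply.returncode != 0:
        return False
    for test_id in fail_to_pass + pass_to_pass:
        result = subprocess.run(["python", "-m", "pytest", test_id], cwd=repo_dir)
        if result.returncode != 0:
            return False
    return True
```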
SWE-Lancer
Real Upwork freelance software tasks mapped to $1M in economic value.
TheAgentCompany
A simulated software company staffed by 16 AI colleagues, used to test agents on realistic office-work tasks.
TruthfulQA
Can AI avoid repeating common myths and falsehoods that pervade its training data?
WebArena / VisualWebArena
Autonomous browser agents completing realistic tasks on functional sandboxed websites.
τ-bench (tau-bench)
AI customer-service agents that must follow domain policy while helping simulated users in retail and airline scenarios.