ARC-Challenge
Grade-school science questions that simple retrieval systems can't answer.
What It Tests
ARC (AI2 Reasoning Challenge) was released by the Allen Institute for AI (AI2) in 2018. The benchmark tests grade-school science question answering (grades 3-9) drawn from standardized US tests. Its key design property: its hardest questions were selected specifically because both a simple retrieval-based system and a word co-occurrence algorithm failed to answer them correctly.
ARC comes in two sets: the Easy set (questions that at least one of the baseline methods answers correctly) and the Challenge set (2,590 questions that both a retrieval-based algorithm and a word co-occurrence algorithm answer incorrectly). The Challenge set is what people typically mean when they report 'ARC' scores.
When released in 2018, the best system scored only 33.7%, barely above the 25% random-guess baseline. The benchmark was designed to require genuine scientific reasoning, not just keyword matching. It rapidly became one of the core benchmarks on the Hugging Face Open LLM Leaderboard, alongside HellaSwag, MMLU, TruthfulQA, Winogrande, and GSM8K.
ARC-Challenge is now fully saturated. Top models score 95-98%, with o3 at 98.1%. The benchmark is retained primarily as a regression test and for evaluating smaller open-source models where it still discriminates (50-80% range). For frontier model comparison, it provides no useful signal.
Task Anatomy
How a single task is structured.
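Each ARC item is a standalone multiple-choice science question: a short natural-language stem, a set of lettered answer options (usually four, occasionally three or five), and a single gold answer key; the metric is plain accuracy. As a minimal sketch, assuming the Hugging Face `allenai/ai2_arc` dataset (config `ARC-Challenge`) and its `question` / `choices` / `answerKey` fields, one item can be inspected like this:

```python
# Minimal sketch: inspect one ARC-Challenge item.
# Assumes the Hugging Face "allenai/ai2_arc" dataset with the "ARC-Challenge" config;
# the field names (question, choices, answerKey) follow that dataset's schema.
from datasets import load_dataset

arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

item = arc[0]
print(item["question"])                       # natural-language science question
for label, text in zip(item["choices"]["label"], item["choices"]["text"]):
    print(f"  {label}) {text}")               # usually 4 options (A-D), occasionally 3 or 5
print("Correct answer:", item["answerKey"])   # single gold label; the metric is plain accuracy
```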
Example Tasks
2 real examples from the benchmark.
Ecology — Chipmunks and acorns
Problem / Input
Requires causal reasoning about ecological relationships, not just keyword matching. A retrieval system returning 'acorns chipmunks' wouldn't know which causal explanation is correct.
Physics — Shadow length
Problem / Input
A student placed a ruler on a table and shone a flashlight on it from directly above. Then the student shone the flashlight from the side. Which of the following describes the shadows made by the ruler in each case?
A) Both shadows were the same length.
B) The shadow from above was longer.
C) The shadow from the side was longer.
D) There were no shadows because the ruler is transparent.

Tests 3D spatial reasoning about light and shadows — difficult for early 2018 NLP systems but trivial for humans and modern LLMs.
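Scoring on items like this is plain multiple-choice accuracy. A common approach in evaluation harnesses is to have the model score each answer option and take the highest-scoring one as its prediction. The sketch below illustrates that idea with the shadow question; `toy_score` is a hypothetical stand-in for a real model's per-choice log-probability, not part of any official harness.

```python
# Hedged sketch of likelihood-style multiple-choice scoring, as commonly used by
# evaluation harnesses. toy_score() is a hypothetical stand-in for a language
# model's log-probability of a choice given the question.
from typing import Callable

def pick_answer(question: str, choices: dict[str, str],
                score: Callable[[str, str], float]) -> str:
    """Return the label of the answer option the scorer ranks highest."""
    return max(choices, key=lambda label: score(question, choices[label]))

question = ("A student placed a ruler on a table and shone a flashlight on it from "
            "directly above, then from the side. Which describes the shadows made "
            "by the ruler in each case?")
choices = {
    "A": "Both shadows were the same length.",
    "B": "The shadow from above was longer.",
    "C": "The shadow from the side was longer.",
    "D": "There were no shadows because the ruler is transparent.",
}

def toy_score(q: str, choice_text: str) -> float:
    # Demonstration-only scores; a real harness would query a model here.
    fake = {
        "Both shadows were the same length.": -4.1,
        "The shadow from above was longer.": -5.0,
        "The shadow from the side was longer.": -1.2,
        "There were no shadows because the ruler is transparent.": -7.3,
    }
    return fake[choice_text]

prediction = pick_answer(question, choices, toy_score)
print(prediction, prediction == "C")  # benchmark accuracy = mean of these comparisons
```

In practice, harnesses often length-normalize the per-choice scores so that longer answer options are not unfairly penalized.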
Leaderboard Results
Model scores sorted by performance.
| # | Model | Score |
|---|---|---|
| 1 | o3 | 98.1%* |
| 2 | GPT-5 | ~97%* |
| 3 | Opus 4.5 | ~96.5%* |
| 4 | GPT-4o | ~96%* |
| 5 | DeepSeek V3 | 95.2%* |
| 6 | Qwen 3 72B | ~94%* |
| 7 | Llama 4 Maverick | 92.3%* |

*Self-reported by the model's creator, not independently verified.
Score Over Time
Performance progression across model generations.
Key Findings
ARC-Challenge was genuinely hard in 2018 (best: 33.7%) and is now trivially solved (98%+), illustrating how rapidly the field has advanced on structured reasoning tasks.
Designed with a key methodological innovation: using retrieval system failure as a selection criterion for difficult questions, creating a more robust 'hard' partition than subjective difficulty ratings.
Still useful as a baseline floor check for open-source model evaluation where scores range from 40-90% across the model population.
Controversies & Caveats
Known limitations and criticisms.
Completely saturated for frontier models — no meaningful discrimination above 90%.
Some ARC-Challenge questions have contested or ambiguous correct answers, especially in borderline earth science / ecology topics.
Used as a component of the Hugging Face Open LLM Leaderboard alongside HellaSwag, MMLU, and others — useful for open-source model ranking even when frontier models have saturated it.