Reasoning · Saturated

ARC-Challenge

Grade-school science questions that simple retrieval systems can't answer.

Tasks: 2,590
Year: 2018
Creators: Peter Clark, Isaac Cowhey, Oren Etzioni, et al.
Metric: Accuracy
Random chance: 25%

What It Tests

ARC (AI2 Reasoning Challenge) was released by the Allen Institute for AI (AI2) in 2018. It tests question answering on grade-school science questions (grades 3-9) drawn from US standardized tests. Its key design property: questions were selected specifically because both simple retrieval systems and word co-occurrence algorithms failed to answer them correctly.

ARC has two splits: the Easy set (questions that the baseline retrieval and co-occurrence systems could answer) and the Challenge set (2,590 questions where both methods fail). When people report "ARC" scores, they almost always mean the Challenge set.

When released in 2018, the best system scored only 33.7% — barely above the 25% random baseline. The benchmark was designed to require genuine scientific reasoning, not just keyword matching. It rapidly became one of the core benchmarks on the Hugging Face Open LLM Leaderboard, alongside HellaSwag, MMLU, Winogrande, GSM8K, and TruthfulQA.

ARC-Challenge is now fully saturated. Top models score 95-98%, with o3 at 98.1%. The benchmark is retained primarily as a regression test and for evaluating smaller open-source models where it still discriminates (50-80% range). For frontier model comparison, it provides no useful signal.

Task Anatomy

How a single task is structured.

Input: A multiple-choice science question (four answer options) from US standardized science tests for grades 3-9. Questions were selected specifically because retrieval-based systems could not answer them.
Output: The model selects one answer choice (A, B, C, or D).
Evaluation: Accuracy; random baseline = 25%.
Metric: Accuracy
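Scoring this task anatomy is straightforward: exact-match accuracy over predicted answer letters. The sketch below is illustrative (the question data is a made-up toy set, not real ARC items); it also simulates a random guesser to show where the 25% baseline comes from.

```python
import random

def accuracy(predictions, gold):
    """Fraction of questions where the predicted letter matches the gold label."""
    assert len(predictions) == len(gold)
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Toy gold labels (hypothetical, not real ARC items).
gold = ["C", "A", "D", "B", "C", "A", "B", "D"]
assert accuracy(gold, gold) == 1.0  # a perfect scorer

# A uniform random guesser over 4 options converges to the 25% baseline.
rng = random.Random(0)
n = 100_000
guesses = [rng.choice("ABCD") for _ in range(n)]
gold_many = [rng.choice("ABCD") for _ in range(n)]
baseline = accuracy(guesses, gold_many)
print(f"simulated random baseline ~= {baseline:.3f}")
```

With 100,000 simulated questions, the guesser's score lands within a fraction of a percentage point of 0.25, which is why any score near 25% on ARC-Challenge indicates no reasoning ability at all.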

Example Tasks

2 real examples from the benchmark.

#1

Ecology — Chipmunks and acorns

Medium

Problem / Input

One year, the oak trees in a park began producing more acorns than usual. The next year, the population of chipmunks in the park also increased. Which of the following best explains why there were more chipmunks the following year?

A) There were more predators of chipmunks in the park.
B) The chipmunks ate more acorns and stored more food.
C) The chipmunks had more food sources, allowing more to survive.
D) The weather in the park changed, bringing more chipmunks.
Answer: C) The chipmunks had more food sources, allowing more to survive.

Requires causal reasoning about ecological relationships, not just keyword matching. A retrieval system returning 'acorns chipmunks' wouldn't know which causal explanation is correct.

#2

Physics — Shadow length

Hard (for 2018 systems)

Problem / Input

A student placed a ruler on a table and shone a flashlight on it from directly above. Then the student shone the flashlight from the side. Which of the following describes the shadows made by the ruler in each case?

A) Both shadows were the same length.
B) The shadow from above was longer.
C) The shadow from the side was longer.
D) There were no shadows because the ruler is transparent.
Answer: C) The shadow from the side was longer.

Tests 3D spatial reasoning about light and shadows — difficult for early 2018 NLP systems but trivial for humans and modern LLMs.
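Items like the two above are typically serialized into a flat prompt before being sent to a model. The snippet below is a minimal sketch of one such template; the field names (`question`, `labels`, `choices`) and the prompt layout are assumptions for illustration, not the official dataset schema or any particular harness's format.

```python
def format_arc_prompt(item):
    """Render an ARC-style multiple-choice item as a plain-text prompt.

    Layout is an illustrative convention: question, lettered options,
    then a trailing "Answer:" cue for the model to complete.
    """
    lines = [f"Question: {item['question']}"]
    for label, choice in zip(item["labels"], item["choices"]):
        lines.append(f"{label}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

# Example #2 from above, restated as a dict (paraphrased for brevity).
item = {
    "question": ("A student shone a flashlight on a ruler from directly "
                 "above, then from the side. Which shadow was longer?"),
    "labels": ["A", "B", "C", "D"],
    "choices": [
        "Both shadows were the same length.",
        "The shadow from above was longer.",
        "The shadow from the side was longer.",
        "There were no shadows because the ruler is transparent.",
    ],
}
print(format_arc_prompt(item))
```

The model's answer letter is then read off the completion and compared against the gold label for scoring.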

Leaderboard Results

Model scores sorted by performance.

7 results

#   Model             Score
1   o3                98.1%*
2   GPT-5             ~97%*
3   Opus 4.5          ~96.5%*
4   GPT-4o            ~96%*
5   DeepSeek V3       95.2%*
6   Qwen 3 72B        ~94%*
7   Llama 4 Maverick  92.3%*

* Self-reported by the model's creator, not independently verified.

Score Over Time

Performance progression across model generations.

Key Findings

  • ARC-Challenge was genuinely hard in 2018 (best: 33.7%) and is now trivially solved (98%+), illustrating how rapidly the field has advanced on structured reasoning tasks.

  • Designed with a key methodological innovation: using retrieval system failure as a selection criterion for difficult questions, creating a more robust 'hard' partition than subjective difficulty ratings.

  • Still useful as a baseline floor check for open-source model evaluation where scores range from 40-90% across the model population.

Controversies & Caveats

Known limitations and criticisms.

  • Completely saturated for frontier models — no meaningful discrimination above 90%.

  • Some ARC-Challenge questions have contested or ambiguous correct answers, especially in borderline earth science and ecology topics.

  • Used as a component of the Hugging Face Open LLM Leaderboard alongside HellaSwag, MMLU, and others — still useful for open-source model ranking even though frontier models have saturated it.
