Category: coding. Status: new, pending curation.
BigCodeBench
1,140 practical coding tasks that require calling diverse APIs from 139 libraries. Models score at most 60%, against a 97% human baseline. Accepted as an ICLR 2025 oral.
Year: 2024
Why our crawl picked it up
Notes the discovery agent wrote when proposing this benchmark.
Targets the realism gap in HumanEval/MBPP: it tests real-world library usage (data analysis, web dev, NLP tools) rather than self-contained algorithmic puzzles. Includes BigCodeBench-Hard and BigCodeBench-Instruct variants. Tasks average 5.6 test cases each, with 99% branch coverage.
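To make the contrast with algorithmic puzzles concrete, here is a rough sketch of what a library-usage task paired with unit tests might look like. It is a hypothetical illustration, not an item from the benchmark: the function name `task_func`, the CSV scenario, and the test class are all assumptions, and it uses only the Python standard library so it runs anywhere.

```python
import csv
import io
import statistics
import unittest


def task_func(csv_text: str, column: str) -> dict:
    """Return the mean and median of a numeric column in CSV text.

    Hypothetical task written for illustration only; it is not taken
    from the benchmark itself.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # fieldnames is None when the input has no header row at all.
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found")
    # Skip blank or missing cells rather than failing on them.
    values = [float(row[column]) for row in reader if row[column]]
    if not values:
        raise ValueError("no numeric values to summarize")
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


class TestTaskFunc(unittest.TestCase):
    """Several small cases, each aimed at a different branch of task_func."""

    def test_basic(self):
        result = task_func("name,score\na,1\nb,2\nc,6\n", "score")
        self.assertEqual(result, {"mean": 3.0, "median": 2.0})

    def test_blank_cells_are_skipped(self):
        result = task_func("name,score\na,\nb,4\n", "score")
        self.assertEqual(result, {"mean": 4.0, "median": 4.0})

    def test_missing_column(self):
        with self.assertRaises(ValueError):
            task_func("name,score\na,1\n", "age")

    def test_empty_input(self):
        with self.assertRaises(ValueError):
            task_func("", "score")

    def test_no_numeric_values(self):
        with self.assertRaises(ValueError):
            task_func("name,score\na,\n", "score")
```

Saved as a module, the tests can be run with `python -m unittest`. The point of the sketch is that the solution has to combine several library calls correctly, and the tests exercise the error and edge-case branches as well as the happy path.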
This entry was added by an automated crawl and hasn't been curated yet. Once it's reviewed and promoted into the bundled set, you'll see task anatomy, examples, scores, and richer context here.