BigLaw Bench
Real legal work quality — what percent of a lawyer-quality deliverable does AI produce?
Overview
BigLaw Bench was created by Harvey AI, a legal AI startup serving top BigLaw firms and Fortune 500 legal departments. The benchmark answers a single question: 'What percentage of a lawyer-quality work product does the model complete for the user?' — one tied directly to the commercial value of AI in legal practice.
Unlike academic benchmarks (LegalBench, bar exam questions) that test narrow legal reasoning in structured formats, BigLaw Bench evaluates open-ended, long-form professional legal work: drafting contracts, analyzing litigation filings, researching case law, reviewing due diligence documents. Tasks are derived from actual billable work performed by lawyers and evaluated against rubrics with both positive criteria (requirements met) and negative criteria (errors, hallucinations, omissions).
The scoring formula is novel: (positive points earned − negative points for errors) ÷ total positive points = quality score. This penalizes confident hallucinations — a model that produces fluent but wrong legal citations scores lower than one that acknowledges uncertainty.
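The formula can be sketched in a few lines. The rubric values below are hypothetical, chosen only to illustrate how the negative-point term rewards acknowledged uncertainty over confident error:

```python
def quality_score(positive_earned: float, negative_penalty: float,
                  positive_total: float) -> float:
    """BigLaw Bench-style quality score:
    (positive points earned - negative points for errors) / total positive points.
    """
    return (positive_earned - negative_penalty) / positive_total

# Hypothetical rubric with 10 positive points available.
# Model A meets 9 requirements but hallucinates a citation (2-point penalty).
# Model B meets only 8 requirements but flags its uncertainty (no penalty).
score_a = quality_score(9, 2, 10)  # 0.70
score_b = quality_score(8, 0, 10)  # 0.80
```

Under this scoring, the more cautious Model B outscores Model A despite meeting fewer requirements, which is the intended effect of the penalty term.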
BigLaw Bench has expanded significantly since launch. BLB: Global (early 2026) added UK, Australia, and Spain law in collaboration with Mercor. BLB: Research added hard agentic legal research problems requiring real search tools. BLB: Arena (November 2025) provides an Elo-based leaderboard where practicing lawyers choose between AI outputs. A 5x+ expansion is planned throughout 2026.
How It Works
Task setup, inputs, outputs, and evaluation.
Example Tasks
Real tasks from this evaluation system.
Transactional — Risk Assessment
Corporate Law
Review a 120-page Software Licensing Agreement and identify the top five contractual risk areas for the licensee.
Litigation — Opposition Brief Outline
Litigation
Review a motion for summary judgment in a securities fraud case and draft a 3-page opposition outline.
Research — BLB: Research (agentic)
Legal Research
Find the controlling 9th Circuit precedent on attorney-client privilege for in-house counsel communications. Identify the standard, key cases, and circuit splits.
Results
Model performance on this evaluation.
| # | Model | Score |
|---|---|---|
| 1 | GPT-5 | 89.22% |
| 2 | Opus 4.5 | ~87% |
| 3 | Gemini 2.5 Pro | 85.02% |
| 4 | o3 | 84.13% |
| 5 | GPT-4o | ~62% |
Key Findings
The jump from ~62% (GPT-4o era, 2024) to ~87-89% (GPT-5 era, 2025) is one of the fastest improvements documented on any professional work benchmark — suggesting legal drafting and analysis were a particularly high-leverage capability area.
Harvey AI's proprietary systems (trained on legal data and RLHF from lawyer feedback) outperform base GPT-5 by 10-15 percentage points on BigLaw Bench, demonstrating that domain-specific fine-tuning on legal data significantly improves professional quality.
Source Score is the most discriminating metric: models that hallucinate legal citations score dramatically lower than models that produce accurate citations or acknowledge uncertainty — a particularly important distinction in legal practice.
BLB: Arena (Elo computed from lawyer preference votes) shows Harvey's proprietary model is preferred over baseline GPT-5 70% of the time and over Claude 4.5 59%+ of the time — validating that domain-specific training provides quality improvements perceptible to expert practitioners.
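For intuition about the Arena leaderboard, a standard Elo update from a single pairwise preference vote can be sketched as below. The K-factor and starting ratings are assumptions for illustration; Harvey has not published its exact rating method:

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32) -> tuple[float, float]:
    """Standard Elo update for one pairwise preference vote.
    Returns the new (winner, loser) ratings."""
    # Expected score of the winner given the current rating gap.
    expected_winner = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_winner)
    return r_winner + delta, r_loser - delta

# Two models start at 1500; one lawyer votes for model A.
ra, rb = elo_update(1500, 1500)  # -> (1516.0, 1484.0)
```

Under this model, a stable 70% preference rate corresponds to a rating gap of roughly 147 Elo points (400 · log10(0.7/0.3)).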
What Makes It Unique
- ✓ Negative-point scoring: hallucinating legal citations actively hurts your score — unlike most benchmarks where wrong answers simply fail to earn points.
- ✓ Source accuracy as a distinct metric: legal work requires citations that are real, correctly quoted, and applicable. Separating 'answer quality' from 'citation accuracy' reveals model failure modes invisible on other benchmarks.
- ✓ Created by practitioners: Harvey AI serves actual BigLaw firms and Fortune 500 legal departments — tasks reflect what paying clients actually need, not what researchers think is legally interesting.
- ✓ International expansion via Mercor: BLB: Global adds UK, Australian, and Spanish law, demonstrating how domain benchmarks can scale through expert data partnerships.
Controversies & Caveats
- Harvey AI has a commercial incentive to benchmark legal AI — the benchmark validates their product category and serves as marketing for Harvey's own services.
- Limited public access to the actual benchmark tasks — methodology is described but tasks are not publicly released, limiting independent verification.
- Rubric scoring can vary by evaluator — legal quality judgments have inherent subjectivity, especially in complex transactional analysis.
- Harvey's own proprietary models are included in BLB: Arena evaluations, creating a conflict of interest in the evaluation that validates their commercial product.