Harvey AI just dropped its Legal Agent Benchmark (LAB), an open-source evaluation framework purpose-built for measuring AI agents on real-world legal workflows. The benchmark contains over 1,200 agent tasks spanning 24 legal practice areas, evaluated against more than 75,000 expert-written rubric criteria. Unlike existing legal AI benchmarks that focus on short-horizon reasoning—answering questions about contracts or comparing case law—LAB is designed to test agents on end-to-end client matters where lawyers would normally delegate work to associates.
Why Legal Needed Its Own Agent Benchmark
The Harvey team points out that coding agent benchmarks like SWE-Bench Pro and Terminal-Bench 2.0 served as leading indicators of capability improvements, a pattern now extending beyond software development. Existing legal benchmarks, including LegalBench, CUAD, and LEXam, grade narrow tasks rather than long-horizon work product generation. "Agent scores on SWE-Bench Pro reflected a step-function improvement around the same time our engineering team started to feel the shift in practice," the post notes, citing Andrej Karpathy's observation that coding agents "basically didn't work before December and basically work since." LAB aims to provide the same legible progress index for legal work.
How LAB Mirrors Actual Legal Work
LAB uses a client matter-centric structure where each task maps directly to how work gets assigned and reviewed at big law firms. Instructions are written as partner-to-associate requests averaging just 50 words—deliberately vague, requiring agents to discover relevant documents within a closed-universe file system. One example M&A task asks an agent to analyze change-of-control provisions for a fictional $458 million acquisition of Crestview Software Solutions, where the data room contains eight material contracts plus adjacent materials like 10-Ks and compensation plans that may or may not be relevant to the analysis.
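To make that structure concrete, here is a minimal sketch of what a matter-centric task record could look like. This is an assumption for illustration, not LAB's published schema: the TaskRecord and RubricCriterion names, fields, and file paths are all hypothetical, loosely modeled on the Crestview example above.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCriterion:
    """One atomic, binary pass/fail check written by a legal expert."""
    description: str  # e.g. "Identifies the change-of-control clause in the supplier agreement"
    category: str     # e.g. "facts", "citations", "financial exposure", "formatting"


@dataclass
class TaskRecord:
    """Hypothetical shape of a single matter-centric LAB task."""
    practice_area: str    # one of the 24 practice areas
    instruction: str      # ~50-word partner-to-associate request
    data_room: list[str]  # closed universe of files the agent may read
    rubric: list[RubricCriterion] = field(default_factory=list)


# Illustrative task loosely modeled on the change-of-control example above
example = TaskRecord(
    practice_area="M&A",
    instruction=(
        "We're advising on the $458M acquisition of Crestview Software Solutions. "
        "Review the data room and report on change-of-control exposure before signing."
    ),
    data_room=[
        "data_room/contracts/supplier_agreement.pdf",
        "data_room/contracts/enterprise_license.pdf",
        "data_room/filings/10-K.pdf",           # adjacent material, may be irrelevant
        "data_room/hr/compensation_plan.pdf",   # adjacent material, may be irrelevant
    ],
)
```

The point of the closed universe is that the agent has to decide which of those files matter, rather than being handed the relevant contract up front.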
All-Pass Grading: No Partial Credit
The verification methodology uses expert rubrics with atomic, binary pass/fail criteria covering facts, citations, severity ratings, financial exposure calculations, formatting choices, and recommendations. A task is marked complete only if every criterion passes, which Harvey calls "all-pass grading." For the change-of-control task, 57 criteria span nine legal issues planted across the matter. "A deal-team report that identifies eight of ten risks is not 80% useful; it is materially incomplete," the post states flatly. "The missing issue could change deal economics, require the analysis to be redone before closing, or surface as a problem after the deal closes."
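The all-pass rule itself is simple to state in code. Here is a minimal sketch, assuming each of a task's criteria has already been graded to a boolean by an expert or automated judge; the function name and inputs are illustrative, not LAB's actual harness.

```python
def all_pass(criterion_results: list[bool]) -> bool:
    """All-pass grading: a task counts as complete only if every criterion passes."""
    return all(criterion_results)


# Change-of-control task: 57 criteria across nine planted legal issues.
# Passing 56 of 57 still fails the task -- there is no partial credit.
results = [True] * 56 + [False]
print(all_pass(results))            # False
print(sum(results) / len(results))  # ~0.98 "accuracy", yet the task is marked incomplete
```

The contrast with a per-criterion average is the point: a high fraction of passed criteria can still hide the one missed issue that changes deal economics.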
Community-First Open Source Approach
LAB is Harvey's first fully open-source benchmark, and Harvey is deliberately launching it without a leaderboard. The team wants to work with the community on baseline results before publishing comparative scores, arguing that normalized submission standards will let people track improvements over time as the benchmark itself evolves with new tasks and practice areas. Early adopters are already using LAB for post-training research, auto-research exploration, memory optimization, domain-specific skill development, and harness tuning for long-horizon work.
Looking Ahead: Beyond Biglaw
The roadmap extends beyond law firms entirely. Future releases are planned to cover in-house counsel workflows, plus adjacent knowledge-work domains such as asset management, banking, and tax. The 24 practice areas in this initial release are a representative sample of transactional, advisory, regulatory, and litigation work, not an exhaustive map. Harvey is explicitly calling for collaboration from lawyers who can validate tasks, law firms that want to shape how agent evaluation works, legal technologists building domain-specific tooling, and AI labs interested in post-training models that produce reliable legal work product.
Key Takeaways
- LAB's all-pass grading model reflects real legal review standards—no partial credit for 80% correct answers when the missing 20% could blow a deal
- The client matter-centric structure tests whether agents can navigate loose instructions and discover relevant materials across complex file systems, not just answer narrow questions
- Harvey is deliberately skipping a launch leaderboard to establish normalized submission standards first, inviting community feedback before competitive comparisons
The Bottom Line
LAB fills a real gap between toy legal benchmarks and the actual high-stakes work product law firms need. The all-pass grading approach is exactly right: if an AI misses a change-of-control risk that blows up post-closing, 95% accuracy isn't impressive. Whether open-source community contributions can keep this benchmark honest and evolving faster than it can be gamed will determine whether LAB actually becomes the legible progress index Harvey wants it to be.