Independent LLM Benchmarks

How LLMs actually perform on real business tasks

MMLU and HumanEval test models in isolation. Real applications give models tools, structured outputs, and messy data. We benchmark leading LLMs on actual business work — same tools, same prompts, same ground truth. Only the model changes.

Subscribe to Updates

New benchmark results every month. Open data, open methodology, open code.

benchmark_run_047.log
09:12  Loading benchmark: Invoice Extraction v1.2
09:12  Dataset: 500 invoices, 12 formats, 4 languages
09:12  Harness: Vercel AI SDK + 3 tools (extract, validate, format)
09:12  Ground truth: human-annotated, cross-validated
09:13  Running Claude Opus 4.7...........
09:15  Running GPT-5.....................
09:17  Running Gemini 3 Pro..............
09:19  Running Qwen 3 72B................
09:21  Running DeepSeek v3...............
09:22  Evaluation complete. Results ready.
results_invoice_extraction.json
Benchmark Results — Invoice Extraction v1.2
Model             Accuracy   Tool use   Cost/task   p95 latency
Claude Opus 4.7   94.2%      98.1%      $0.031      1.2s
GPT-5             91.8%      96.4%      $0.042      0.9s
Gemini 3 Pro      89.5%      94.2%      $0.018      1.8s
Qwen 3 72B        84.1%      88.7%      $0.008      2.4s
DeepSeek v3       82.7%      85.3%      $0.006      2.1s

6 leading models per cycle
500+ tasks per benchmark
Updated monthly, with every major release

The Problem

The gap between benchmark scores and production results.

Every month a new model claims top scores on MMLU, HumanEval, or GPQA. Six weeks later you wire it into your product and the numbers don't hold. Why?

Because academic benchmarks test raw model capability in isolation. Production doesn't work that way. Production gives the model tools, structured output schemas, messy real-world inputs, latency budgets, and cost constraints. A model that aces HumanEval might hallucinate tool arguments. A model that wins GPQA might cost 10x more per task for 2% better accuracy.

We test models the way you'll use them — with real tools, real data, real tradeoffs.

Academic benchmarks (MMLU · HumanEval · GPQA)
  • Controlled prompts, no tools
  • Single-turn, no structured output
  • No cost or latency constraints

Production work (ActualBench)
  • Tool use & structured output schemas
  • Messy real-world inputs
  • Cost per task & p95 latency tracked

Methodology

The same harness, the same tools, the same ground truth. Only the model changes.

01 · Standardize the harness

Every model runs inside the same framework: Vercel AI SDK with identical tool definitions, prompt structure, and output schemas. No model-specific optimizations. The only variable is the LLM.
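A minimal sketch of what "only the model changes" can look like in a harness like this. The tool name, schema fields, and `buildRun` helper are hypothetical illustrations, not ActualBench's actual definitions: the point is that one shared tool object and one shared prompt are reused verbatim for every model.

```typescript
// Hypothetical sketch: one shared tool definition reused for every model run.
// Names and schema fields are illustrative, not ActualBench's real schemas.
type JsonSchema = Record<string, unknown>;

interface ToolDefinition {
  name: string;
  description: string;
  parameters: JsonSchema; // the identical JSON Schema is sent to every model
}

const extractTool: ToolDefinition = {
  name: "extract_line_items",
  description: "Extract invoice line items as structured JSON.",
  parameters: {
    type: "object",
    properties: {
      items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            description: { type: "string" },
            quantity: { type: "number" },
            unitPrice: { type: "number" },
          },
          required: ["description", "quantity", "unitPrice"],
        },
      },
    },
    required: ["items"],
  },
};

// The harness varies only the model id; tools and prompt stay fixed.
function buildRun(modelId: string) {
  return {
    model: modelId,
    tools: [extractTool],
    prompt: "Extract all line items from the attached invoice.",
  };
}
```

Because every run references the same tool object and prompt string, any score difference between two runs is attributable to the model, not the setup.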

02 · Measure what matters

Accuracy is one number. We also track tool use correctness, hallucination rate, cost per successful task, p50 and p95 latency, and failure modes. Production teams need all of it.
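Two of these metrics are easy to get subtly wrong, so here is a sketch of how they can be computed. The `RunRecord` shape is an assumption for illustration, not ActualBench's schema; the percentile uses the nearest-rank method, and cost per successful task deliberately counts the spend on failed runs.

```typescript
// Illustrative metric helpers over per-task run records.
// The RunRecord field names are assumptions, not ActualBench's schema.
interface RunRecord {
  correct: boolean;
  costUsd: number;
  latencyMs: number;
}

// Nearest-rank percentile (p = 0.5 for p50, 0.95 for p95).
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.ceil(p * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Total spend divided by successful tasks: failed runs still cost money,
// which is why this differs from raw cost per call.
function costPerSuccessfulTask(runs: RunRecord[]): number {
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  const successes = runs.filter((r) => r.correct).length;
  return successes === 0 ? Infinity : totalCost / successes;
}
```

A model with $0.01 per call and 50% accuracy effectively costs $0.02 per usable result, which is how a "cheap" model can lose to a pricier, more accurate one.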

03 · Publish everything

Datasets, eval scripts, prompts, tool definitions, full per-model results — all on GitHub. Reproduce our numbers. Challenge our methodology. Submit improvements.

Benchmark Categories

Tasks that actually run your business.

Document Extraction

Invoice line items, contract clauses, receipt parsing, multi-page report summarization. Tool-assisted structured output from unstructured inputs.

Text Classification

Support ticket intent, email routing, lead qualification against ICP criteria, content moderation. Accuracy plus calibration.

Data Enrichment

Entity extraction from free text, CRM field population, contact parsing, address normalization. Precision, recall, hallucination rate.
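One way to score an enrichment task against those three metrics, as a sketch: entities are compared as normalized strings (a simplifying assumption; real matching may need fuzzier logic), and the hallucination rate is the share of predicted entities with no ground-truth counterpart.

```typescript
// Sketch of enrichment scoring under a simple normalized-string-match
// assumption. Real entity matching may need fuzzier comparison.
function enrichmentScores(predicted: string[], gold: string[]) {
  const goldSet = new Set(gold.map((e) => e.toLowerCase().trim()));
  const predSet = new Set(predicted.map((e) => e.toLowerCase().trim()));

  let truePositives = 0;
  for (const e of predSet) if (goldSet.has(e)) truePositives++;

  const precision = predSet.size === 0 ? 0 : truePositives / predSet.size;
  const recall = goldSet.size === 0 ? 0 : truePositives / goldSet.size;
  // Hallucination rate: fraction of predicted entities not in the gold set.
  const hallucinationRate = predSet.size === 0 ? 0 : 1 - precision;

  return { precision, recall, hallucinationRate };
}
```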

Search & Matching

Product search against catalogs, hotel matching against supplier databases, lead deduplication. Top-k accuracy on realistic scale.
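Top-k accuracy is the natural metric here; a minimal sketch (the function name and input shape are illustrative): a query counts as a hit if the gold id appears anywhere in the model's first k ranked candidates.

```typescript
// Top-k accuracy sketch: a query is correct if the gold id appears
// among the model's first k ranked candidate ids.
function topKAccuracy(rankings: string[][], goldIds: string[], k: number): number {
  let hits = 0;
  for (let i = 0; i < rankings.length; i++) {
    if (rankings[i].slice(0, k).includes(goldIds[i])) hits++;
  }
  return rankings.length === 0 ? 0 : hits / rankings.length;
}
```

The same run yields a whole curve (top-1, top-3, top-10), which matters because a matching UI that shows three candidates cares about top-3, not top-1.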

Structured Generation

Email drafting, response templates, report writing with data citations. Coherence plus factual accuracy plus schema adherence.

Tool Selection & Reasoning

Pick the right tool from many. Fill in arguments correctly. Chain calls, handle partial results, recover from failures. The plumbing underneath every production agent.
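"Fill in arguments correctly" is checkable before any tool ever executes. A minimal sketch of that check, with illustrative field names (not ActualBench's harness code): validate each model-supplied argument against a declared spec and collect every violation.

```typescript
// Minimal sketch of validating a model's tool-call arguments before
// execution. The spec shape and field names are illustrative.
interface ArgSpec {
  name: string;
  type: "string" | "number" | "boolean";
  required: boolean;
}

function validateArgs(args: Record<string, unknown>, spec: ArgSpec[]): string[] {
  const errors: string[] = [];
  for (const field of spec) {
    const value = args[field.name];
    if (value === undefined) {
      if (field.required) errors.push(`missing required argument: ${field.name}`);
      continue;
    }
    if (typeof value !== field.type) {
      errors.push(`wrong type for ${field.name}: expected ${field.type}, got ${typeof value}`);
    }
  }
  return errors; // an empty array means the call is safe to execute
}
```

Counting these errors per model is one concrete way a "tool use correctness" number can be produced: a model that passes `"42"` where a number is required fails here even if its prose answer looks right.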

Who This Is For

Built for people shipping LLMs to production.

AI Engineers & ML Teams

Pick the right model for your task. See tool use failure modes before they hit your users. Compare cost-adjusted accuracy, not just benchmark scores.

CTOs & Heads of AI

Evaluate vendors with empirical data instead of marketing decks. Know what $0.03/task actually buys you. Justify model choice with numbers your team can defend.

Researchers & Practitioners

Reproduce our results. Cite our datasets. Submit new tasks. Build on open methodology.

Open By Design

Everything we measure, we publish.

No paywalls. No locked reports. No vendor-friendly omissions. Download the datasets. Read the eval harness. Reproduce the numbers on your own setup. Submit improvements.

Research the way research should work.

Datasets: CSV, Parquet, HuggingFace
Eval harness: TypeScript, Vercel AI SDK, MIT license
Tool definitions: exact schemas used across all models
Full results: per-model, per-task, per-run
Methodology: versioned, dated, documented
Reproducibility kit: Docker, exact model versions

About

An independent research project.

ActualBench was founded by Dmitry Loza, PhD researcher and founder of Mitryco, an AI agency building production LLM applications for business operations.

The work here comes from a recurring frustration: every new model claims breakthrough performance, and then you wire it into production and the numbers don't match. Benchmarks test exam knowledge. Production tests tool use, structured output, and cost. We decided to measure what actually happens when these systems meet real work.

Stay Current

New benchmarks every month.

Subscribe to get new results, methodology updates, and model comparisons the day they drop. No marketing fluff. Just data.

One email per month. Unsubscribe anytime.