How LLMs actually perform on real business tasks
MMLU and HumanEval test models in isolation. Real applications give models tools, structured outputs, and messy data. We benchmark leading LLMs on actual business work — same tools, same prompts, same ground truth. Only the model changes.
New benchmark results every month. Open data, open methodology, open code.
| Model | Accuracy | Tool-use correctness | Cost / task | p95 latency |
|---|---|---|---|---|
| Claude Opus 4.7 | 94.2% | 98.1% | $0.031 | 1.2s |
| GPT-5 | 91.8% | 96.4% | $0.042 | 0.9s |
| Gemini 3 Pro | 89.5% | 94.2% | $0.018 | 1.8s |
| Qwen 3 72B | 84.1% | 88.7% | $0.008 | 2.4s |
| DeepSeek v3 | 82.7% | 85.3% | $0.006 | 2.1s |
- 6 leading models per benchmark cycle
- 500+ tasks per benchmark
- Updated monthly, with every major release
The gap between benchmark scores and production results.
Every month a new model claims top scores on MMLU, HumanEval, or GPQA. Six weeks later you wire it into your product and the numbers don't hold. Why?
Because academic benchmarks test raw model capability in isolation. Production doesn't work that way. Production gives the model tools, structured output schemas, messy real-world inputs, latency budgets, and cost constraints. A model that aces HumanEval might hallucinate tool arguments. A model that wins GPQA might cost 10x more per task for 2% better accuracy.
We test models the way you'll use them — with real tools, real data, real tradeoffs.
Academic benchmarks:

- Controlled prompts, no tools
- Single-turn, no structured output
- No cost or latency constraints

ActualBench:

- Tool use & structured output schemas
- Messy real-world inputs
- Cost per task & p95 latency tracked
The same harness, the same tools, the same ground truth. Only the model changes.
Standardize the harness
Every model runs inside the same framework: Vercel AI SDK with identical tool definitions, prompt structure, and output schemas. No model-specific optimizations. The only variable is the LLM.
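To make that concrete, here is a minimal sketch of what such a harness can look like with the Vercel AI SDK. The tool, fixture data, and model identifiers are illustrative placeholders rather than our actual task definitions, and exact call signatures vary between SDK versions.

```ts
// Minimal sketch of a model-swappable harness built on the Vercel AI SDK.
// Tool definitions, model identifiers, and the fixture-backed execute() below
// are illustrative placeholders; exact call signatures vary by SDK version.
import { generateText, tool } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Identical tool definitions shared by every model under test.
const tools = {
  lookupCustomer: tool({
    description: "Fetch a customer record by email address",
    parameters: z.object({ email: z.string().email() }),
    // Stubbed against fixture data so every run sees the same world.
    execute: async ({ email }) => ({ email, plan: "pro", openTickets: 2 }),
  }),
};

// The only variable in the harness: which model it is pointed at.
const models = {
  claude: anthropic("claude-opus-4"),
  gpt: openai("gpt-5"),
} as const;

export async function runTask(modelKey: keyof typeof models, prompt: string) {
  return generateText({
    model: models[modelKey],
    tools,        // same tools for every model
    prompt,       // same prompt structure for every model
    maxSteps: 5,  // same tool-call budget for every model
  });
}
```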
Measure what matters
Accuracy is one number. We also track tool use correctness, hallucination rate, cost per successful task, p50 and p95 latency, and failure modes. Production teams need all of it.
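For example, cost per successful task and p95 latency can be derived from per-task records along these lines. The field names below are illustrative, not our published results schema.

```ts
// Illustrative shape of a per-task result and two derived metrics.
// Field names are placeholders, not ActualBench's published schema.
interface TaskResult {
  correct: boolean;        // output matched ground truth
  toolCallsValid: boolean; // tool arguments well-formed and grounded in the input
  hallucinated: boolean;   // output contained facts not present in the source
  costUsd: number;         // provider-billed cost for this task
  latencyMs: number;       // end-to-end wall-clock latency
}

// Cost per successful task: total spend divided by the number of correct outputs,
// so a cheap-but-inaccurate model is charged for its wasted attempts.
function costPerSuccessfulTask(results: TaskResult[]): number {
  const totalCost = results.reduce((sum, r) => sum + r.costUsd, 0);
  const successes = results.filter((r) => r.correct).length;
  return successes > 0 ? totalCost / successes : Infinity;
}

// Approximate p95 latency: the latency below which ~95% of tasks complete.
function p95Latency(results: TaskResult[]): number {
  const sorted = results.map((r) => r.latencyMs).sort((a, b) => a - b);
  return sorted[Math.min(sorted.length - 1, Math.floor(sorted.length * 0.95))];
}
```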
Publish everything
Datasets, eval scripts, prompts, tool definitions, full per-model results — all on GitHub. Reproduce our numbers. Challenge our methodology. Submit improvements.
Tasks that actually run your business.
Document Extraction
Invoice line items, contract clauses, receipt parsing, multi-page report summarization. Tool-assisted structured output from unstructured inputs.
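As an illustration of what "tool-assisted structured output" means here, an extraction task pins the model to a fixed schema and scores the result against ground truth. The invoice fields and model identifier below are a hedged sketch, not our published task definitions.

```ts
// Hedged sketch of a structured-extraction task: the model must return invoice
// line items that validate against a fixed schema. Field names are illustrative.
import { generateObject } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { z } from "zod";

const invoiceSchema = z.object({
  invoiceNumber: z.string(),
  lineItems: z.array(
    z.object({
      description: z.string(),
      quantity: z.number().int().positive(),
      unitPriceCents: z.number().int().nonnegative(),
    })
  ),
  totalCents: z.number().int().nonnegative(),
});

export async function extractInvoice(rawText: string) {
  // Every model under test gets the same schema and the same prompt wording.
  const { object } = await generateObject({
    model: anthropic("claude-opus-4"), // swapped per benchmark run
    schema: invoiceSchema,
    prompt: `Extract the invoice fields from the following document:\n\n${rawText}`,
  });
  return object;
}
```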
Text Classification
Support ticket intent, email routing, lead qualification against ICP criteria, content moderation. Accuracy plus calibration.
Data Enrichment
Entity extraction from free text, CRM field population, contact parsing, address normalization. Precision, recall, hallucination rate.
Search & Matching
Product search against catalogs, hotel matching against supplier databases, lead deduplication. Top-k accuracy on realistic scale.
Structured Generation
Email drafting, response templates, report writing with data citations. Coherence plus factual accuracy plus schema adherence.
Tool Selection & Reasoning
Pick the right tool from many. Fill in arguments correctly. Chain calls, handle partial results, recover from failures. The plumbing underneath every production agent.
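To show how this category can be scored, here is a hedged sketch of a single tool-selection check: did the model call the expected tool, and were its arguments well-formed? The tool name and argument schema are illustrative, not our actual definitions.

```ts
// Illustrative scoring of one tool-selection task. Names are placeholders.
import { z } from "zod";

const expected = {
  toolName: "createRefund",
  argsSchema: z.object({
    orderId: z.string().min(1),
    amountCents: z.number().int().positive(),
  }),
};

// A tool call as reported by the harness after a model run.
interface ObservedToolCall {
  toolName: string;
  args: unknown;
}

function scoreToolSelection(calls: ObservedToolCall[]) {
  const call = calls[0];
  if (!call) return { selectedCorrectTool: false, argsValid: false };
  const selectedCorrectTool = call.toolName === expected.toolName;
  const argsValid =
    selectedCorrectTool && expected.argsSchema.safeParse(call.args).success;
  return { selectedCorrectTool, argsValid };
}
```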
Built for people shipping LLMs to production.
AI Engineers & ML Teams
Pick the right model for your task. See tool use failure modes before they hit your users. Compare cost-adjusted accuracy, not just benchmark scores.
CTOs & Heads of AI
Evaluate vendors with empirical data instead of marketing decks. Know what $0.03/task actually buys you. Justify model choice with numbers your team can defend.
Researchers & Practitioners
Reproduce our results. Cite our datasets. Submit new tasks. Build on open methodology.
Everything we measure, we publish.
No paywalls. No locked reports. No vendor-friendly omissions. Download the datasets. Read the eval harness. Reproduce the numbers on your own setup. Submit improvements.
Research the way research should work.
An independent research project.
ActualBench was founded by Dmitry Loza, a PhD researcher and the founder of Mitryco, an AI agency building production LLM applications for business operations.
The work here comes from a recurring frustration: every new model claims breakthrough performance, and then you wire it into production and the numbers don't match. Benchmarks test exam knowledge. Production tests tool use, structured output, and cost. We decided to measure what actually happens when these systems meet real work.
New benchmarks every month.
Subscribe to get new results, methodology updates, and model comparisons the day they drop. No marketing fluff. Just data.
One email per month. Unsubscribe anytime.