AI Eval Costs Surge, Threatening Independent Model Oversight

AI model evaluation costs now rival or exceed training expenses for some model types, creating what researchers call an “accountability barrier” to independent oversight, the EvalEval Coalition reported Tuesday.

The report, authored by Avijit Ghosh, Yifan Mai, Georgia Channing and Leshem Choshen of the EvalEval Coalition and published on Hugging Face’s blog, documents evaluation costs ranging from hundreds to hundreds of thousands of dollars per benchmark run, figures that put comprehensive, independent AI evaluation out of reach for most academic institutions and safety organizations.

The Numbers

The analysis catalogs costs across three tiers of AI benchmarks. Running the Holistic Agent Leaderboard, or HAL, costs roughly $40,000 for a single pass across nine models and nine benchmarks. When statistical reliability is factored in — requiring eight reruns per test cell to account for agent variance — that figure climbs to approximately $320,000, according to the report.
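The arithmetic behind the HAL figures can be checked with a short sketch. The dollar amount, grid size and rerun count come from the report as quoted above; the variable and function names are illustrative, and the per-cell figure is only an average that assumes the cost is spread evenly across the grid, which the report does not claim.

```python
# Figures as quoted from the report; names are illustrative.
SINGLE_PASS_COST_USD = 40_000  # one pass over the full grid
MODELS = 9
BENCHMARKS = 9
RERUNS_PER_CELL = 8


def cost_with_reruns(single_pass_cost: float, reruns: int) -> float:
    """Total cost if every cell is rerun `reruns` times at the same price."""
    return single_pass_cost * reruns


cells = MODELS * BENCHMARKS
# Assumption: an even split across cells, used only to get an average.
avg_cost_per_cell = SINGLE_PASS_COST_USD / cells
total = cost_with_reruns(SINGLE_PASS_COST_USD, RERUNS_PER_CELL)

print(f"cells in the grid: {cells}")
print(f"average cost per cell, single pass: ${avg_cost_per_cell:,.2f}")
print(f"total with {RERUNS_PER_CELL} reruns per cell: ${total:,.0f}")
```

In practice agent runs vary widely in price by model and benchmark, so the average per-cell figure is a rough orientation rather than a quote for any single cell.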

Individual agent benchmarks carry their own price tags. A single evaluation run of OpenAI’s PaperBench costs about $9,500 per agent across 20 papers. GAIA, another agent benchmark, runs $2,829 for one frontier model evaluation. The researchers found that in one case, a benchmark configuration costing nine times more than an alternative yielded only a two-percentage-point accuracy improvement.

For scientific machine learning benchmarks that require training models from scratch during evaluation, costs are steep and leave little room for compression. The Well benchmark requires roughly 3,840 H100 GPU-hours, or about $9,600, for a full architecture sweep, the researchers found.
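The two figures quoted for The Well imply an hourly GPU price, which the sketch below works out. The GPU-hours and dollar total are from the report as quoted; the alternative hourly rates in the loop are hypothetical values chosen only to show how the total scales, not prices from any provider.

```python
# Figures as quoted from the report.
GPU_HOURS = 3_840
TOTAL_COST_USD = 9_600

# Hourly rate implied by the two quoted figures.
implied_rate = TOTAL_COST_USD / GPU_HOURS


def sweep_cost(gpu_hours: float, rate_per_hour: float) -> float:
    """Cost of a sweep at a flat hourly GPU rate."""
    return gpu_hours * rate_per_hour


print(f"implied rate: ${implied_rate:.2f} per H100 GPU-hour")

# Hypothetical rates, for illustration only.
for rate in (2.00, 2.50, 4.00):
    print(f"at ${rate:.2f}/hour: ${sweep_cost(GPU_HOURS, rate):,.0f}")
```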

A Reversal of the Training-Eval Ratio

Historically, training AI models consumed the vast majority of compute budgets, with evaluation representing a negligible cost. That equation has now flipped for certain model categories, according to the analysis.

“Evaluation costs ‘may even surpass those of pretraining when evaluating checkpoints,’” the researchers wrote. “For small models, evaluation becomes the dominant compute line item across the whole development cycle.”

The report traces this shift to three factors: the rise of agentic AI systems requiring expensive multi-turn rollouts rather than single forward passes; the need for repeated runs to establish statistical reliability; and vast cost differences between model providers, with pricing spanning two orders of magnitude from budget to frontier-tier API access.

Compression Has Limits

While researchers have achieved cost reductions for traditional static benchmarks — compressing MMLU from 14,000 to 100 items with roughly 2 percent error, and achieving 100- to 200-fold compute reduction on Stanford’s HELM benchmark — those gains do not translate to newer evaluation paradigms, the analysis found.

Agent benchmarks can be compressed only 2 to 3.5 times using mid-difficulty filtering techniques. Training-in-the-loop benchmarks, which require building models from scratch as part of the evaluation, are “essentially incompressible,” the researchers wrote.
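To put the compression figures side by side, the sketch below computes the MMLU item reduction and then applies the 2 to 3.5 times range to a $40,000 agent evaluation budget. Pairing that range with the HAL single-pass figure is an assumption made here for illustration; the passage above does not say which benchmark the range was measured on. Function names are illustrative.

```python
def compression_factor(original_items: int, compressed_items: int) -> float:
    """How many times smaller the compressed benchmark is."""
    return original_items / compressed_items


def compressed_cost(full_cost: float, factor: float) -> float:
    """Cost after compression, assuming cost scales with benchmark size."""
    return full_cost / factor


# Static benchmark: MMLU item counts as quoted from the report.
mmlu_factor = compression_factor(14_000, 100)
print(f"MMLU item reduction: {mmlu_factor:.0f}x")

# Agent benchmarks: quoted range applied to an illustrative $40,000 budget.
ILLUSTRATIVE_AGENT_BUDGET_USD = 40_000
for factor in (2.0, 3.5):
    cost = compressed_cost(ILLUSTRATIVE_AGENT_BUDGET_USD, factor)
    print(f"{factor}x compression: ${cost:,.0f}")
```

Even at the top of the quoted range, the illustrative budget stays above $11,000, compared with a 140-fold item reduction on the static benchmark.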

Reliability Concerns Multiply Costs

The report highlights a reliability issue that compounds the cost problem. Agent evaluations show high variance between runs, with one benchmark’s score dropping from 60 percent on a single run to 25 percent when measured for consistency across eight runs. Achieving statistical confidence multiplies costs roughly eightfold.
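One way a gap like this can arise is if a task counts as solved only when the agent succeeds on every one of the eight runs. The sketch below shows that reading with synthetic outcomes constructed to mirror the 60 and 25 percent figures; it is not data from the report, and the report's exact metric may differ. All names are illustrative.

```python
K = 8  # runs per task

# Synthetic outcomes (True = solved in that run), NOT data from the report.
always_pass = [[True] * K for _ in range(5)]                    # solved every run
flaky = [[True] + [False] * (K - 1) for _ in range(7)]          # solved on run 1 only
first_fail = [[False] + [True] * (K - 1) for _ in range(8)]     # failed on run 1
results = always_pass + flaky + first_fail


def single_run_rate(results, run_index=0):
    """Share of tasks solved in one particular run."""
    return sum(task[run_index] for task in results) / len(results)


def consistent_rate(results):
    """Share of tasks solved in every run."""
    return sum(all(task) for task in results) / len(results)


print(f"single run: {single_run_rate(results):.0%}")
print(f"all {K} runs: {consistent_rate(results):.0%}")
```

A single run cannot distinguish the flaky tasks from the reliably solved ones, which is why the extra runs, and their cost, are needed.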

The researchers also flagged quality issues with existing benchmarks. Seven of 17 benchmarks on HAL had no holdout set for detecting data leakage, and environmental errors affected approximately 40 percent of runs on some benchmarks.
