Benchmark Analysis | January 13, 2026 | 5 min read

The state of AI benchmarks in early 2026: what still works?

MMLU is saturated. HumanEval is gamed. SWE-bench has contamination issues. I reviewed 20 active benchmarks and rated each on reliability, relevance, and resistance to gaming. Only 4 scored above 7/10. Chatbot Arena is still the gold standard.

I reviewed 20 active AI benchmarks and rated each on three dimensions: reliability (consistent results), relevance (measures what matters), and resistance to gaming (hard to overfit). Each dimension is scored from 1 to 10, and the overall score is the unweighted average of the three.

The results are sobering. Most benchmarks are broken or breaking.

The 20 benchmarks, ranked

| Benchmark | Reliability | Relevance | Resistance to gaming | Overall (avg) |
|-----------|-------------|-----------|----------------------|---------------|
| LMSYS Chatbot Arena | 9 | 9 | 10 | 9.3 |
| GPQA Diamond | 8 | 8 | 8 | 8.0 |
| LiveCodeBench | 8 | 9 | 7 | 8.0 |
| SWE-bench Verified | 7 | 9 | 7 | 7.7 |
| Codeforces | 7 | 6 | 8 | 7.0 |
| MMLU-Pro | 7 | 7 | 6 | 6.7 |
| IFEval | 8 | 7 | 5 | 6.7 |
| MuSR | 7 | 6 | 7 | 6.7 |
| Scale AI HELM | 7 | 7 | 6 | 6.7 |
| MATH (Level 5) | 7 | 6 | 6 | 6.3 |
| BBH (Big Bench Hard) | 6 | 7 | 6 | 6.3 |
| ChartQA | 7 | 7 | 5 | 6.3 |
| AIME | 6 | 5 | 7 | 6.0 |
| MMLU | 5 | 5 | 3 | 4.3 |
| HumanEval | 5 | 5 | 3 | 4.3 |
| TruthfulQA | 4 | 4 | 3 | 3.7 |
| ARC-Challenge | 4 | 3 | 3 | 3.3 |
| HellaSwag | 3 | 3 | 2 | 2.7 |
| GSM8K | 3 | 3 | 2 | 2.7 |
| WinoGrande | 3 | 2 | 2 | 2.3 |

Sources: my own assessment, based on contamination testing, score saturation analysis, correlation with real-world usefulness, and methodology review. Also Papers With Code and the benchmark papers.
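The Overall column is the unweighted mean of the three dimension scores, rounded to one decimal. A minimal Python sketch of that arithmetic, using a handful of rows from the table (the `scores` dict and the `overall` helper are names introduced here purely for illustration):

```python
# Each entry: (reliability, relevance, resistance to gaming), copied from the table.
scores = {
    "LMSYS Chatbot Arena": (9, 9, 10),
    "GPQA Diamond": (8, 8, 8),
    "LiveCodeBench": (8, 9, 7),
    "SWE-bench Verified": (7, 9, 7),
    "Codeforces": (7, 6, 8),
    "MMLU": (5, 5, 3),
}

def overall(dims):
    """Unweighted mean of the dimension scores, rounded to one decimal."""
    return round(sum(dims) / len(dims), 1)

# Rank by overall score, highest first (ties keep their original order).
ranked = sorted(scores, key=lambda name: overall(scores[name]), reverse=True)

# "Trusted" here means strictly above 7, so a 7.0 does not qualify.
trusted = [name for name in ranked if overall(scores[name]) > 7]

for name in ranked:
    print(f"{name}: {overall(scores[name])}")
```

Because the cutoff is strictly above 7, Codeforces at exactly 7.0 falls just outside the trusted group.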

The 4 benchmarks I still trust

1. Chatbot Arena (9.3/10)

Still the gold standard. Human voters compare blind model outputs side by side. It is very hard to game because the prompts are novel and the evaluation is human preference, not a fixed answer key.

Limitation: biased toward conversational quality. Not great at measuring specialized capabilities (code debugging, scientific reasoning).
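To make the mechanism concrete: blind pairwise votes can be turned into a ranking with an Elo-style update. This is a generic sketch of that idea, not the Arena's actual pipeline. The starting rating of 1000, the K-factor of 32, the model names, and the vote list are all illustrative assumptions.

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    """Apply one blind pairwise vote to the ratings dict, in place."""
    e_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1 - e_win)  # a surprising win moves ratings more
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models and votes, each vote recorded as (winner, loser).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

There is no answer key anywhere in this loop, only preferences, which is why memorizing a test set does not help.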

2. GPQA Diamond (8.0/10)

Graduate-level science questions written by PhD-level domain experts. Hard enough that skilled non-experts score below 35%, even with web access. Contamination is low because the question set is small and carefully guarded.

Limitation: small test set (198 questions), so individual model scores are noisy. A single question moves a score by about half a percentage point.
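The noise is easy to quantify. Treating each question as an independent pass/fail trial (an assumption, since real questions cluster by topic), the standard error of an accuracy score follows from the binomial formula. The 60% accuracy below is an illustrative value, not a measured result.

```python
import math

def accuracy_std_error(accuracy, n_questions):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)

n = 198                      # size of the GPQA Diamond set
se = accuracy_std_error(0.60, n)
margin = 1.96 * se           # half-width of an approximate 95% interval

print(f"one question = {100 / n:.2f} points")
print(f"standard error = {100 * se:.1f} points")
print(f"95% margin = +/-{100 * margin:.1f} points")
```

With a margin of roughly 7 points either way, two models that differ by a few points on a single run are not clearly separable.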

3. LiveCodeBench (8.0/10)

Coding problems from recent competitive programming contests. New problems are added regularly, making contamination difficult. Tests actual coding ability, not memorized solutions.

Limitation: competitive programming isn't representative of real-world coding.
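The contamination defense here boils down to a date filter: score a model only on problems published after its training cutoff. A minimal sketch of that idea follows. The `Problem` fields, the sample entries, and the cutoff date are hypothetical and do not reflect LiveCodeBench's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    problem_id: str
    released: date

def uncontaminated(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# Hypothetical problem pool with release dates.
problems = [
    Problem("contest-101-a", date(2025, 3, 2)),
    Problem("contest-117-c", date(2025, 9, 14)),
    Problem("contest-125-b", date(2025, 12, 7)),
]

# Hypothetical model whose training data ends on 2025-06-30.
eval_set = uncontaminated(problems, training_cutoff=date(2025, 6, 30))
print([p.problem_id for p in eval_set])
```

The filter only works while new problems keep arriving, which is why the regular updates matter more than any single problem set.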

4. SWE-bench Verified (7.7/10)

Real bugs from real open source projects. The "Verified" subset is a human-validated set of 500 tasks that filters out underspecified issues and unreliable tests. Tests end-to-end coding: reading a codebase, understanding the bug, writing a fix.

Limitation: the agent scaffolding around the model matters as much as the model itself. Contamination risk also grows over time because the test repos are well known.

The benchmarks I no longer trust

| Benchmark | Problem | When it broke |
|-----------|---------|---------------|
| MMLU | 82% contamination in some models, scores saturated above 91% | Mid-2024 |
| HumanEval | Published solutions widely available, scores above 97% | Early 2025 |
| HellaSwag | Saturated (99%+ scores), trivially solvable | 2024 |
| GSM8K | Elementary math, scores above 97%, widely contaminated | 2024 |
| ARC-Challenge | Saturated, pattern-matchable | 2024 |

Sources: My contamination testing (article from May 2025), Hugging Face, model scores over time.

These benchmarks aren't "wrong." They just don't differentiate between models anymore. When 8 models score 90%+ on MMLU, the benchmark has stopped being useful for comparison.
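One way to make "doesn't differentiate anymore" operational is to compare the spread among the top models with the benchmark's sampling noise. The sketch below is illustrative only: the scores are made-up placeholders, `n_questions` is a hypothetical parameter, and the threshold (spread smaller than two standard errors) is a rule of thumb assumed here, not an established test.

```python
import math

def is_saturated(top_scores, n_questions):
    """True if the gap between the best and worst top model is within ~2 standard errors."""
    mean = sum(top_scores) / len(top_scores)
    se = math.sqrt(mean * (1 - mean) / n_questions)  # binomial standard error
    spread = max(top_scores) - min(top_scores)
    return spread < 2 * se

# Placeholder accuracies for five hypothetical models clustered above 90%.
clustered = [0.905, 0.908, 0.910, 0.912, 0.915]
# Placeholder accuracies for three hypothetical models that are clearly apart.
separated = [0.60, 0.75, 0.90]

print(is_saturated(clustered, n_questions=1000))
print(is_saturated(separated, n_questions=1000))
```

When the top of the leaderboard fits inside the noise band, the ordering within it carries little information.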

What makes a good benchmark in 2026?

| Property | Why it matters | Best examples |
|----------|----------------|---------------|
| Evolving (new problems added) | Prevents contamination | Chatbot Arena, LiveCodeBench |
| Expert-authored | Harder to game than scraped/generated | GPQA |
| Task-based (not multiple choice) | Tests capability, not pattern matching | SWE-bench |
| Human evaluation | Captures nuance that scoring rubrics miss | Chatbot Arena |
| Versioned and updated | Stays relevant as models improve | HF Leaderboard v2 |

The ideal benchmark is evolving, expert-authored, task-based, and human-evaluated. Chatbot Arena is the closest to this ideal, which is why it's been the most trusted benchmark for two years running.

The benchmark situation is messy. But the signal is still there if you know which instruments to trust.



-- dataku
