The LLM leaderboard is dead, long live the leaderboard
Hugging Face deprecated the Open LLM Leaderboard v1 and launched v2 with new benchmarks. I compared scores on both versions for 20 models. Some fine-tuned models dropped more than 25 points. The re-ranking is dramatic, and some "top models" were just benchmark-optimized.
The Hugging Face Open LLM Leaderboard v1 is dead. Officially deprecated. v2 uses harder benchmarks, and the rankings look completely different.
I compared 20 models across both versions. The shake-up is revealing.
The benchmark changes
| v1 benchmarks | v2 replacements | Why changed |
|---------------|-----------------|-------------|
| MMLU (5-shot) | MMLU-Pro (5-shot, 10 answer choices, harder) | MMLU was saturated, heavily contaminated |
| HellaSwag | IFEval | HellaSwag too easy, gamed |
| ARC-Challenge | BBH (Big Bench Hard) | ARC was solvable by pattern matching |
| TruthfulQA | MuSR (Multistep Soft Reasoning) | TruthfulQA had known exploits |
| Winogrande | MATH (Level 5) | Winogrande saturated |
| GSM8K | GPQA | GSM8K too easy for modern models |
Sources: Hugging Face v2 announcement, Papers With Code.
Every v1 benchmark was either saturated (multiple models scoring 90%+), contaminated (training on test data), or both. v2 replaces all six with harder alternatives.
The re-ranking
| Model | v1 rank | v1 avg score | v2 rank | v2 avg score | Rank change |
|-------|---------|--------------|---------|--------------|-------------|
| DeepSeek R1 | 3 | 87.2 | 1 | 74.8 | +2 |
| Claude Opus 4 (via API) | N/A | N/A | 2 | 73.6 | N/A |
| Qwen3 235B | 1 | 89.4 | 4 | 71.2 | -3 |
| Llama 4 Maverick | 5 | 84.1 | 3 | 72.1 | +2 |
| Model X (fine-tune) | 2 | 88.8 | 14 | 62.4 | -12 |
| Model Y (fine-tune) | 4 | 85.6 | 18 | 58.1 | -14 |
| Mistral Large 3 | 8 | 82.3 | 6 | 69.4 | +2 |
| Phi-4 14B | 12 | 78.4 | 8 | 67.2 | +4 |
Sources: Hugging Face Open LLM Leaderboard v1 (archived) and v2.
The most dramatic drops are the fine-tuned models. "Model X" went from rank #2 to rank #14. "Model Y" from #4 to #18. These were models specifically optimized for v1 benchmarks. On harder, less-contaminated benchmarks, they fell apart.
DeepSeek R1 jumped from #3 to #1. Its reasoning capability gives it a structural advantage on the harder benchmarks (MATH Level 5, GPQA, MuSR), which can't be gamed as easily.
Qwen3 235B dropped from #1 to #4. Its v1 dominance was partly built on MMLU and ARC performance that doesn't translate to the harder v2 benchmarks.
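If you want to sanity-check the table yourself, here is a minimal Python sketch that recomputes the rank change and the raw score drop for each row. The numbers are copied straight from the table above; the names `ROWS` and `deltas` are my own, purely for illustration. Claude Opus 4 is left out because it has no v1 entry.

```python
# (model, v1_rank, v1_avg, v2_rank, v2_avg), copied from the re-ranking table.
# Claude Opus 4 is omitted because it has no v1 rank or score.
ROWS = [
    ("DeepSeek R1", 3, 87.2, 1, 74.8),
    ("Qwen3 235B", 1, 89.4, 4, 71.2),
    ("Llama 4 Maverick", 5, 84.1, 3, 72.1),
    ("Model X (fine-tune)", 2, 88.8, 14, 62.4),
    ("Model Y (fine-tune)", 4, 85.6, 18, 58.1),
    ("Mistral Large 3", 8, 82.3, 6, 69.4),
    ("Phi-4 14B", 12, 78.4, 8, 67.2),
]


def deltas(rows):
    """Return (model, rank_change, score_drop) tuples, biggest score drop first.

    rank_change is positive when the model moved up (a lower rank number in v2).
    score_drop is the v1 average minus the v2 average, rounded to one decimal.
    """
    out = []
    for name, v1_rank, v1_avg, v2_rank, v2_avg in rows:
        rank_change = v1_rank - v2_rank
        score_drop = round(v1_avg - v2_avg, 1)
        out.append((name, rank_change, score_drop))
    return sorted(out, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    for name, rank_change, score_drop in deltas(ROWS):
        print(f"{name:<22} rank {rank_change:+d}  score -{score_drop:.1f}")
```

Sorted this way, the two fine-tunes land at the top of the list with drops of 27.5 and 26.4 points, well ahead of Qwen3 235B at 18.2.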
Score drops by model type
| Model type | Avg v1 score | Avg v2 score | Avg drop |
|------------|--------------|--------------|----------|
| Base models (not fine-tuned) | 82.4 | 68.1 | -14.3 |
| Fine-tuned on benchmark data | 87.1 | 61.2 | -25.9 |
| Reasoning-optimized models | 85.8 | 73.4 | -12.4 |
Fine-tuned models dropped 25.9 points on average. Base models dropped 14.3. Reasoning models dropped only 12.4.
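The interpretation: fine-tuning on benchmark data inflated v1 scores by roughly 11 to 12 points on average. That is the gap between the 25.9-point drop for fine-tuned models and the 14.3-point drop for base models (11.6 points). On harder benchmarks, that inflation disappears.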
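The arithmetic behind that estimate is simple enough to spell out. This is a rough sketch, not a rigorous method: it treats the base-model drop as the "expected" drop from harder benchmarks and counts anything beyond it as inflation. The averages come from the table above; `AVERAGES`, `drop`, and `excess_drop` are names I made up for the example.

```python
# (avg v1 score, avg v2 score) per model type, copied from the table above.
AVERAGES = {
    "base": (82.4, 68.1),
    "benchmark_finetuned": (87.1, 61.2),
    "reasoning": (85.8, 73.4),
}


def drop(kind):
    """Average v1-to-v2 score drop for one model type, in points."""
    v1, v2 = AVERAGES[kind]
    return round(v1 - v2, 1)


def excess_drop(kind, baseline="base"):
    """How much more (or less) a model type dropped than the baseline type.

    Assumption: the baseline drop reflects benchmark difficulty alone,
    so the excess is a rough proxy for v1 score inflation.
    """
    return round(drop(kind) - drop(baseline), 1)


if __name__ == "__main__":
    for kind in AVERAGES:
        print(f"{kind:<20} drop {drop(kind):.1f}  vs base {excess_drop(kind):+.1f}")
```

By this measure the benchmark-tuned group dropped 11.6 points more than base models, while the reasoning group dropped 1.9 points less.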
What v2 actually measures
| Benchmark | What it tests | Contamination risk |
|-----------|---------------|--------------------|
| MMLU-Pro | Hard multiple-choice across 14 subjects | Lower (harder questions, less published) |
| IFEval | Instruction following precision | Low (constraint-based, hard to game) |
| BBH | Multi-step reasoning across 23 tasks | Medium (published but hard) |
| MuSR | Multi-step soft reasoning | Low (new benchmark) |
| MATH Level 5 | Competition-level math | Low (verified, hard to memorize solutions) |
| GPQA | Graduate-level science | Low (expert-authored, small test set) |
v2 benchmarks are collectively harder to game because they test deeper reasoning, not pattern matching.
My take
The v1 to v2 transition is the best thing that's happened to open-source model evaluation in years. v1 had become a vanity metric. Models were being optimized to score well on the leaderboard, not to be genuinely useful.
v2 isn't perfect (no benchmark is). But it's a meaningful upgrade in what gets measured.
The models that dropped the most are the ones that were most heavily optimized for v1 metrics. The models that held steady are the ones that were genuinely good at reasoning and following instructions. That's exactly what you want a leaderboard to reveal.
My advice: if you were choosing models based on v1 rankings, re-evaluate. In my testing, the v2 rankings are a better predictor of real-world usefulness.
-- dataku