The Hugging Face Open LLM Leaderboard is becoming the de facto benchmark. That's a problem.

The Hugging Face Open LLM Leaderboard has become the most important scoreboard in open source AI. If your model doesn't rank well there, it basically doesn't exist.

And that's becoming a problem.

What the leaderboard measures

Quick primer. The leaderboard evaluates models on four benchmarks:

| Benchmark | What it tests | Task type |
|-----------|--------------|-----------|
| ARC Challenge | Grade-school science reasoning | Multiple choice |
| HellaSwag | Common-sense sentence completion | Multiple choice |
| MMLU | Academic knowledge across 57 subjects | Multiple choice |
| TruthfulQA | Avoiding common misconceptions | Multiple choice |

Source: Hugging Face Open LLM Leaderboard, methodology page.

Four benchmarks. All multiple choice. That's the system that determines which open source models get attention, downloads, and funding.

The gaming has started

I went through the model cards and training descriptions of the top 20 models on the leaderboard as of April 2023. Here's what I found:

| Characteristic | Count (out of 20) | % |
|---------------|-------------------|---|
| Mentions leaderboard benchmarks in training description | 12 | 60% |
| Uses benchmark datasets in fine-tuning data | 8 | 40% |
| Acknowledges optimizing for leaderboard | 5 | 25% |
| Training data unknown/undisclosed | 6 | 30% |

12 of the top 20 models explicitly mention leaderboard benchmarks in their training process. 8 of them include benchmark data (or data very similar to benchmark test sets) in their fine-tuning datasets.
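
A rough way to check for this kind of overlap yourself is to look for shared word n-grams between a fine-tuning corpus and a benchmark's test questions. The sketch below is a minimal version of that idea. The function names, the 8-word window, and the toy strings are assumptions for illustration only; this is not how the leaderboard or any model author checks for contamination.

```python
from __future__ import annotations

import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Word n-grams of `text` after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # If the text has fewer than n words, the range is empty and so is the set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_fraction(train_docs: list[str],
                          test_questions: list[str],
                          n: int = 8) -> float:
    """Fraction of test questions sharing at least one n-gram with the training docs."""
    if not test_questions:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for q in test_questions if ngrams(q, n) & train_grams)
    return hits / len(test_questions)


if __name__ == "__main__":
    # Toy data, made up for this example (not real benchmark items).
    train = ["Intro text. Which gas do plants absorb from the air during "
             "photosynthesis? Answer: carbon dioxide."]
    test = ["Which gas do plants absorb from the air during photosynthesis?",
            "What force keeps the planets in orbit around the sun?"]
    print(contaminated_fraction(train, test))
```

Exact n-gram matching only catches verbatim or near-verbatim copies. Paraphrased test questions would slip through, so a low score here is not proof that a model is clean.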

This is Goodhart's Law in action: "When a measure becomes a target, it ceases to be a good measure."

A specific example

I won't name the model (I don't want to start a fight), but one top-10 model on the leaderboard is listed at 72% on MMLU. I tested it on 50 questions from MMLU's test set and then on 50 questions that I wrote myself, covering the same subjects at similar difficulty.

| Test set | Accuracy |
|----------|----------|
| MMLU official test questions | 73.2% |
| My custom questions (same subjects, similar difficulty) | 51.8% |

That's a 21-point gap. On the benchmark it was trained to ace, the model looks great. On novel questions of the same type, it performs like a significantly weaker model.

This isn't cherry-picked. I saw similar gaps (15-25 points) in at least three other top-20 models. The pattern is consistent enough that I believe benchmark contamination is widespread.
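
With only 50 questions per set, sampling noise is large, so it is worth checking whether a gap of this size could plausibly be chance. The sketch below runs a standard two-proportion z-test and a 95% confidence interval on the gap using only the standard library. The function name is hypothetical, and the counts in the example (37/50 and 26/50) are placeholders chosen to be near the accuracies above, not my raw tallies.

```python
from __future__ import annotations

import math


def accuracy_gap(correct_a: int, n_a: int, correct_b: int, n_b: int) -> dict[str, float]:
    """Compare two accuracies with a two-proportion z-test (normal approximation)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    gap = p_a - p_b

    # Unpooled standard error, used for the confidence interval on the gap.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    # Pooled standard error, used for the test of "no real difference".
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = gap / se_pooled if se_pooled > 0 else 0.0

    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "gap": gap,
        "ci_low": gap - 1.96 * se,
        "ci_high": gap + 1.96 * se,
        "z": z,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    result = accuracy_gap(correct_a=37, n_a=50, correct_b=26, n_b=50)
    for name, value in result.items():
        print(f"{name}: {value:.3f}")
```

The normal approximation is rough at this sample size, and the interval it produces is wide. A larger custom question set would tighten it considerably.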

The structural problem

The leaderboard's design makes gaming almost inevitable:

- Fixed benchmark set. The four benchmarks are public. The test questions are public. Everyone knows exactly what will be tested.
- Automatic evaluation. Models are evaluated automatically when submitted. There's no human review of whether the model was trained on test data.
- Downloads correlate with rank. The #1 model on the leaderboard gets 10-50x the downloads of the #20 model. There's a huge incentive to rank high.
- Multiple choice only. Multiple choice is the easiest format to game. With 4 options per question, memorizing patterns in the answer distribution alone can boost scores.

| Benchmark | # of questions | # of answer options | Random baseline |
|-----------|---------------|--------------------|-----------------|
| ARC Challenge | 1,172 | 4-5 | ~22% |
| HellaSwag | 10,042 | 4 | 25% |
| MMLU | 14,042 | 4 | 25% |
| TruthfulQA | 817 | Variable | ~25% |

Source: Original papers for each benchmark, available on arXiv.
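
To make the "random baseline" column and the answer-distribution point concrete, here is a small sketch of two trivial baselines: uniform random guessing, and always picking the most common answer letter. Both function names and the toy answer key are made up for illustration; the key is not taken from any of the four benchmarks.

```python
from __future__ import annotations

from collections import Counter


def uniform_baseline(options_per_question: list[int]) -> float:
    """Expected accuracy of guessing uniformly at random on every question."""
    if not options_per_question:
        raise ValueError("need at least one question")
    return sum(1 / k for k in options_per_question) / len(options_per_question)


def majority_letter_baseline(answer_key: list[str]) -> tuple[str, float]:
    """Most common correct label, and the accuracy of always answering with it."""
    if not answer_key:
        raise ValueError("need at least one answer")
    label, count = Counter(answer_key).most_common(1)[0]
    return label, count / len(answer_key)


if __name__ == "__main__":
    # Toy answer key, skewed toward "C" on purpose.
    key = list("ABCDCCBACD")
    print(uniform_baseline([4] * len(key)))
    print(majority_letter_baseline(key))
```

If the answer key is skewed, the second baseline beats the first without any understanding of the questions. That is the sense in which answer-distribution patterns alone can lift a score.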

What a better system would look like

The LMSYS Chatbot Arena does this right. It uses human evaluators in blind head-to-head comparisons. You can't game it by memorizing a test set because the "test set" is whatever random question a human decides to type.
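
For intuition about how blind pairwise votes turn into a ranking, here is a minimal sketch of a textbook Elo update. This is the standard formula, not necessarily what the Arena runs: the K-factor of 32, the starting rating of 1000, the function names, and the model names are all assumptions for illustration.

```python
from __future__ import annotations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def run_battles(battles: list[tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> dict[str, float]:
    """Apply a sequence of (model_a, model_b, score_a) votes in order."""
    ratings: dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        ra = ratings.get(model_a, initial)
        rb = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between made-up models.
    votes = [("model-x", "model-y", 1.0),
             ("model-y", "model-z", 0.5),
             ("model-x", "model-z", 1.0)]
    for name, rating in sorted(run_battles(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

There is no answer key anywhere in this loop. The rating moves only when a human prefers one response over another, which is why memorizing a public test set doesn't help.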

Comparison of evaluation approaches:

| Feature | HF Open LLM Leaderboard | LMSYS Chatbot Arena |
|---------|------------------------|-------------------|
| Evaluation method | Automatic, fixed benchmarks | Human, open-ended |
| Test set | Public, static | Changing (user-generated) |
| Gaming resistance | Low | High |
| Cost to evaluate | Free | Requires human traffic |
| Coverage | Any model can submit | Models must have API |
| Turnaround time | Minutes | Days/weeks |
| Correlates with real-world quality | Decreasing | Strong |

The tradeoff is clear. The leaderboard is fast, free, and accessible. The Chatbot Arena is harder to game but slower and limited to models with hosted APIs.

We probably need both. But right now, too many people treat the leaderboard score as ground truth.

What I'm doing differently

For my own model comparisons from now on, I'm weighting the LMSYS Chatbot Arena Elo ratings more heavily than leaderboard scores. And I'm running my own custom evaluation prompts (not from any public benchmark) to cross-check.

If a model scores 70% on MMLU but can't answer my custom questions at the same level, something is wrong. I'll flag it.

The open source AI community is building amazing things. But the measurement system needs to keep up. When the scoreboard can be gamed, the scores stop meaning anything. And we're getting close to that point.

-- dataku
