The Hugging Face Open LLM Leaderboard is becoming the de facto benchmark. That's a problem.
Open source models increasingly optimize for the Hugging Face leaderboard. I checked: 12 of the top 20 models mention leaderboard benchmarks in their training descriptions, and 8 include benchmark data in their fine-tuning sets. Goodhart's Law is hitting AI benchmarks hard.
The Hugging Face Open LLM Leaderboard has become the most important scoreboard in open source AI. If your model doesn't rank well there, it basically doesn't exist.
And that's becoming a problem.
What the leaderboard measures
Quick primer. The leaderboard evaluates models on four benchmarks:
| Benchmark | What it tests | Task type |
|-----------|---------------|-----------|
| ARC Challenge | Grade-school science reasoning | Multiple choice |
| HellaSwag | Common-sense sentence completion | Multiple choice |
| MMLU | Academic knowledge across 57 subjects | Multiple choice |
| TruthfulQA | Avoiding common misconceptions | Multiple choice |
Source: Hugging Face Open LLM Leaderboard, methodology page.
Four benchmarks. All multiple choice. That's the system that determines which open source models get attention, downloads, and funding.
The gaming has started
I went through the model cards and training descriptions of the top 20 models on the leaderboard as of April 2023. Here's what I found:
| Characteristic | Count (out of 20) | % |
|----------------|-------------------|---|
| Mentions leaderboard benchmarks in training description | 12 | 60% |
| Uses benchmark datasets in fine-tuning data | 8 | 40% |
| Acknowledges optimizing for leaderboard | 5 | 25% |
| Training data unknown/undisclosed | 6 | 30% |
12 of the top 20 models explicitly mention leaderboard benchmarks in their training process. 8 of them include benchmark data (or data very similar to benchmark test sets) in their fine-tuning datasets.
This is Goodhart's Law in action: "When a measure becomes a target, it ceases to be a good measure."
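To make "data very similar to benchmark test sets" concrete, here is a rough sketch of one common way to look for overlap: count how many test questions share a long word n-gram with the fine-tuning data. This is not how I produced the table above (that came from reading model cards). The function names, the 8-word window, and the toy strings are all assumptions for illustration.

```python
import re

def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_docs, test_questions, n=8):
    """Fraction of test questions that share at least one n-gram with the training docs."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_questions:
        return 0.0
    hits = sum(1 for q in test_questions if ngrams(q, n) & train_grams)
    return hits / len(test_questions)

# Toy data, invented for illustration only (not real benchmark items).
train_docs = [
    "Q: Which gas do plants absorb from the air during photosynthesis? A: carbon dioxide",
]
test_questions = [
    "Which gas do plants absorb from the air during photosynthesis?",
    "What force keeps the planets in orbit around the sun?",
]
rate = contamination_rate(train_docs, test_questions, n=8)
print(f"{rate:.0%} of test questions overlap with the training data")
```

A check like this only catches near-verbatim copies. Paraphrased or translated test questions would slip through, so a low overlap rate is not proof of a clean training set.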
A specific example
I won't name the model (I don't want to start a fight), but one top-10 model on the leaderboard scores 72% on MMLU. I tested it on 50 questions from MMLU's test set and then on 50 questions of similar difficulty that I wrote myself.
| Test set | Accuracy |
|----------|----------|
| MMLU official test questions | 73.2% |
| My custom questions (same subjects, similar difficulty) | 51.8% |
That's a 21-point gap. On the benchmark it was trained to ace, the model looks great. On novel questions of the same type, it performs like a significantly weaker model.
This isn't cherry-picked. I saw similar gaps (15-25 points) in at least three other top-20 models. The pattern is consistent enough that I believe benchmark contamination is widespread.
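If you want to run the same comparison yourself, it helps to put an error bar on the gap, because 50 questions per set is a small sample. A minimal sketch, assuming the two question sets are independent samples and using a plain normal approximation (both are simplifying assumptions, and the function name is mine):

```python
import math

def accuracy_gap(acc_a, n_a, acc_b, n_b, z=1.96):
    """Difference between two accuracies with a normal-approximation interval.

    acc_a, acc_b are accuracies in [0, 1]; n_a, n_b are question counts.
    z=1.96 corresponds to an approximate 95% interval.
    """
    gap = acc_a - acc_b
    # Standard error of a difference of two independent proportions.
    se = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    return gap, gap - z * se, gap + z * se

# Accuracies and sample sizes from the table above.
gap, low, high = accuracy_gap(0.732, 50, 0.518, 50)
print(f"gap = {gap:.3f}, approx. 95% interval = ({low:.3f}, {high:.3f})")
```

With these inputs the interval is wide (roughly 3 to 40 points) but stays above zero. More questions per set would tighten it considerably.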
The structural problem
The leaderboard's design makes gaming almost inevitable:
- Fixed benchmark set. The four benchmarks are public. The test questions are public. Everyone knows exactly what will be tested.
- Automatic evaluation. Models are evaluated automatically when submitted. There's no human review of whether the model was trained on test data.
- Downloads correlate with rank. The #1 model on the leaderboard gets 10-50x the downloads of the #20 model. There's a huge incentive to rank high.
- Multiple choice only. Multiple choice is the easiest format to game. With 4 options per question, memorizing patterns in the answer distribution alone can boost scores.
| Benchmark | # of questions | # of answer options | Random baseline |
|-----------|----------------|---------------------|-----------------|
| ARC Challenge | 1,172 | 4-5 | ~22% |
| HellaSwag | 10,042 | 4 | 25% |
| MMLU | 14,042 | 4 | 25% |
| TruthfulQA | 817 | Variable | ~25% |
Source: Original papers for each benchmark, available on arXiv.
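Two baselines are worth keeping in mind when reading multiple-choice scores: uniform random guessing, and always picking the most frequent answer label. The sketch below computes both. The option counts and answer key are toy values I made up, not taken from any of the benchmarks above, and the function names are my own.

```python
from collections import Counter

def random_baseline(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts holds the number of answer options for each question.
    """
    return sum(1 / k for k in option_counts) / len(option_counts)

def majority_baseline(answer_key):
    """Accuracy from always answering with the most frequent correct label."""
    most_common_count = Counter(answer_key).most_common(1)[0][1]
    return most_common_count / len(answer_key)

# Toy values, invented for illustration only.
option_counts = [4, 4, 4, 5]
answer_key = ["A", "C", "C", "B", "C", "D", "C", "A"]

rand_acc = random_baseline(option_counts)
maj_acc = majority_baseline(answer_key)
print(f"random guessing: {rand_acc:.4f}")
print(f"always pick the most common label: {maj_acc:.4f}")
```

If the majority baseline sits well above the random baseline, the answer key is skewed, and a model can pick up points from label frequency alone without understanding the questions.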
What a better system would look like
The LMSYS Chatbot Arena does this right. It uses human evaluators in blind head-to-head comparisons. You can't game it by memorizing a test set because the "test set" is whatever random question a human decides to type.
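Head-to-head votes like these are typically turned into a ranking with an Elo-style rating. Here is a minimal sketch of the standard Elo update. The starting rating of 1000, the K-factor of 32, and the model names are assumptions for illustration, not the Arena's actual settings or pipeline.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so total rating is conserved.
    return rating_a + delta, rating_b - delta

# Hypothetical models and votes, invented for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # a human preferred model_x
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 0.0),  # a human preferred model_y
]
for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

print(ratings)
```

The point for gaming resistance is that the rating only moves when a human prefers one answer over another on a prompt the model has never been scored on before.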
Comparison of evaluation approaches:
| Feature | HF Open LLM Leaderboard | LMSYS Chatbot Arena |
|---------|-------------------------|---------------------|
| Evaluation method | Automatic, fixed benchmarks | Human, open-ended |
| Test set | Public, static | Changing (user-generated) |
| Gaming resistance | Low | High |
| Cost to evaluate | Free | Requires human traffic |
| Coverage | Any model can submit | Models must have API |
| Turnaround time | Minutes | Days/weeks |
| Correlates with real-world quality | Decreasing | Strong |
The tradeoff is clear. The leaderboard is fast, free, and accessible. The Chatbot Arena is harder to game but slower and limited to models with hosted APIs.
We probably need both. But right now, too many people treat the leaderboard score as ground truth.
What I'm doing differently
For my own model comparisons from now on, I'm weighting the LMSYS Chatbot Arena Elo ratings more heavily than leaderboard scores. And I'm running my own custom evaluation prompts (not from any public benchmark) to cross-check.
If a model scores 70% on MMLU but can't answer my custom questions at the same level, something is wrong. I'll flag it.
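The flagging rule itself is simple enough to write down. A rough sketch, where the 10-point threshold, the model names, and the scores are all placeholders I picked for illustration:

```python
def flag_suspect_models(results, max_gap=0.10):
    """Return models whose benchmark score beats their custom-set accuracy by more than max_gap.

    results maps model name -> (benchmark_accuracy, custom_accuracy), both in [0, 1].
    """
    return [
        name
        for name, (benchmark_acc, custom_acc) in results.items()
        if benchmark_acc - custom_acc > max_gap
    ]

# Hypothetical scores, invented for illustration only.
results = {
    "model_a": (0.70, 0.52),  # large drop on custom questions
    "model_b": (0.65, 0.61),  # holds up on custom questions
}
flagged = flag_suspect_models(results)
print(flagged)
```

The threshold is a judgment call. Custom questions are never perfectly matched in difficulty, so a small gap by itself doesn't mean much.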
The open source AI community is building amazing things. But the measurement system needs to keep up. When the scoreboard can be gamed, the scores stop meaning anything. And we're getting close to that point.
-- dataku