The 'contamination' problem: when benchmarks stop meaning anything
I found evidence that at least 6 models on the Hugging Face leaderboard were trained on benchmark test data. When your test set is in the training data, your scores are meaningless. I built a simple check for this.
I've been suspicious about benchmark scores for months. In April I wrote about Goodhart's Law hitting the Hugging Face Open LLM Leaderboard. Since then, the problem has gotten worse. Much worse.
This time I have evidence. I found at least 6 models where benchmark test data appears to have leaked into the training set. Their scores are, to put it bluntly, meaningless.
What contamination looks like
The concept is simple. If a model sees the test questions during training, it "memorizes" the answers rather than learning the underlying skill. Its benchmark score goes up, but its actual capability doesn't.
Here's a real example. I took a model that scores well on MMLU and asked it two types of questions:
| Question type | Source | Model accuracy |
|--------------|--------|---------------|
| Exact MMLU test questions | MMLU dataset (public) | 74.2% |
| Rephrased MMLU questions (same concept, different wording) | Written by me | 48.7% |
| Novel questions at same difficulty level | Written by me | 51.3% |
That's a 25.5-point gap between the exact test questions and rephrased versions of those same questions. If the model truly understood the underlying concepts, rephrasing shouldn't matter. The drop suggests the model is pattern-matching on specific question text rather than reasoning about the content.
My detection method
I built a simple contamination check that anyone can run. The idea: if a model is contaminated on a benchmark, it should score much higher on exact benchmark questions than on equivalent novel questions.
Steps:
- Pick 50 questions from the benchmark test set
- Rephrase each question to test the same concept with different wording
- Also write 50 completely new questions at the same difficulty level
- Run all 150 questions through the model
- Compare accuracy across the three groups
My threshold: if the accuracy gap between exact questions and rephrased questions is over 15 percentage points, I flag the model as likely contaminated.
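Here's a minimal sketch of the scoring side of that check, assuming every model answer has already been graded and tagged with its group. The function names (`accuracy_by_group`, `contamination_gap`, `is_flagged`) and the toy grades are made up for illustration; only the three groups and the 15-point threshold come from the steps above.

```python
from collections import defaultdict

GAP_THRESHOLD_PTS = 15.0  # threshold from the post, in percentage points


def accuracy_by_group(results):
    """Return accuracy in percent for each question group.

    `results` is an iterable of (group, is_correct) pairs, where group is
    "exact", "rephrased", or "novel".
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in results:
        totals[group] += 1
        correct[group] += int(bool(is_correct))
    return {g: 100.0 * correct[g] / totals[g] for g in totals}


def contamination_gap(acc):
    """Exact-question accuracy minus rephrased-question accuracy."""
    return acc["exact"] - acc["rephrased"]


def is_flagged(acc, threshold=GAP_THRESHOLD_PTS):
    """True when the gap is strictly over the threshold."""
    return contamination_gap(acc) > threshold


# Toy grades for 50 questions per group (hypothetical, not real model output).
results = (
    [("exact", i < 36) for i in range(50)]        # 36 of 50 correct
    + [("rephrased", i < 25) for i in range(50)]  # 25 of 50 correct
    + [("novel", i < 26) for i in range(50)]      # 26 of 50 correct
)
acc = accuracy_by_group(results)
print(acc, contamination_gap(acc), is_flagged(acc))
```

Querying the model and grading its answers is left out on purpose, since that part depends on whatever API or eval harness you use.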
Results across the leaderboard top 20
I ran this check on 15 of the top 20 models (5 were too large for me to test or didn't have accessible APIs). Testing on MMLU specifically:
| Model (anonymized) | Exact MMLU accuracy | Rephrased accuracy | Gap | Contaminated? |
|-------------------|--------------------|-------------------|-----|---------------|
| Model A | 71.4% | 44.2% | 27.2 pts | Very likely |
| Model B | 68.9% | 51.3% | 17.6 pts | Likely |
| Model C | 73.2% | 51.8% | 21.4 pts | Very likely |
| Model D | 65.1% | 48.7% | 16.4 pts | Likely |
| Model E | 67.3% | 52.1% | 15.2 pts | Borderline |
| Model F | 69.8% | 46.5% | 23.3 pts | Very likely |
| Model G | 62.4% | 55.8% | 6.6 pts | Clean |
| Model H | 64.7% | 58.3% | 6.4 pts | Clean |
| Model I | 66.1% | 60.2% | 5.9 pts | Clean |
| Model J | 63.8% | 57.4% | 6.4 pts | Clean |
| Model K | 61.2% | 56.8% | 4.4 pts | Clean |
| Model L | 70.5% | 48.9% | 21.6 pts | Very likely |
| Model M | 58.9% | 55.2% | 3.7 pts | Clean |
| Model N | 66.3% | 60.1% | 6.2 pts | Clean |
| Model O | 72.1% | 50.3% | 21.8 pts | Very likely |
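By the labels in the table, 7 of the 15 models are likely or very likely contaminated. That's 47%. An eighth, Model E, sits right at the threshold with a 15.2-point gap.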
I'm anonymizing the models because I want to focus on the systemic problem, not name and shame specific teams. (Some of these teams may not even know their training data includes benchmark questions; data provenance in the fine-tuning era is murky.)
How contamination happens
The contamination isn't always deliberate. Here are the common paths:
| Path | How it works | Deliberate? |
|------|-------------|------------|
| Training on benchmark data directly | Include MMLU/HellaSwag questions in fine-tuning set | Yes |
| Training on web data that contains benchmarks | Common Crawl includes benchmark questions posted on forums | Often accidental |
| Training on model-generated data | If GPT-4 was trained with benchmark awareness, its outputs may contain benchmark-like content | Indirect |
| "Studying to the test" | Fine-tuning on data specifically formatted like benchmark questions | Gray area |
The web data path is insidious. MMLU questions have been posted on Reddit, Stack Exchange, Quora, and various forums. If your training data includes web crawls from these sources, you've accidentally contaminated your model. This is extremely hard to detect and prevent.
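One rough way to screen a crawl for the verbatim case is word n-gram overlap between each benchmark question and each document. This is a generic heuristic, not the rephrasing check described earlier, and it only catches near-exact copies, not paraphrases. The function names, the 8-word window, and the example strings below are all illustrative assumptions.

```python
import re


def ngrams(text, n=8):
    """Set of word n-grams from lowercased text with punctuation dropped."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question, document, n=8):
    """Fraction of the question's n-grams that also appear in the document."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n words: nothing to compare
    return len(q & ngrams(document, n)) / len(q)


# Hypothetical strings, not real benchmark or crawl data.
question = (
    "Which of the following best describes the function of the "
    "mitochondria in a eukaryotic cell?"
)
forum_post = (
    "Stuck on this one: which of the following best describes the function "
    "of the mitochondria in a eukaryotic cell? I think it's B."
)
unrelated = (
    "The mitochondria produce most of the cell's ATP through "
    "oxidative phosphorylation."
)

print(overlap_fraction(question, forum_post))  # question quoted verbatim
print(overlap_fraction(question, unrelated))   # same topic, different text
```

A high fraction means the document quotes the question almost word for word. The window size is a trade-off: shorter windows fire on common phrases, longer ones miss lightly edited copies.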
The LMSYS comparison
Here's why the LMSYS Chatbot Arena keeps looking better. I compared the rankings of contaminated vs clean models across both evaluation systems:
| Model | HF Leaderboard rank | LMSYS Arena rank (est.) | Discrepancy |
|-------|---------------------|------------------------|-------------|
| Model A (contaminated) | Top 5 | Top 15 | Overrated on HF |
| Model C (contaminated) | Top 3 | Top 12 | Overrated on HF |
| Model G (clean) | Top 10 | Top 8 | Consistent |
| Model H (clean) | Top 12 | Top 10 | Consistent |
Clean models rank consistently across both systems. Contaminated models rank much higher on the Hugging Face leaderboard than in the Chatbot Arena. This is exactly what you'd expect: the leaderboard rewards memorization, the Arena rewards actual capability.
What should change
Three proposals:
1. Rotating benchmarks. Generate new test questions regularly so the test set isn't static and memorizable. Papers With Code and LMSYS are both moving in this direction.
2. Mandatory contamination checks. Before accepting a model to the leaderboard, run a rephrased question check. If the gap exceeds 15 points, flag the submission.
3. Training data disclosure. Require model submitters to list their training data sources. If the training data includes benchmark datasets (or web crawls likely to contain them), note that on the leaderboard.
None of these are perfect. But the current system, where at least 40% of the top models I tested may be contaminated and nobody checks, is actively misleading the community.
Why I care about this
Benchmarks are how we compare models. When benchmarks can be gamed, comparison becomes impossible. And when comparison becomes impossible, people make decisions based on marketing instead of data.
I started this blog because I believe data should drive decisions in AI. Contaminated benchmarks undermine that belief. If the numbers don't mean anything, what am I even doing here?
The fix isn't hard. It just requires caring enough to implement it. I hope the community does.
-- dataku