The benchmark contamination problem is getting worse. New evidence.
I tested 15 models for memorization of MMLU questions. 4 of them could complete benchmark questions from the first few words alone. Contamination isn't just theoretical anymore. I can measure it.
I found something uncomfortable in the data.
Four out of fifteen models I tested can complete MMLU benchmark questions from just the first few words. Not "answer correctly." Complete. As in, they've memorized the exact question text.
That means those benchmark scores are partially measuring memorization, not capability.
The test
Simple methodology. I took 50 MMLU questions and gave each model only the first 5-8 words of each question, followed by "...". Then I asked: "Complete this question and provide the answer choices."
If a model can reconstruct the full question from a fragment, it has almost certainly seen that question during training.
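The fragment test above can be sketched in a few lines of Python. This is a minimal illustration, not the exact script behind the numbers: the function names, the 6-word default, and the 0.9 threshold are all assumptions, and the post does not specify how a response was judged "completed". The sample question is invented for the example and is not an MMLU item.

```python
from difflib import SequenceMatcher


def make_fragment(question: str, n_words: int = 6) -> str:
    """Keep only the first n_words words of the question and append '...'."""
    return " ".join(question.split()[:n_words]) + "..."


def build_prompt(fragment: str) -> str:
    """Pair the fragment with the instruction quoted in the post."""
    return f"{fragment}\n\nComplete this question and provide the answer choices."


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't count."""
    return " ".join(text.lower().split())


def is_completed(model_output: str, question: str, n_words: int = 6,
                 threshold: float = 0.9) -> bool:
    """Assumed scoring rule: the output must contain a near-verbatim copy of
    the part of the question the model was NOT shown."""
    hidden = normalize(" ".join(question.split()[n_words:]))
    if not hidden:
        return False
    output = normalize(model_output)
    matcher = SequenceMatcher(None, output, hidden, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(hidden))
    return match.size >= threshold * len(hidden)


def completion_rate(flags: list[bool]) -> float:
    """Fraction of questions the model reconstructed."""
    return sum(flags) / len(flags)


# Invented sample question, for illustration only.
question = "What is the boiling point of water at sea level in degrees Celsius?"
print(build_prompt(make_fragment(question)))
```

Scoring only the hidden remainder matters: the model is handed the opening words, so echoing them back proves nothing.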
Results: question completion accuracy
| Model | Questions completed (out of 50) | Completion rate |
|-------|--------------------------------|----------------|
| Model A (closed, major provider) | 41 | 82% |
| Model B (open, fine-tuned) | 38 | 76% |
| Model C (open, fine-tuned) | 34 | 68% |
| Model D (closed, major provider) | 28 | 56% |
| Model E (open, base model) | 12 | 24% |
| Model F (open, base model) | 9 | 18% |
| Claude 3.7 Sonnet | 7 | 14% |
| GPT-4o | 11 | 22% |
| Gemini 2.5 Pro | 8 | 16% |
| DeepSeek R1 | 14 | 28% |
| Qwen3 | 16 | 32% |
| Llama 4 Maverick | 10 | 20% |
| 3 others (various) | 5-15 | 10-30% |
I'm not naming Models A-D specifically. The point isn't to call out individual models. The point is the pattern.
What "contamination" looks like
A model that hasn't seen MMLU questions should either decline ("I can't complete this question from a fragment.") or invent a plausible continuation that doesn't match the original.
A contaminated model produces: "The full question is: 'Which of the following best describes the relationship between...' with answer choices (A) cooperative federalism (B) dual federalism (C) fiscal federalism (D) creative federalism."
Word for word. Including the exact answer choices in the exact order.
The difference between 14% (Claude) and 82% (Model A) is stark. Claude completes 7 questions, which could be coincidence or common phrasing. Model A completes 41, which is definitive memorization.
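With only 50 questions per model, it's worth checking that the gap isn't sampling noise. A quick sketch using the Wilson score interval (my choice of interval here, at an assumed 95% level) on the two counts from the table:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


for name, completed in [("Claude 3.7 Sonnet", 7), ("Model A", 41)]:
    low, high = wilson_interval(completed, 50)
    print(f"{name}: {completed}/50, 95% interval {low:.0%} to {high:.0%}")
```

The intervals come out at roughly 7% to 26% for Claude and roughly 69% to 90% for Model A. They don't come close to overlapping, so the sample size doesn't explain the difference. The interval only covers sampling noise, though; it says nothing about whether a given completion came from memorization or from common phrasing.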
Impact on benchmark scores
Here's where it gets interesting. I compared each model's MMLU score to its contamination rate:
| Model | MMLU score | Contamination rate | "Clean" MMLU estimate |
|-------|-----------|-------------------|---------------------|
| Model A | 91.2% | 82% | ~78% (estimated) |
| Model B | 89.1% | 76% | ~76% (estimated) |
| Claude 3.7 Sonnet | 89.4% | 14% | ~88% (estimated) |
| GPT-4o | 88.7% | 22% | ~86% (estimated) |
| Gemini 2.5 Pro | 89.0% | 16% | ~87% (estimated) |
Sources: LMSYS Chatbot Arena, Hugging Face, my contamination testing.
Model A scores 91.2% on MMLU but has 82% contamination. If we adjust for memorized questions (rough estimate: assume memorized questions are answered correctly, non-memorized questions match the model's "true" ability), the clean score drops to roughly 78%.
Claude 3.7 Sonnet has 14% contamination and scores 89.4%. Its "clean" score is probably around 88%. Almost the same.
This means Model A's apparent 2-point lead over Claude on MMLU might actually be a 10-point deficit on non-memorized questions.
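The adjustment described above can be written out explicitly. This is a sketch of the strictest literal reading: every memorized question counts as correct, and the contamination rate from the 50-question sample is assumed to hold across the whole benchmark. The function name `clean_score` is mine.

```python
def clean_score(observed: float, contamination: float) -> float:
    """Solve observed = c * 1.0 + (1 - c) * clean for the clean accuracy.

    observed and contamination are fractions in [0, 1]. Assumes every
    memorized question is answered correctly (a strong assumption).
    """
    if not 0 <= contamination < 1:
        raise ValueError("contamination must be in [0, 1)")
    return (observed - contamination) / (1 - contamination)


# (MMLU score, contamination rate) pairs taken from the table above.
rows = {
    "Model A": (0.912, 0.82),
    "Model B": (0.891, 0.76),
    "Claude 3.7 Sonnet": (0.894, 0.14),
    "GPT-4o": (0.887, 0.22),
    "Gemini 2.5 Pro": (0.890, 0.16),
}
for name, (score, rate) in rows.items():
    print(f"{name}: {clean_score(score, rate):.1%}")
```

Under this strict reading the low-contamination rows land close to the table (about 88%, 86%, and 87%), while Model A and Model B come out near 51% and 55%, well below the table's ~78% and ~76%. The table's estimates for those two seem to reflect a gentler adjustment than this sketch, so treat both sets of numbers as rough. The direction of the effect is the same either way.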
The Chatbot Arena comparison
| Benchmark | Favors memorization? | Model A rank | Claude 3.7 rank |
|-----------|---------------------|-------------|-----------------|
| MMLU | Yes (fixed questions) | #1 | #3 |
| Chatbot Arena | No (human votes, novel prompts) | #8 | #2 |
| SWE-bench | Partially (fixed repos) | #6 | #1 |
Sources: LMSYS Chatbot Arena, SWE-bench, Papers With Code.
Model A ranks #1 on MMLU but #8 on Chatbot Arena. The gap between its performance on memorizable benchmarks vs novel human prompts is the smoking gun.
Which benchmarks are most contaminated?
| Benchmark | Avg contamination rate (15 models) | Risk level |
|-----------|-----------------------------------|-----------|
| MMLU | 31% | High |
| HellaSwag | 28% | High |
| ARC-Challenge | 22% | Medium |
| GPQA | 8% | Low (newer, less exposure) |
| SWE-bench Verified | 5% | Low (task-based, harder to memorize) |
| Chatbot Arena | 0% | None (live, human-generated) |
Sources: My contamination testing across 15 models, Hugging Face Open LLM Leaderboard.
Older, widely published benchmarks (MMLU, HellaSwag) have the highest contamination rates. Newer benchmarks (GPQA) and live benchmarks (Chatbot Arena) are much cleaner.
What to do about it
| Recommendation | Why |
|---------------|-----|
| Trust Chatbot Arena more than MMLU | Live, novel prompts can't be memorized |
| Use MMLU-Pro instead of MMLU | Harder questions, less contaminated |
| Weight newer benchmarks higher | Less time for training data inclusion |
| Run contamination tests yourself | My methodology takes about 2 hours |
| Treat all static benchmarks with skepticism | Any published test set will eventually be contaminated |
Scale AI, Stanford's HELM, and LMSYS are working on contamination-resistant evaluation. Until those efforts mature, every static benchmark score should come with an asterisk.
I've been saying "benchmarks are imperfect" for three years. The data now shows they're more imperfect than I realized.
-- dataku