Benchmark Analysis · February 22, 2021 · 9 min read

Every AI benchmark from 2020, ranked by how much it actually tells you

I went through 14 major benchmarks used in 2020 AI papers. Some are genuinely useful. Some are theater. Here's my ranking with the data to back it up.

My morning routine has gotten weird. Coffee, check arXiv, open a spreadsheet. In that order. And for the past three months, that spreadsheet has been dedicated to a single question: which AI benchmarks actually mean something?

I went through every major NLP and reasoning benchmark that showed up in 2020 papers. Fourteen of them. I read the original papers, checked how they're scored, looked at what "human-level performance" means for each one, and tracked how often they get cited.

Some of these benchmarks are brilliant. Some are performance theater. Let me show you what the data says.

The full ranking

I scored each benchmark on five criteria:

  • Task diversity (1-5): Does it test one narrow thing or a range of capabilities?
  • Ceiling clarity (1-5): Is "human-level" performance well-defined and measured properly?
  • Gaming resistance (1-5): How hard is it to get a high score through shortcuts instead of genuine understanding?
  • Adoption (1-5): How widely used is it in published research?
  • Signal quality (1-5): Does a score difference actually tell you something useful?

Here's the full table, sorted by total score:

| Rank | Benchmark | Task diversity | Ceiling clarity | Gaming resistance | Adoption | Signal quality | Total (/25) |
|------|-----------|----------------|-----------------|-------------------|----------|----------------|-------------|
| 1 | SuperGLUE | 5 | 4 | 4 | 5 | 5 | 23 |
| 2 | MMLU | 5 | 5 | 4 | 3 | 5 | 22 |
| 3 | HellaSwag | 3 | 4 | 5 | 4 | 4 | 20 |
| 4 | TriviaQA | 3 | 4 | 3 | 4 | 4 | 18 |
| 5 | WinoGrande | 2 | 5 | 4 | 4 | 3 | 18 |
| 6 | GLUE | 4 | 3 | 2 | 5 | 3 | 17 |
| 7 | ARC (Challenge) | 3 | 4 | 3 | 3 | 4 | 17 |
| 8 | SQuAD 2.0 | 2 | 4 | 2 | 5 | 3 | 16 |
| 9 | BoolQ | 1 | 4 | 3 | 4 | 3 | 15 |
| 10 | LAMBADA | 2 | 3 | 3 | 4 | 3 | 15 |
| 11 | PIQA | 2 | 3 | 3 | 3 | 3 | 14 |
| 12 | COPA | 1 | 3 | 2 | 3 | 3 | 12 |
| 13 | Winograd Schema | 1 | 3 | 2 | 4 | 2 | 12 |
| 14 | ANLI | 3 | 2 | 4 | 2 | 1 | 12 |
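If you want to re-run the ranking yourself, here is a minimal Python sketch of how the totals fall out of the five criteria. The scores are copied straight from the table; the names `SCORES` and `rank_benchmarks` are mine, made up for illustration, and the equal weighting is simply what a plain sum implies.

```python
# Scores copied from the ranking table, in this column order:
# task diversity, ceiling clarity, gaming resistance, adoption, signal quality.
SCORES = {
    "SuperGLUE": (5, 4, 4, 5, 5),
    "MMLU": (5, 5, 4, 3, 5),
    "HellaSwag": (3, 4, 5, 4, 4),
    "TriviaQA": (3, 4, 3, 4, 4),
    "WinoGrande": (2, 5, 4, 4, 3),
    "GLUE": (4, 3, 2, 5, 3),
    "ARC (Challenge)": (3, 4, 3, 3, 4),
    "SQuAD 2.0": (2, 4, 2, 5, 3),
    "BoolQ": (1, 4, 3, 4, 3),
    "LAMBADA": (2, 3, 3, 4, 3),
    "PIQA": (2, 3, 3, 3, 3),
    "COPA": (1, 3, 2, 3, 3),
    "Winograd Schema": (1, 3, 2, 4, 2),
    "ANLI": (3, 2, 4, 2, 1),
}


def rank_benchmarks(scores):
    """Return (name, total) pairs sorted by total, highest first.

    Python's sort is stable, so benchmarks with equal totals keep the
    order they have in `scores`.
    """
    totals = [(name, sum(values)) for name, values in scores.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


ranking = rank_benchmarks(SCORES)
for position, (name, total) in enumerate(ranking, start=1):
    print(f"{position:>2}. {name:<16} {total}/25")
```

Every criterion counts equally here. If you care more about, say, gaming resistance than adoption, swapping `sum(values)` for a weighted sum is a one-line change, and the order in the middle of the table moves around quite a bit.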

Now let me walk through the top and bottom of this list, because the stories behind the numbers are where it gets interesting.

The gold standard: SuperGLUE

SuperGLUE earned the top spot and I don't think it's particularly close. It tests eight different tasks (boolean QA, textual entailment, word sense disambiguation, coreference resolution, causal reasoning, and more). The human baselines are well-measured. And the task diversity means you can't game one subtask and ride it to a high composite score.

The key stat: throughout 2020, no model exceeded human-level performance on SuperGLUE. T5-11B had been sitting at 89.3 on the leaderboard since late 2019, just below the human baseline of 89.8. It wasn't until the first week of January 2021 that DeBERTa (90.3) and T5 + Meena (90.2) edged past it. That's a clean, meaningful signal. When a model beats humans on SuperGLUE, it has passed multiple genuinely hard tests.

One knock against it: the tasks are all English-centric. But for English NLP evaluation in 2020, nothing was better.

The rising star: MMLU

MMLU (Massive Multitask Language Understanding) deserves the hype. 57 subjects from elementary math to professional law to abstract algebra. The questions are pulled from real exams. Human performance is measured per subject with proper methodology.

What I love about MMLU is the granularity. You can see that GPT-3 (175B) scores 43.9% overall in the few-shot setting, but that breaks down to 26% on college chemistry and 69% on US foreign policy. That kind of per-subject data tells you something real about what a model knows and doesn't know.

The only reason it's not #1 is adoption. In 2020 papers, MMLU was still new. Not enough models had been tested on it yet. That's changing fast in 2021.

The surprise: HellaSwag

HellaSwag is narrower than SuperGLUE or MMLU. It only tests commonsense reasoning about physical situations and everyday activities. But it ranked third because of one thing: it's incredibly hard to game.

The test uses "adversarial filtering." The wrong answers were specifically generated by a language model and then filtered to keep only the ones that fool other language models. This means a model can't just pattern-match its way to a good score. It actually has to understand the scenario.
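To make the filtering idea concrete, here is a toy Python sketch of just the "keep the ones that fool the detector" step. Everything in it is an assumption for illustration: `filter_hard_distractors`, `toy_discriminator`, the threshold, and the example sentences are all mine, not anything from the HellaSwag code. Real adversarial filtering also retrains the discriminator between rounds, which this sketch leaves out.

```python
from typing import Callable, List


def filter_hard_distractors(
    context: str,
    candidates: List[str],
    discriminator: Callable[[str, str], float],
    threshold: float = 0.5,
    keep: int = 3,
) -> List[str]:
    """Keep the machine-written endings that the discriminator fails to flag.

    `discriminator(context, ending)` is a hypothetical scorer returning the
    probability that `ending` is machine-written. Endings scored below
    `threshold` fooled it, so they make hard wrong answers.
    """
    scored = [(discriminator(context, ending), ending) for ending in candidates]
    fooled = [(prob, ending) for prob, ending in scored if prob < threshold]
    fooled.sort(key=lambda pair: pair[0])  # most convincing endings first
    return [ending for _, ending in fooled[:keep]]


def toy_discriminator(context: str, ending: str) -> float:
    """Stand-in for a trained model: flags endings that share few words with the context."""
    context_words = set(context.lower().split())
    ending_words = set(ending.lower().split())
    overlap = len(context_words & ending_words) / max(len(ending_words), 1)
    return 1.0 - overlap


context = "she pours water into the kettle"
candidates = [
    "she turns the kettle on",
    "the dog flies to the moon",
    "purple ideas sleep loudly",
]
hard = filter_hard_distractors(context, candidates, toy_discriminator)
print(hard)
```

The word-overlap detector is deliberately dumb. Swap in a real classifier and the same loop keeps only the distractors that classifier can't tell apart from human text, which is exactly what makes the surviving wrong answers hard to pattern-match.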

In 2020, the best model (ALBERT-xxlarge) hit 90.5%. Sounds high, but human performance is 95.6%. That 5-point gap represents genuine reasoning difficulty, not measurement noise.

The benchmarks with problems

Now let me get opinionated, because I think a few widely-used benchmarks are giving misleading signals.

GLUE: important but outdated

GLUE was essential when it launched in 2018. But by mid-2019, multiple models had already surpassed human-level performance on the composite score. When your benchmark is "solved," it stops being informative. That's why SuperGLUE exists.

The problem: papers in 2020 were STILL reporting GLUE scores. Why? Because nearly every model beats human-level, so it makes your results look good. That's not evaluation. That's marketing.

SQuAD 2.0: narrow and gameable

SQuAD 2.0 tests reading comprehension by asking questions about Wikipedia passages. It's well-constructed, but it only tests one thing: extractive question answering. And models have learned to exploit passage structure patterns to find answers without truly understanding the text.

The F1 scores on SQuAD 2.0 are above 90 for most large models now. The benchmark is approaching saturation. I counted 23 papers on Papers With Code that reported SQuAD 2.0 results in the second half of 2020. In almost all of them, the differences between models were within 1-2 F1 points. That's noise, not signal.

Winograd Schema: the original promise didn't hold

The original Winograd Schema Challenge was supposed to test genuine language understanding through pronoun resolution that requires world knowledge. Great idea. In practice, models found statistical shortcuts.

WinoGrande (the bigger, harder version) is much better and deserves credit. But the original Winograd Schema results you see in papers? Take them with a big grain of salt. High scores don't mean what you think they mean.

ANLI: ambitious but noisy

Adversarial NLI collects examples that are specifically designed to trick models. Conceptually brilliant. In practice, the inter-annotator agreement is low, which means humans disagree about the right answer. When humans can't agree, a model's score is hard to interpret. Is it wrong, or is the label wrong?

I looked at the human accuracy numbers: about 90% on Round 1, dropping to 83% on Round 3. That's a lot of noise for a benchmark. Models scoring 50-60% on ANLI might be hitting the ceiling of what the messy labels allow.

What the citation data shows

I pulled citation counts from Semantic Scholar for the original benchmark papers. Here's how they looked as of early 2021:

| Benchmark | Paper year | Citations (Jan 2021) | Still informative? |
|-----------|------------|----------------------|--------------------|
| GLUE | 2018 | 3,200+ | Saturated |
| SQuAD 2.0 | 2018 | 2,800+ | Nearly saturated |
| SuperGLUE | 2019 | 1,100+ | Yes |
| WinoGrande | 2019 | 450+ | Yes |
| HellaSwag | 2019 | 500+ | Yes |
| MMLU | 2020 | 120+ | Yes (growing fast) |
| ANLI | 2019 | 350+ | Debatable |

GLUE and SQuAD 2.0 have the most citations because they've been around longest. But high citation count doesn't mean a benchmark is still useful. It just means it's popular. There's a difference.

What I think is missing

After spending three months staring at these benchmarks, here's what I think the field still needs:

1. Long-form generation quality metrics. All of these benchmarks test comprehension or classification. None of them measure whether a model can write a coherent 500-word essay. That's a massive gap because writing quality is what most people actually care about with GPT-3.

2. Multilingual evaluation that isn't an afterthought. XTREME exists, but it's not widely adopted. Almost all of the top-cited benchmarks are English-only. In 2021, that feels like a blind spot.

3. Real-time contamination tracking. Benchmark data leaks into training sets. It happens constantly. I found at least three papers in 2020 that discussed potential contamination but had no systematic way to measure it. We're testing models on data they might have memorized. That's a problem nobody has solved.

4. Cost-normalized scores. Nobody reports benchmark results per dollar of training compute. A model that scores 2 points higher but costs 50x more to train isn't really "better" in any practical sense. I'd love to see score-per-dollar tables. (Maybe I'll build one.)
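For point 4, a score-per-dollar table could start as small as the Python sketch below. The two models and their costs are made up to mirror the "2 points higher, 50x the cost" case; they are not real results, and `Result` and `cost_table` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Result:
    model: str
    score: float           # benchmark score, in points
    train_cost_usd: float  # training compute cost, in dollars


def cost_table(results: List[Result]) -> List[Tuple[str, float, float, float, Optional[float]]]:
    """Build rows of (model, score, cost, points per $1k, dollars per extra point).

    The last column is measured against the cheapest model, and is None
    when a model does not beat that baseline's score.
    """
    baseline = min(results, key=lambda r: r.train_cost_usd)
    rows = []
    for r in sorted(results, key=lambda r: r.train_cost_usd):
        points_per_k = r.score / (r.train_cost_usd / 1000)
        gain = r.score - baseline.score
        extra_cost = r.train_cost_usd - baseline.train_cost_usd
        marginal = extra_cost / gain if gain > 0 else None
        rows.append((r.model, r.score, r.train_cost_usd, points_per_k, marginal))
    return rows


# Made-up numbers for illustration only.
results = [
    Result("model-small", 88.0, 10_000),
    Result("model-large", 90.0, 500_000),
]
rows = cost_table(results)
for model, score, cost, per_k, marginal in rows:
    marginal_text = "n/a" if marginal is None else f"${marginal:,.0f}"
    print(f"{model:<12} {score:5.1f}  ${cost:>9,.0f}  {per_k:6.2f} pts/$1k  {marginal_text} per extra point")
```

I'd report the marginal column next to the raw ratio. Points per dollar alone always flatters the cheapest model, while dollars per extra point shows what the last bit of score actually cost.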

My hot take

Here it is. One opinion, stated plainly:

I think the AI research community has a benchmark addiction that actively slows progress. Teams optimize for benchmark scores because that's what gets papers accepted. The benchmarks become the goal instead of a measurement tool. And when a benchmark gets "solved," instead of moving on, people keep reporting scores on it because the numbers look impressive.

SuperGLUE and MMLU are doing it right. Build hard, diverse tests with clear human baselines. When models catch up, build harder ones. But the field needs to let go of GLUE, early SQuAD, and the original Winograd Schema. They've done their job. It's time to move on.

I'll be tracking 2021 benchmark developments in future posts. MMLU is the one I'm watching most closely. If it becomes the default benchmark for language model evaluation, that'll be a genuine improvement for the field.

Until then, I'll be in my spreadsheet. As usual.

-- dataku
