Google Gemini benchmarks vs GPT-4: reading the fine print

Google DeepMind just announced Gemini, and the headline is impressive: Gemini Ultra beats GPT-4 on 30 of 32 benchmarks.

I spent all day reading the technical report. Here's what the headlines aren't telling you.

The headline numbers

Google's published comparison:

| Benchmark | Gemini Ultra | GPT-4 | Winner |
|-----------|-------------|-------|--------|
| MMLU | 90.0% | 86.4% | Gemini (+3.6) |
| HellaSwag (10-shot) | 87.8% | 95.3% | GPT-4 (+7.5) |
| GSM8K (maj@32) | 94.4% | 92.0% | Gemini (+2.4) |
| MATH (4-shot) | 53.2% | 42.5% | Gemini (+10.7) |
| HumanEval (0-shot, it) | 74.4% | 67.0% | Gemini (+7.4) |
| Natural2Code | 74.9% | 73.9% | Gemini (+1.0) |
| BIG-Bench Hard | 83.6% | 83.1% | Gemini (+0.5) |
| DROP (3-shot, F1) | 82.4 | 80.9 | Gemini (+1.5) |

Source: Google's Gemini technical report, December 2023.

Looks decisive, right? Gemini wins on most benchmarks, sometimes by significant margins (MATH +10.7 points!).

But I have questions.

Problem 1: Which GPT-4?

The most important footnote in the report: Google compared Gemini Ultra against GPT-4's launch-day performance (March 2023). OpenAI has updated GPT-4 multiple times since then.

| GPT-4 version | Approximate date | MMLU score | Source |
|---------------|-----------------|------------|--------|
| GPT-4 (launch) | March 2023 | 86.4% | OpenAI technical report |
| GPT-4 (June update) | June 2023 | ~87% | Community testing |
| GPT-4 Turbo | November 2023 | ~87-88% | Community testing, OpenAI |

Source: OpenAI technical report and community evaluations via LMSYS Chatbot Arena.

If the current GPT-4 Turbo scores 87-88% on MMLU, Gemini Ultra's lead (at 90.0%) shrinks from 3.6 points to roughly 2-3 points. Still a lead, but not the same story.
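If you want to check that arithmetic yourself, here's a minimal Python sketch. The numbers are the ones quoted in this post; the GPT-4 Turbo values are community estimates rather than official scores, and the names (`GPT4_MMLU`, `mmlu_lead`) are just mine for illustration.

```python
# Figures quoted in this post. The GPT-4 Turbo values are community
# estimates (an assumption), not officially reported scores.
GEMINI_ULTRA_MMLU = 90.0  # Google's headline number (CoT@32)

GPT4_MMLU = {
    "GPT-4 (launch)": 86.4,
    "GPT-4 Turbo (low estimate)": 87.0,
    "GPT-4 Turbo (high estimate)": 88.0,
}


def mmlu_lead(gpt4_score: float) -> float:
    """Gemini Ultra's MMLU lead over a given GPT-4 score, in percentage points."""
    return round(GEMINI_ULTRA_MMLU - gpt4_score, 1)


for version, score in GPT4_MMLU.items():
    print(f"{version}: Gemini Ultra leads by {mmlu_lead(score)} points")
```

The lead depends entirely on which GPT-4 row you pick, which is the whole point of this section.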

Google compared against an 8-month-old version of GPT-4. They had access to GPT-4 Turbo (which launched November 6, a month before the Gemini announcement). They chose not to compare against it.

Problem 2: The MMLU methodology

This is the one that really caught my attention.

Google's MMLU score of 90.0% uses a technique they call "CoT@32." That means:

- Chain-of-thought prompting (the model explains its reasoning step by step)
- 32 samples per question (generate 32 answers, take the majority vote)

OpenAI's reported MMLU score of 86.4% uses standard 5-shot prompting. One answer per question. No chain of thought.

These are not the same evaluation methodology.
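To make the difference concrete, here's a minimal Python sketch of the two setups as described above. Everything here is a stand-in: `ask_model` and `make_stub_model` are hypothetical (no real model or API is called), and the sketch only covers the plain majority-vote idea, so the exact procedure in Google's report may include details this skips.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def single_sample_answer(ask_model: Callable[[str], str], question: str) -> str:
    """Standard setup: one generation per question, taken as the final answer."""
    return ask_model(question)


def majority_vote_answer(
    ask_model: Callable[[str], str], question: str, n_samples: int = 32
) -> str:
    """Sample n_samples answers and return the most common one."""
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_stub_model() -> Callable[[str], str]:
    """Hypothetical stand-in for a model: replies 'B' on every third call, else 'A'."""
    replies = cycle(["B", "A", "A"])
    return lambda question: next(replies)


question = "placeholder multiple-choice question"
print("single sample:", single_sample_answer(make_stub_model(), question))
print("majority of 32:", majority_vote_answer(make_stub_model(), question))
```

With this stub, the single draw happens to land on the minority answer while the 32-sample vote settles on the majority one. Voting smooths out unlucky individual samples, so a voted score and a single-shot score aren't measuring the same thing.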

| Method | Gemini Ultra MMLU | GPT-4 MMLU | Notes |
|--------|------------------|------------|-------|
| 5-shot (standard) | 83.7% | 86.4% | Apples to apples |
| CoT@32 | 90.0% | Not reported | Google's headline number |

Source: Gemini technical report, Table 1 and footnotes.

When you compare using the same methodology (5-shot), GPT-4 actually beats Gemini Ultra on MMLU: 86.4% vs 83.7%.

Google buried Gemini's 5-shot score. The headline number (90.0%) uses a different, more favorable methodology. I had to read the footnotes to find the 83.7% number.

Problem 3: HellaSwag

HellaSwag is the one benchmark where GPT-4 clearly wins in Google's own report (95.3% vs 87.8%, a 7.5-point gap). That's a significant deficit for Gemini on a standard common-sense reasoning benchmark.

Google's report doesn't explain why HellaSwag is so much lower. It's the only row in the table where GPT-4 leads, and the gap is bigger than every Gemini lead except MATH. I suspect HellaSwag's format (sentence completion with adversarial distractors) is something GPT-4 handles better, but without more detail from Google, I'm speculating.

Problem 4: The benchmarks they chose

Google included 32 benchmarks. That's a lot. But the selection isn't neutral. Several benchmarks in the table are Google-created or heavily used in Google research:

| Benchmark | Created by | Used in Google papers |
|-----------|-----------|----------------------|
| BIG-Bench | Google (primarily) | Extensively |
| WMT 23 | Community (Google contributes) | Yes |
| Natural2Code | Google | Yes |
| MMMU | Community | Less |

This doesn't mean the benchmarks are biased. But when you choose which 32 benchmarks to include in your comparison, you're making editorial decisions. Including benchmarks your model was specifically optimized for is a form of selection bias.

For comparison, the LMSYS Chatbot Arena tests real user preferences without pre-selected benchmarks. Gemini Ultra hasn't been widely available in the Arena yet, so we don't have crowdsourced comparison data.

Problem 5: "Ultra" isn't available yet

The most important practical detail: Gemini Ultra, the version that beats GPT-4, isn't available to anyone. As of December 2023, only Gemini Pro is accessible (via Bard and the API). Gemini Pro is roughly GPT-3.5 level.

| Model | Available now? | Who has access |
|-------|---------------|---------------|
| Gemini Nano | Yes | Pixel 8 Pro (on-device) |
| Gemini Pro | Yes | Bard, Google Cloud API |
| Gemini Ultra | No | Coming "early 2024" |

So the benchmark comparison is between a model you can use today (GPT-4) and a model you can't use yet (Gemini Ultra). That's... a choice.

What the data actually tells us

If I recompile the comparison using apples-to-apples methodology:

| Benchmark | Gemini Ultra (std method) | GPT-4 (std method) | Winner |
|-----------|--------------------------|--------------------|---------|
| MMLU (5-shot) | 83.7% | 86.4% | GPT-4 (+2.7) |
| HellaSwag (10-shot) | 87.8% | 95.3% | GPT-4 (+7.5) |
| HumanEval (0-shot) | 74.4% | 67.0% | Gemini (+7.4) |
| MATH (4-shot, standard) | ~46% (est.) | 42.5% | Gemini (~+3.5) |
| GSM8K (5-shot, standard) | ~89% (est.) | 92.0% | GPT-4 (~+3.0) |

Sources: Gemini technical report, OpenAI GPT-4 technical report, my estimates for standard methodology where Google reported only CoT@32.

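With standard methodology, it's much closer. GPT-4 wins on MMLU, HellaSwag, and likely GSM8K. Gemini wins on HumanEval and, by my estimate, MATH. Two of those five results rest on my estimates, so treat the exact split with caution. Either way, it's a genuine competition, not the blowout the headlines suggest.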
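Here's a small Python sketch that recomputes the winner column and the tally from the recompiled table. The scores are copied from that table, including my two estimates (flagged as such), and the names `ROWS` and `winner` are only for illustration.

```python
from collections import Counter

# (benchmark, Gemini Ultra, GPT-4, is_estimate), copied from the recompiled table.
# Rows with is_estimate=True use this post's estimates, not reported scores.
ROWS = [
    ("MMLU (5-shot)", 83.7, 86.4, False),
    ("HellaSwag (10-shot)", 87.8, 95.3, False),
    ("HumanEval (0-shot)", 74.4, 67.0, False),
    ("MATH (4-shot, standard)", 46.0, 42.5, True),
    ("GSM8K (5-shot, standard)", 89.0, 92.0, True),
]


def winner(gemini: float, gpt4: float) -> tuple:
    """Return (winner name, margin in points) for one benchmark row."""
    margin = round(abs(gemini - gpt4), 1)
    if gemini == gpt4:
        return ("Tie", margin)
    return ("Gemini" if gemini > gpt4 else "GPT-4", margin)


tally = Counter()
for name, gemini, gpt4, is_estimate in ROWS:
    who, margin = winner(gemini, gpt4)
    tally[who] += 1
    note = " (estimated)" if is_estimate else ""
    print(f"{name}: {who} by {margin} points{note}")

print(dict(tally))
```

Swap in different estimates for the MATH and GSM8K rows and the split can move, so it's worth rerunning once real standard-method numbers exist.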

My take

I'm not saying Gemini Ultra is bad. It looks like a legitimately strong model. The coding scores are impressive. The multimodal capabilities (not covered here, but significant) are real.

What I am saying: the way Google presented the benchmarks was misleading. Comparing against 8-month-old GPT-4 using a non-standard methodology and burying the standard-methodology numbers in footnotes is... well, it's marketing, not data science.

When Gemini Ultra actually ships and the community can test it, we'll know where it really stands. Until then, I'm treating Google's numbers with the same skepticism I apply to any self-reported benchmarks.

Show me the LMSYS Arena scores. Those are the numbers I trust.


-- dataku