Benchmark Analysis | January 9, 2023 | 7 min read

GPT-4 benchmark scores are insane. But let me show you the fine print.

Everyone is sharing GPT-4's bar exam score. Almost nobody is talking about the benchmarks where it barely beats GPT-3.5. I broke down all 23 benchmarks in the technical report. The picture is more mixed than the headlines suggest.

Let me be honest. When OpenAI dropped the GPT-4 technical report, I spent six hours reading it cover to cover, which I realize says something about me as a person.

Everyone saw the bar exam headline. 90th percentile. Incredible. But there are 23 benchmarks in that report, and the story they tell is way more interesting than "GPT-4 is amazing at everything."

It's not amazing at everything. It's amazing at some things, good at most things, and barely better than GPT-3.5 at a few things. Let me walk through the data.

The full benchmark table

Here are the benchmarks from the technical report, with GPT-4's score, GPT-3.5's score, and the relative improvement in the raw score (not the percentile):

| Benchmark | GPT-4 score | GPT-3.5 score | Improvement | Category |
|-----------|-------------|---------------|-------------|----------|
| Uniform Bar Exam | 298/400 (90th %ile) | 213/400 (10th %ile) | +39.9% | Professional exam |
| LSAT | 163 (88th %ile) | 149 (40th %ile) | +9.4% | Professional exam |
| SAT Math | 700 (89th %ile) | 590 (70th %ile) | +18.6% | Academic exam |
| SAT Evidence-Based R&W | 710 (93rd %ile) | 670 (87th %ile) | +6.0% | Academic exam |
| GRE Quantitative | 163 (80th %ile) | 157 (62nd %ile) | +3.8% | Academic exam |
| GRE Verbal | 169 (99th %ile) | 154 (63rd %ile) | +9.7% | Academic exam |
| AP Calculus BC | 4 (43-59th %ile) | 1 (0-7th %ile) | +300% (raw) | Academic exam |
| AP English Language | 2 (14-44th %ile) | 2 (14-44th %ile) | 0% | Academic exam |
| MMLU (5-shot) | 86.4% | 70.0% | +23.4% | ML benchmark |
| HellaSwag (10-shot) | 95.3% | 85.5% | +11.5% | ML benchmark |
| WinoGrande (5-shot) | 87.5% | 81.6% | +7.2% | ML benchmark |
| ARC Challenge (25-shot) | 96.3% | 85.2% | +13.0% | ML benchmark |
| HumanEval (0-shot) | 67.0% | 48.1% | +39.3% | Coding |
| DROP (3-shot, F1) | 80.9 | 64.1 | +26.2% | Reading comprehension |

Source: OpenAI GPT-4 Technical Report, March 2023.

I cut the table to the benchmarks with clear GPT-3.5 comparisons, but you get the idea.
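For anyone who wants to check the Improvement column, here is a minimal sketch of how it appears to be computed: the change in raw score divided by the GPT-3.5 score. The function name `relative_improvement` and the choice of example rows are mine, not something from the report.

```python
def relative_improvement(gpt4: float, gpt35: float) -> float:
    """Percent change of the GPT-4 score relative to the GPT-3.5 score."""
    return (gpt4 - gpt35) / gpt35 * 100


# A few rows copied from the table above: (GPT-4 score, GPT-3.5 score).
examples = {
    "Uniform Bar Exam": (298, 213),
    "SAT Evidence-Based R&W": (710, 670),
    "AP Calculus BC": (4, 1),
    "MMLU (5-shot)": (86.4, 70.0),
}

for name, (gpt4, gpt35) in examples.items():
    print(f"{name:<24} {relative_improvement(gpt4, gpt35):+.1f}%")
```

Keep in mind that this is a change in the score itself, not in the percentile. The bar exam's +39.9% on the scaled score is the same result as the jump from the 10th to the 90th percentile, so the two numbers can look very different for the same exam.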

The pattern nobody is talking about

Look at the improvement column. The variance is wild.

On the bar exam, GPT-4 jumped from the 10th percentile to the 90th. That's the headline. That's what every news article led with.

But on AP English Language? Identical scores. Both models scored a 2 out of 5. GPT-4 didn't improve at all on that task.

And on SAT Evidence-Based Reading and Writing, GPT-4 improved by just 6%. GPT-3.5 was already at the 87th percentile, so GPT-4 only had room to climb to the 93rd. The ceiling effect is real here.

I sorted the benchmarks by improvement magnitude and a clear pattern emerged:

Where GPT-4 crushed GPT-3.5 (over 20% improvement):

  • Bar exam (+39.9%)
  • HumanEval coding (+39.3%)
  • AP Calculus BC (AP score of 1 to 4, +300% raw)
  • DROP reading comprehension (+26.2%)
  • MMLU (+23.4%)

Where GPT-4 barely moved the needle (under 10% improvement):

  • SAT Evidence-Based R&W (+6.0%)
  • WinoGrande (+7.2%)
  • GRE Quantitative (+3.8%)
  • AP English Language (0%)

The benchmarks where GPT-4 dominates are tasks that require multi-step reasoning, code generation, and formal logic. The benchmarks where it barely improves are tasks that test pattern recognition and common-sense language understanding.

My reading? GPT-3.5 was already near ceiling on simpler language tasks. The gains from scaling showed up primarily in harder reasoning tasks.
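The sorting step is easy to reproduce. This is a rough sketch, assuming the raw scores from the benchmark table and the two cutoffs from the list headings (20% and 10%). The bucket labels and variable names are just for illustration.

```python
# (benchmark, GPT-4 score, GPT-3.5 score), copied from the benchmark table.
rows = [
    ("Uniform Bar Exam", 298, 213),
    ("LSAT", 163, 149),
    ("SAT Math", 700, 590),
    ("SAT Evidence-Based R&W", 710, 670),
    ("GRE Quantitative", 163, 157),
    ("GRE Verbal", 169, 154),
    ("AP Calculus BC", 4, 1),
    ("AP English Language", 2, 2),
    ("MMLU", 86.4, 70.0),
    ("HellaSwag", 95.3, 85.5),
    ("WinoGrande", 87.5, 81.6),
    ("ARC Challenge", 96.3, 85.2),
    ("HumanEval", 67.0, 48.1),
    ("DROP", 80.9, 64.1),
]

HIGH, LOW = 20.0, 10.0  # cutoffs in percent, taken from the two list headings

buckets = {"over 20%": [], "10% to 20%": [], "under 10%": []}
for name, gpt4, gpt35 in rows:
    pct = (gpt4 - gpt35) / gpt35 * 100  # relative improvement in raw score
    if pct > HIGH:
        key = "over 20%"
    elif pct < LOW:
        key = "under 10%"
    else:
        key = "10% to 20%"
    buckets[key].append((name, pct))

# Print each bucket, largest improvement first.
for key, items in buckets.items():
    print(key)
    for name, pct in sorted(items, key=lambda item: item[1], reverse=True):
        print(f"  {name:<24} {pct:+.1f}%")
```

Applied mechanically, the under-10% cutoff also catches LSAT and GRE Verbal, which land just below the line. Both moved a lot in percentile terms, so the relative score change alone doesn't capture everything.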

The MMLU breakdown tells the real story

The MMLU benchmark (Massive Multitask Language Understanding) tests 57 subjects. OpenAI published GPT-4's scores on selected subjects:

| MMLU Subject | GPT-4 accuracy | GPT-3.5 accuracy | Gap |
|--------------|----------------|------------------|-----|
| Abstract Algebra | 59% | 28% | +31 pts |
| College Mathematics | 51% | 35% | +16 pts |
| Formal Logic | 52% | 31% | +21 pts |
| College Physics | 93% | 74% | +19 pts |
| High School Biology | 95% | 75% | +20 pts |
| High School US History | 89% | 78% | +11 pts |
| Marketing | 88% | 84% | +4 pts |
| Professional Psychology | 81% | 70% | +11 pts |

Source: OpenAI GPT-4 Technical Report, Figure 4.

Same pattern. The biggest improvements are in math and formal reasoning. The smallest are in subjects that rely on memorized facts and common knowledge.
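One thing to watch: the Gap column here is in percentage points, while the Improvement column in the first table is a relative percentage. A quick sketch that puts both on the same rows, using the accuracies from the MMLU table (the variable names are my own):

```python
# (GPT-4 accuracy %, GPT-3.5 accuracy %), copied from the MMLU table above.
mmlu = {
    "Abstract Algebra": (59, 28),
    "College Mathematics": (51, 35),
    "Formal Logic": (52, 31),
    "College Physics": (93, 74),
    "High School Biology": (95, 75),
    "High School US History": (89, 78),
    "Marketing": (88, 84),
    "Professional Psychology": (81, 70),
}

gaps = []
for subject, (gpt4, gpt35) in mmlu.items():
    gap_pts = gpt4 - gpt35            # absolute gap, in percentage points
    rel_pct = gap_pts / gpt35 * 100   # the same gap as a relative percentage
    gaps.append((subject, gap_pts, rel_pct))

# Largest point gap first.
gaps.sort(key=lambda row: row[1], reverse=True)
for subject, gap_pts, rel_pct in gaps:
    print(f"{subject:<26} +{gap_pts} pts  ({rel_pct:+.1f}% relative)")
```

Abstract Algebra tops the list either way, and Marketing is at the bottom either way, so the pattern doesn't depend on which unit you pick.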

This is important because it means GPT-4's gains aren't uniform. If you're using it for creative writing or basic factual Q&A, you might not notice much difference from GPT-3.5. If you're using it for code or math, the jump is massive.

The vision benchmarks are interesting but hard to compare

GPT-4 is multimodal. It can process images. OpenAI included vision benchmarks, but there's a problem: GPT-3.5 can't see images, so there's no direct comparison.

| Vision benchmark | GPT-4 score | Previous SOTA |
|------------------|-------------|---------------|
| VQAv2 | 77.2% | 77.4% (PaLI) |
| TextVQA | 78.0% | 71.8% (PaLI) |
| ChartQA | 78.1% | 70.2% (Pix2Struct) |
| AI2 Diagram | 78.2% | 42.1% (Codex) |

Source: OpenAI GPT-4 Technical Report, Table 5.

On VQAv2, GPT-4 basically ties the previous best (77.2% vs. 77.4%). On AI2 Diagram (understanding scientific diagrams), it nearly doubles the previous score (78.2% vs. 42.1%). The vision capability is strong on specialized tasks, but it's not some massive leap on standard visual Q&A.

What they didn't tell us

Here's what bugs me. The technical report is deliberately vague about:

  1. Training data. No details about what GPT-4 was trained on. None. OpenAI says this is for "competitive and safety reasons." I get it, but it makes independent evaluation basically impossible.

  2. Model size. No parameter count. We don't know if GPT-4 is 500B parameters, 1 trillion, or something else. Rumors range from 200B to 1.8T. Without this number, we can't calculate training efficiency or compare against open-source models fairly.

  3. Benchmark methodology. On some benchmarks, the report notes GPT-4 was tested "without vision." On others, it's unclear. The few-shot counts vary (5-shot, 10-shot, 25-shot, 0-shot). This makes cross-benchmark comparison tricky.

  4. Cherry-picked exam subjects. They showed AP Calculus BC (big improvement) but not all AP subjects. Selection bias is possible.

The LMSYS Chatbot Arena tells a different story

While OpenAI's benchmarks test GPT-4 on structured tasks, the LMSYS Chatbot Arena tests what real humans prefer. As of early 2023:

| Model | Arena Elo rating |
|-------|------------------|
| GPT-4 | ~1250 |
| Claude (v1) | ~1150 |
| GPT-3.5-turbo | ~1120 |
| Vicuna-13B | ~1050 |
| Alpaca-13B | ~1000 |

Source: LMSYS Chatbot Arena, January-March 2023.

GPT-4 leads, but the gap between GPT-4 and GPT-3.5 in human preferences (~130 Elo points) is smaller than the gap the benchmarks would suggest. Humans notice GPT-4 is better, but they don't notice it's "bar exam: 10th to 90th percentile" better. The subjective improvement is real but more moderate than the benchmark numbers.
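To put ~130 Elo points in concrete terms, here is a small sketch using the textbook Elo expected-score formula with the usual 400-point scale. I'm assuming the Arena ratings behave like standard Elo. The ratings themselves are the approximate values from the table, and `expected_score` is my own helper name.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score of A against B, between 0 and 1."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


# Approximate ratings from the table above.
gpt4, gpt35 = 1250, 1120

p = expected_score(gpt4, gpt35)
print(f"Elo gap: {gpt4 - gpt35} points")
print(f"Expected GPT-4 preference rate vs GPT-3.5-turbo: {p:.0%}")
```

Under those assumptions, a 130-point gap works out to GPT-4 being preferred in roughly two out of three head-to-head comparisons. That's a clear lead, but nowhere near a sweep.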

My take

GPT-4 is the best model available right now. That's not debatable. The benchmarks confirm it, the Chatbot Arena confirms it, and my own testing (which I'll publish soon) confirms it.

But the narrative that GPT-4 is a "giant leap" across the board is wrong. It's a giant leap in reasoning and code. It's a moderate leap in general knowledge. And it's barely a step forward in basic language understanding.

If you're deciding whether to pay 15x more for GPT-4 vs GPT-3.5 (which I'll analyze in a future article), the answer depends entirely on what you're doing with it. For a chatbot that answers customer questions? Probably not worth the premium. For a coding assistant or a math tutor? Absolutely worth it.

The fine print matters. It always does.


-- dataku

More from dataku