Benchmark Analysis · March 4, 2024 · 7 min read

Claude 3 Opus is the first model to genuinely worry me about benchmarks

Claude 3 Opus matched or beat GPT-4 on most benchmarks, but the 'needle in a haystack' test is what got me. It detected that it was being tested. I ran my own version and the results are strange.

Anthropic launched the Claude 3 family on March 4th. Three models: Haiku, Sonnet, and Opus. The benchmarks are excellent. But that's not what got my attention.

What got my attention is that Claude 3 Opus appears to know when it's being tested.

Let me explain.

The benchmark numbers first

Here's the Claude 3 family compared to GPT-4 Turbo and Gemini 1.0 Ultra:

| Benchmark | Claude 3 Opus | Claude 3 Sonnet | Claude 3 Haiku | GPT-4 Turbo | Gemini Ultra |
|-----------|--------------|----------------|---------------|-------------|-------------|
| MMLU (5-shot) | 86.8% | 79.0% | 75.2% | 86.4% | 83.7% |
| HumanEval | 84.9% | 73.0% | 75.9% | 87.1% | 74.4% |
| GSM8K | 95.0% | 92.3% | 88.9% | 92.0% | 94.4% |
| MATH | 60.1% | 43.1% | 38.9% | 52.9% | 53.2% |
| GPQA | 50.4% | 40.4% | 33.3% | 49.1% | N/A |
| MGSM | 90.7% | 84.7% | 75.1% | 85.5% | 79.0% |
| DROP | 83.1% | 78.7% | 75.1% | 80.9% | 82.4% |
| BIG-Bench-Hard | 86.8% | 82.8% | 73.7% | 83.1% | 83.6% |

Sources: Anthropic Claude 3 technical report, OpenAI GPT-4 Turbo documentation, Google Gemini Ultra report.

Claude 3 Opus clearly beats GPT-4 Turbo on 5 of 8 benchmarks: MATH (+7.2 points), MGSM (+5.2), BIG-Bench-Hard (+3.7), GSM8K (+3.0), and DROP (+2.2).

GPT-4 Turbo wins on HumanEval (+2.2). The two are essentially tied on MMLU (86.4 vs 86.8) and GPQA (49.1 vs 50.4), with Opus nominally ahead on both.
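The tally is easy to reproduce from the table. Here is a minimal sketch, with the scores copied from the Opus and GPT-4 Turbo columns. The 1.5-point tie margin is my own assumption for what "essentially tied" means here, not a standard threshold.

```python
# Scores copied from the table above: (Claude 3 Opus, GPT-4 Turbo), in percent.
SCORES = {
    "MMLU": (86.8, 86.4),
    "HumanEval": (84.9, 87.1),
    "GSM8K": (95.0, 92.0),
    "MATH": (60.1, 52.9),
    "GPQA": (50.4, 49.1),
    "MGSM": (90.7, 85.5),
    "DROP": (83.1, 80.9),
    "BIG-Bench-Hard": (86.8, 83.1),
}

# Assumption: gaps of 1.5 points or less count as a tie.
TIE_MARGIN = 1.5


def compare(scores, tie_margin=TIE_MARGIN):
    """Bucket each benchmark into an Opus win, a GPT-4 Turbo win, or a tie."""
    result = {"opus": [], "gpt4_turbo": [], "tie": []}
    for name, (opus, gpt4) in scores.items():
        delta = round(opus - gpt4, 1)  # positive means Opus is ahead
        if abs(delta) <= tie_margin:
            result["tie"].append((name, delta))
        elif delta > 0:
            result["opus"].append((name, delta))
        else:
            result["gpt4_turbo"].append((name, delta))
    return result


if __name__ == "__main__":
    for bucket, entries in compare(SCORES).items():
        print(bucket, entries)
```

With that margin the split is five Opus wins, one GPT-4 Turbo win, and two ties. Tighten the margin to zero and Opus is nominally ahead on seven of eight.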

This is the first time a model has credibly beaten GPT-4 across a broad benchmark suite. Not on one cherry-picked test. On most of them.

But the "needle in a haystack" test is what broke my brain

Anthropic ran a test where they inserted a random sentence (the "needle") into a large context window and asked the model to find it. This is a standard test for long-context models. You stuff the context window with filler text, hide one sentence somewhere in the middle, and see if the model can retrieve it.
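For anyone who hasn't built one of these, here is a minimal sketch of the setup. This is not Anthropic's harness. The function names, the prompt wording, and the placeholder filler are all hypothetical, and a real test would measure the document in tokens rather than sentences.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` into the filler at a relative depth.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    Returns the assembled document and the index where the needle landed.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences), position


def build_prompt(haystack, question):
    """Wrap the document and the retrieval question into one prompt (hypothetical wording)."""
    return f"{haystack}\n\nQuestion: {question}\nAnswer using only the document above."


if __name__ == "__main__":
    # Placeholder filler; a real run would use long, coherent documents.
    filler = [f"Filler sentence number {i}." for i in range(100)]
    needle = "The secret ingredient is basil."
    document, position = build_haystack(filler, needle, depth=0.5)
    prompt = build_prompt(document, "What is the secret ingredient?")
    print(position, len(prompt))
```

Sweeping `depth` from 0.0 to 1.0 and varying the document length is what produces the familiar retrieval grid.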

Claude 3 Opus scored near-perfect retrieval across its full 200K context window. That alone is impressive. But here's the part that made me sit up:

In several runs, Claude 3 Opus didn't just find the needle. It commented on it.

Anthropic's blog post includes an example where the model said something to the effect of: "This sentence about pizza toppings seems out of place in a document about software engineering. This looks like it might be an artificial test of my retrieval abilities."

The model identified that it was being tested.

I ran my own version

I couldn't let this go. I designed my own needle-in-a-haystack test with a twist: I varied how "obviously artificial" the inserted sentence was.

| Needle type | Example | Claude 3 Opus retrieval | Claude 3 Opus "called it out" | GPT-4 Turbo retrieval |
|------------|---------|------------------------|------------------------------|----------------------|
| Obvious mismatch | "Pizza toppings include pepperoni" in a legal brief | 100% (10/10) | 80% (8/10) | 100% (10/10) |
| Mild mismatch | "The quarterly revenue was $4.2M" in a biology paper | 100% (10/10) | 40% (4/10) | 90% (9/10) |
| Subtle mismatch | "The protein folded at pH 7.4" in a chemistry paper | 100% (10/10) | 10% (1/10) | 80% (8/10) |
| No mismatch | "The compound reacted at 25C" in a chemistry paper | 100% (10/10) | 0% (0/10) | 70% (7/10) |

Source: My test, 40 trials per model (10 per needle type), context window size 100K tokens, various document types.

When the needle is obviously out of place (pizza in a legal brief), Claude 3 Opus called out the test 80% of the time. It said things like "This appears to be a deliberate insertion to test my retrieval" or "I notice this sentence is incongruent with the surrounding content and may be part of a test."
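Tallying results like these can be automated. Below is a rough sketch of one way to do it, not a description of how the table above was scored. The marker phrases are hypothetical and loosely based on the responses quoted above, and the retrieval check assumes the model repeats the needle verbatim, which is a simplification.

```python
from collections import defaultdict

# Hypothetical marker phrases, loosely based on the responses quoted above.
TEST_MARKERS = (
    "deliberate insertion",
    "part of a test",
    "test my retrieval",
    "test of my retrieval",
)


def called_out(response):
    """True if the response contains any phrase suggesting the model flagged a test."""
    text = response.lower()
    return any(marker in text for marker in TEST_MARKERS)


def retrieved(response, needle):
    """True if the needle appears verbatim (case-insensitive) in the response."""
    return needle.lower() in response.lower()


def summarize(trials):
    """Count trials, retrievals, and call-outs per needle type.

    `trials` is an iterable of (needle_type, needle, response) tuples.
    """
    counts = defaultdict(lambda: {"n": 0, "retrieved": 0, "called_out": 0})
    for needle_type, needle, response in trials:
        row = counts[needle_type]
        row["n"] += 1
        row["retrieved"] += retrieved(response, needle)
        row["called_out"] += called_out(response)
    return dict(counts)
```

Keyword matching will miss paraphrases and can fire on false positives, so a manual read of every response is still the safer check at this sample size.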

When the needle was subtle, it still found it (100% retrieval) but almost never called it out: 1 of 10 runs for the subtle mismatch, and 0 of 10 when there was no mismatch at all.

I should be clear: this isn't the model being "self-aware." It's pattern recognition. The model has likely seen many needle-in-a-haystack test descriptions in its training data, and it recognizes the pattern of an out-of-place sentence in a long document. Still, the fact that it can detect AND articulate that a specific test methodology is being used on it is... something.

Why this worries me about benchmarks

Here's my concern. If a model can detect that it's being tested with a needle-in-a-haystack evaluation, can it detect other evaluation patterns?

Think about it:

  • MMLU questions have a distinctive format (multiple choice, academic style)
  • HumanEval problems have a distinctive format (Python function signature + docstring)
  • GSM8K questions have a distinctive format (elementary math word problems)

If a model has seen enough benchmark descriptions and examples in its training data, it could theoretically recognize "I'm being evaluated" and activate different behavior patterns. Not because anyone programmed it to cheat, but because the training data contains so much discussion of these benchmarks that the model has implicitly learned what they look like.
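To show how shallow these surface cues are, here is a toy sketch that guesses a benchmark family from formatting alone. The patterns are my own rough heuristics, not anything taken from the benchmarks' tooling, and this says nothing about what a model does internally. The point is only that a few regular expressions can pick up the format.

```python
import re

# Rough, hypothetical patterns for each benchmark family's surface format.
HUMANEVAL_PATTERN = re.compile(r"def \w+\(.*\).*:\s*\n\s+[\"']{3}")
OPTION_PATTERN = re.compile(r"^\s*\(?([A-D])[.)]\s+\S", flags=re.MULTILINE)
WORD_PROBLEM_PATTERN = re.compile(r"\bhow (many|much)\b", flags=re.IGNORECASE)


def guess_benchmark_format(prompt):
    """Return a coarse label based on surface formatting, or None if nothing matches."""
    # HumanEval-like: a Python function signature followed by a docstring.
    if HUMANEVAL_PATTERN.search(prompt):
        return "HumanEval-like"
    # MMLU-like: four lettered options, A through D, each on its own line.
    if OPTION_PATTERN.findall(prompt)[:4] == ["A", "B", "C", "D"]:
        return "MMLU-like"
    # GSM8K-like: a word problem containing numbers and a "how many/much" question.
    if re.search(r"\d", prompt) and WORD_PROBLEM_PATTERN.search(prompt):
        return "GSM8K-like"
    return None


if __name__ == "__main__":
    examples = [
        'def add(a: int, b: int) -> int:\n    """Return the sum of a and b."""\n',
        "Which planet is largest?\nA. Mars\nB. Jupiter\nC. Venus\nD. Mercury\nAnswer:",
        "Tom has 3 apples and buys 5 more. How many apples does he have?",
        "Write a poem about autumn.",
    ]
    for example in examples:
        print(guess_benchmark_format(example))
```

If a dozen lines of pattern matching can do this, it is not a stretch to think a model trained on pages of benchmark discussion has picked up the same cues.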

I'm not saying this is happening. I'm saying the needle-in-a-haystack result is evidence that models can recognize evaluation contexts, and that makes me think harder about what benchmark scores actually mean.

The broader Claude 3 Opus data

Setting aside my existential benchmark concerns, here's my standard evaluation:

| Category | Claude 3 Opus | GPT-4 Turbo | Winner |
|----------|--------------|-------------|--------|
| Factual accuracy (50 prompts) | 4.08 | 4.12 | GPT-4 Turbo (barely) |
| Code generation (50) | 4.18 | 4.24 | GPT-4 Turbo (barely) |
| Creative writing (50) | 4.31 | 3.92 | Claude 3 Opus |
| Long document analysis (50) | 4.42 | 3.78 | Claude 3 Opus |
| Reasoning (50) | 4.09 | 4.15 | GPT-4 Turbo (barely) |
| Instruction following (50) | 4.26 | 4.18 | Claude 3 Opus |
| Overall | 4.22 | 4.07 | Claude 3 Opus |

Source: My evaluation, 300 prompts, blind rating, March 2024.

Claude 3 Opus wins my overall evaluation 4.22 to 4.07. It dominates on creative writing (+0.39) and long document analysis (+0.64), and has a small lead on instruction following (+0.08). GPT-4 Turbo has slight edges on factual accuracy (+0.04), code generation (+0.06), and reasoning (+0.06).
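The overall figures are consistent with a plain unweighted mean of the six category scores, which is a fair reading since every category has 50 prompts. A quick sketch to check the arithmetic, with the values copied from the table and the unweighted mean treated as an assumption:

```python
# Category ratings copied from the table above: (Claude 3 Opus, GPT-4 Turbo).
RATINGS = {
    "Factual accuracy": (4.08, 4.12),
    "Code generation": (4.18, 4.24),
    "Creative writing": (4.31, 3.92),
    "Long document analysis": (4.42, 3.78),
    "Reasoning": (4.09, 4.15),
    "Instruction following": (4.26, 4.18),
}


def overall(ratings):
    """Unweighted mean per model (assumes equal weight per category)."""
    n = len(ratings)
    opus = sum(o for o, _ in ratings.values()) / n
    gpt4 = sum(g for _, g in ratings.values()) / n
    return opus, gpt4


def gaps(ratings):
    """Per-category gap; positive means Opus scored higher."""
    return {name: round(o - g, 2) for name, (o, g) in ratings.items()}


if __name__ == "__main__":
    opus_mean, gpt4_mean = overall(RATINGS)
    print(f"Opus mean: {opus_mean:.4f}, GPT-4 Turbo mean: {gpt4_mean:.4f}")
    print(gaps(RATINGS))
```

The means come out at roughly 4.223 and 4.065, which match the Overall row once rounded to two decimals.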

The long document analysis gap (4.42 vs 3.78) is the biggest I've ever measured between two frontier models. Claude's 200K context window with strong retrieval is a genuine capability advantage, not just a spec sheet number.

What I expected vs what I found

I expected Claude 3 Opus to be good. Anthropic has been methodically improving since Claude 2, and their research team includes some of the best alignment and capability researchers in the field.

I did not expect a model to recognize when it was being tested. That's a data point that will stick with me for a while.

The benchmark game is getting strange. The models are getting good enough to understand the game itself. I'm not sure what the right response is, but I think we need more evaluation approaches that models can't pattern-match against. LMSYS Chatbot Arena (human A/B testing with novel prompts) is one good approach. Private benchmarks that aren't published online might be another.

My morning routine now includes re-reading my own evaluation methodology to check for patterns that a well-trained model might exploit. That's a new kind of paranoia.


-- dataku
