Benchmark Analysis | February 3, 2026 | 4 min read

Claude Opus 4.5: Anthropic's latest flagship, benchmarked

Anthropic's newest model. I ran 300 prompts across coding, reasoning, writing, and analysis. Coding scores are the highest I've measured from any model. Reasoning matches o3 with thinking enabled. The gap between Sonnet and Opus has widened again.

Anthropic just released Claude Opus 4.5. I dropped everything and ran my standard evaluation.

The coding numbers are the highest I've ever recorded. Let me show you.

Benchmark comparison

| Benchmark | Claude Opus 4.5 | Claude Opus 4 | o3 | Gemini 2.5 Pro | DeepSeek R2 |
|-----------|-----------------|---------------|-----|----------------|-------------|
| MMLU | 92.4% | 91.2% | 91.4% | 89.0% | 90.8% |
| HumanEval | 98.2% | 97.1% | 94.8% | 95.1% | 95.8% |
| SWE-bench Verified | 64.2% | 58.7% | 51.8% | 41.3% | 54.6% |
| GPQA Diamond | 79.8% | 76.3% | 75.1% | 74.1% | 78.4% |
| MATH (500) | 98.4% | 96.8% | 97.0% | 97.1% | 98.1% |
| LiveCodeBench | 78.6% | 73.4% | 68.4% | 68.2% | 72.4% |
| IFEval | 92.3% | 90.1% | 88.2% | 86.2% | 87.4% |
| Chatbot Arena Elo | 1298 | 1288 | 1262 | 1282 | 1268 |

Sources: Anthropic Claude Opus 4.5 announcement, LMSYS Chatbot Arena, OpenAI, Google DeepMind, DeepSeek.

98.2% on HumanEval. 64.2% on SWE-bench Verified. 78.6% on LiveCodeBench. These are all new records.

SWE-bench Verified jumped from 58.7% (Opus 4) to 64.2%, a 5.5-point gain on real-world bug fixing. As far as I know, that's the largest single-generation improvement on this benchmark from any provider.
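To sanity-check the generation-over-generation gains, here is a minimal sketch that recomputes the percentage-point deltas between Opus 4.5 and Opus 4 from the comparison table. The dictionary layout and variable names are illustrative, not part of any benchmark tooling.

```python
# Scores (percent) copied from the benchmark comparison table:
# benchmark -> (Claude Opus 4.5, Claude Opus 4)
scores = {
    "MMLU": (92.4, 91.2),
    "HumanEval": (98.2, 97.1),
    "SWE-bench Verified": (64.2, 58.7),
    "GPQA Diamond": (79.8, 76.3),
    "MATH (500)": (98.4, 96.8),
    "LiveCodeBench": (78.6, 73.4),
    "IFEval": (92.3, 90.1),
}

# Percentage-point gain of Opus 4.5 over Opus 4, rounded to one decimal
# to hide floating-point noise.
gains = {name: round(new - old, 1) for name, (new, old) in scores.items()}

# Sort benchmarks from largest to smallest gain.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)

for name, gain in ranked:
    print(f"{name:<20} +{gain:.1f}")
```

On these numbers the two largest gains are SWE-bench Verified (+5.5) and LiveCodeBench (+5.2), both coding benchmarks.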

My 300-prompt evaluation

| Category | Opus 4.5 | Opus 4 | Delta |
|----------|----------|--------|-------|
| Coding (Python) | 96% | 94% | +2 |
| Coding (general) | 92% | 90% | +2 |
| Analysis/reasoning | 90% | 88% | +2 |
| Creative writing | 90% | 88% | +2 |
| Factual Q&A | 88% | 86% | +2 |
| Instruction following | 94% | 90% | +4 |

A consistent +2 in five of the six categories, and +4 on instruction following. Not a massive leap on any single dimension, but the gains add up across categories and push the overall score meaningfully higher.
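The post doesn't say how the overall score is aggregated, so the sketch below assumes a plain unweighted mean over the six categories. That weighting is an assumption for illustration only. A prompt-count-weighted average would give different numbers.

```python
from statistics import mean

# Category scores (percent) from the 300-prompt evaluation table:
# category -> (Opus 4.5, Opus 4)
results = {
    "Coding (Python)": (96, 94),
    "Coding (general)": (92, 90),
    "Analysis/reasoning": (90, 88),
    "Creative writing": (90, 88),
    "Factual Q&A": (88, 86),
    "Instruction following": (94, 90),
}

# Assumption: every category counts equally toward the overall score.
opus_45_overall = mean(new for new, _ in results.values())
opus_4_overall = mean(old for _, old in results.values())
overall_delta = opus_45_overall - opus_4_overall

print(f"Opus 4.5 overall: {opus_45_overall:.1f}%")
print(f"Opus 4 overall:   {opus_4_overall:.1f}%")
print(f"Delta:            +{overall_delta:.1f} points")
```

Under that equal-weight assumption the overall score moves from about 89.3% to about 91.7%, a gain of roughly 2.3 points.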

Extended thinking improvements

| Metric | Opus 4.5 | Opus 4 |
|--------|----------|--------|
| Avg thinking tokens per hard problem | 4,200 | 4,800 |
| Circular reasoning rate | 1.8% | 3.0% |
| Thinking-to-answer quality ratio | Higher | Good |

Opus 4.5 thinks more efficiently. Fewer tokens, less circular reasoning, better quality chains. The 1.8% circular reasoning rate is the lowest I've measured on any reasoning model.
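To put the efficiency claim in relative terms, here is a small sketch that converts the two numeric rows of the table into percentage changes. The helper name `relative_change` is hypothetical.

```python
def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old

# Values from the extended thinking table (Opus 4.5 vs Opus 4).
token_change = relative_change(4200, 4800)   # avg thinking tokens per hard problem
circular_change = relative_change(1.8, 3.0)  # circular reasoning rate, in percent

print(f"Thinking tokens:         {token_change:+.1%}")
print(f"Circular reasoning rate: {circular_change:+.1%}")
```

That works out to 12.5% fewer thinking tokens per hard problem and a 40% relative drop in the circular reasoning rate.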

The Sonnet-Opus gap

| Benchmark | Opus 4.5 | Claude 4 Sonnet | Gap |
|-----------|----------|-----------------|-----|
| HumanEval | 98.2% | 94.8% | +3.4 |
| SWE-bench Verified | 64.2% | 50.1% | +14.1 |
| GPQA Diamond | 79.8% | 68.4% | +11.4 |
| MATH (500) | 98.4% | 88.2% | +10.2 |
| Chatbot Arena Elo | 1298 | 1275 | +23 |

The gap between Opus and Sonnet has widened. On SWE-bench Verified, it's now 14 points. On GPQA, 11 points. On MATH, 10 points.

Anthropic's strategy of maintaining a premium tier is justified by the data. Opus 4.5 isn't "slightly better Sonnet." It's a meaningfully superior model, especially on hard tasks.

Pricing

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| Claude Opus 4.5 | $15.00 | $75.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude 4 Sonnet | $2.50 | $12.50 |

Source: Anthropic pricing page.

Same price as Opus 4. That's a quality upgrade at the same cost. For existing Opus users, this is a straightforward upgrade with no budget impact.

For Sonnet users, the question is whether the 14-point SWE-bench advantage justifies the 6x price difference. For coding-heavy workloads, the answer is increasingly yes.
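To make the 6x figure concrete, here is a minimal cost sketch using the prices from the table. The workload size (2M input tokens, 0.5M output tokens) is a made-up example, and `workload_cost` is a hypothetical helper, not part of any Anthropic SDK.

```python
# Prices in USD per 1M tokens, from the pricing table: (input, output).
PRICES = {
    "Claude Opus 4.5": (15.00, 75.00),
    "Claude 4 Sonnet": (2.50, 12.50),
}

def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens / 1_000_000) * input_price + (
        output_tokens / 1_000_000
    ) * output_price

# Hypothetical workload: 2M input tokens, 0.5M output tokens.
opus_cost = workload_cost("Claude Opus 4.5", 2_000_000, 500_000)
sonnet_cost = workload_cost("Claude 4 Sonnet", 2_000_000, 500_000)

print(f"Opus 4.5: ${opus_cost:.2f}")
print(f"Sonnet:   ${sonnet_cost:.2f}")
print(f"Ratio:    {opus_cost / sonnet_cost:.1f}x")
```

Because both the input and output prices differ by the same factor, the ratio stays at 6x whatever the mix of input and output tokens.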

My early assessment

Best model I've tested. The SWE-bench Verified and LiveCodeBench scores are the clearest signal: on real-world coding tasks, nothing else comes close right now.

The Arena Elo of 1298 puts it 16 points ahead of Gemini 2.5 Pro (1282) and 36 points ahead of o3 (1262). Those are meaningful gaps.
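One way to read those gaps is to convert them into expected head-to-head win rates. The sketch below assumes the standard Elo logistic formula with a 400-point scale applies to these ratings. Treat it as a rough guide, since the leaderboard's exact rating procedure may differ.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Ratings from the benchmark comparison table.
opus_45 = 1298
gemini_25_pro = 1282
o3 = 1262

vs_gemini = expected_win_rate(opus_45, gemini_25_pro)
vs_o3 = expected_win_rate(opus_45, o3)

print(f"Opus 4.5 vs Gemini 2.5 Pro: {vs_gemini:.1%}")
print(f"Opus 4.5 vs o3:             {vs_o3:.1%}")
```

Under that assumption, a 16-point lead corresponds to an expected win rate of roughly 52%, and a 36-point lead to roughly 55%.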

I'll do a full deep-dive comparison in a few weeks. For now, my spreadsheet has a new #1 in every coding column.



-- dataku
