Claude Opus 4.5: Anthropic's latest flagship, benchmarked

Anthropic just released Claude Opus 4.5. I dropped everything and ran my standard evaluation.

The coding numbers are the highest I've ever recorded. Let me show you.

Benchmark comparison

| Benchmark | Claude Opus 4.5 | Claude Opus 4 | o3 | Gemini 2.5 Pro | DeepSeek R2 |
|-----------|-----------------|---------------|-----|----------------|-------------|
| MMLU | 92.4% | 91.2% | 91.4% | 89.0% | 90.8% |
| HumanEval | 98.2% | 97.1% | 94.8% | 95.1% | 95.8% |
| SWE-bench Verified | 64.2% | 58.7% | 51.8% | 41.3% | 54.6% |
| GPQA Diamond | 79.8% | 76.3% | 75.1% | 74.1% | 78.4% |
| MATH (500) | 98.4% | 96.8% | 97.0% | 97.1% | 98.1% |
| LiveCodeBench | 78.6% | 73.4% | 68.4% | 68.2% | 72.4% |
| IFEval | 92.3% | 90.1% | 88.2% | 86.2% | 87.4% |
| Chatbot Arena Elo | 1298 | 1288 | 1262 | 1282 | 1268 |

Sources: Anthropic Claude Opus 4.5 announcement, LMSYS Chatbot Arena, OpenAI, Google DeepMind, DeepSeek.

98.2% on HumanEval. 64.2% on SWE-bench Verified. 78.6% on LiveCodeBench. Each is the highest score in the comparison table, and the highest I've recorded on that benchmark.

SWE-bench Verified jumped from 58.7% (Opus 4) to 64.2%, a 5.5-point gain on real-world bug fixing. That's the largest single-generation improvement on this benchmark that I've tracked from any provider.
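To make the arithmetic behind those gains easy to check, here is a minimal sketch that recomputes them from the three coding rows of the table. The dictionary layout and the helper names (`generation_gain`, `lead_over_best_other`) are my own for illustration; only the scores come from the table.

```python
# Scores copied from the three coding rows of the comparison table (percent).
SCORES = {
    "HumanEval": {"Claude Opus 4.5": 98.2, "Claude Opus 4": 97.1, "o3": 94.8,
                  "Gemini 2.5 Pro": 95.1, "DeepSeek R2": 95.8},
    "SWE-bench Verified": {"Claude Opus 4.5": 64.2, "Claude Opus 4": 58.7, "o3": 51.8,
                           "Gemini 2.5 Pro": 41.3, "DeepSeek R2": 54.6},
    "LiveCodeBench": {"Claude Opus 4.5": 78.6, "Claude Opus 4": 73.4, "o3": 68.4,
                      "Gemini 2.5 Pro": 68.2, "DeepSeek R2": 72.4},
}
NEW, OLD = "Claude Opus 4.5", "Claude Opus 4"


def generation_gain(row):
    """Percentage-point gain of the new model over its predecessor."""
    return round(row[NEW] - row[OLD], 1)


def lead_over_best_other(row):
    """Name of the best-scoring other model and the new model's lead over it."""
    others = {model: score for model, score in row.items() if model != NEW}
    best = max(others, key=others.get)
    return best, round(row[NEW] - others[best], 1)


for name, row in SCORES.items():
    rival, lead = lead_over_best_other(row)
    print(f"{name}: {generation_gain(row):+.1f} vs {OLD}, "
          f"{lead:+.1f} vs best other model ({rival})")
```

On all three rows the closest other model is Opus 4 itself, so the generation gain and the lead over the field are the same number.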

My 300-prompt evaluation

| Category | Opus 4.5 | Opus 4 | Delta |
|----------|----------|--------|-------|
| Coding (Python) | 96% | 94% | +2 |
| Coding (general) | 92% | 90% | +2 |
| Analysis/reasoning | 90% | 88% | +2 |
| Creative writing | 90% | 88% | +2 |
| Factual Q&A | 88% | 86% | +2 |
| Instruction following | 94% | 90% | +4 |

A consistent +2 in five of the six categories, and +4 on instruction following. Not a massive leap on any single dimension, but the accumulation across categories pushes the overall score meaningfully higher.
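As a rough illustration of how the per-category deltas add up, the sketch below averages the six category scores with equal weights. The equal weighting is an assumption: it is only one plausible way to form an overall score, and it ignores how many of the 300 prompts fall into each category.

```python
# (Opus 4.5, Opus 4) scores per category, in percent, from the table above.
RESULTS = {
    "Coding (Python)": (96, 94),
    "Coding (general)": (92, 90),
    "Analysis/reasoning": (90, 88),
    "Creative writing": (90, 88),
    "Factual Q&A": (88, 86),
    "Instruction following": (94, 90),
}


def overall(model_index):
    """Unweighted mean across categories (assumed weighting, see lead-in)."""
    return sum(pair[model_index] for pair in RESULTS.values()) / len(RESULTS)


opus_45 = overall(0)
opus_4 = overall(1)
print(f"Opus 4.5: {opus_45:.1f}  Opus 4: {opus_4:.1f}  delta: {opus_45 - opus_4:+.1f}")
```

Under that assumption the overall score moves from about 89.3 to about 91.7, a gain of roughly 2.3 points.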

Extended thinking improvements

| Metric | Opus 4.5 | Opus 4 |
|--------|----------|--------|
| Avg thinking tokens per hard problem | 4,200 | 4,800 |
| Circular reasoning rate | 1.8% | 3.0% |
| Thinking-to-answer quality ratio | Higher | Good |

Opus 4.5 thinks more efficiently: about 12% fewer thinking tokens per hard problem (4,200 vs. 4,800), less circular reasoning (1.8% vs. 3.0%), and better quality chains. The 1.8% circular reasoning rate is the lowest I've measured on any reasoning model.

The Sonnet-Opus gap

| Benchmark | Opus 4.5 | Claude 4 Sonnet | Gap |
|-----------|----------|-----------------|-----|
| HumanEval | 98.2% | 94.8% | +3.4 |
| SWE-bench Verified | 64.2% | 50.1% | +14.1 |
| GPQA Diamond | 79.8% | 68.4% | +11.4 |
| MATH (500) | 98.4% | 88.2% | +10.2 |
| Chatbot Arena Elo | 1298 | 1275 | +23 |

The gap between Opus and Sonnet has widened. On SWE-bench Verified, it's now 14 points. On GPQA, 11 points. On MATH, 10 points.

Anthropic's strategy of maintaining a premium tier is justified by the data. Opus 4.5 isn't "slightly better Sonnet." It's a meaningfully superior model, especially on hard tasks.

Pricing

| Model | Input / M tokens | Output / M tokens |
|-------|------------------|-------------------|
| Claude Opus 4.5 | $15.00 | $75.00 |
| Claude Opus 4 | $15.00 | $75.00 |
| Claude 4 Sonnet | $2.50 | $12.50 |

Sources: Anthropic pricing page.

Same price as Opus 4. That's a quality upgrade at the same cost. For existing Opus users, this is a straightforward upgrade with no budget impact.

For Sonnet users, the question is whether the 14-point SWE-bench advantage justifies the 6x price difference. For coding-heavy workloads, the answer is increasingly yes.
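To put the 6x figure in concrete terms, here is a small cost sketch using the prices from the table. The workload (2M input tokens and 0.5M output tokens) is a made-up example, not a measurement from my evaluation, and the function name `workload_cost` is mine.

```python
# Prices in USD per million tokens, from the pricing table above.
PRICES = {
    "Claude Opus 4.5": {"input": 15.00, "output": 75.00},
    "Claude 4 Sonnet": {"input": 2.50, "output": 12.50},
}


def workload_cost(model, input_millions, output_millions):
    """Cost in USD for a workload measured in millions of tokens."""
    price = PRICES[model]
    return input_millions * price["input"] + output_millions * price["output"]


# Hypothetical workload: 2M input tokens, 0.5M output tokens.
opus_cost = workload_cost("Claude Opus 4.5", 2.0, 0.5)
sonnet_cost = workload_cost("Claude 4 Sonnet", 2.0, 0.5)
print(f"Opus 4.5: ${opus_cost:.2f}  Sonnet: ${sonnet_cost:.2f}  "
      f"ratio: {opus_cost / sonnet_cost:.1f}x")
```

Because both the input and output prices differ by the same factor, the ratio stays at 6x whatever the input/output mix of the workload.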

My early assessment

Best model I've tested. The SWE-bench Verified and LiveCodeBench scores are the clearest signal: on real-world coding tasks, nothing else comes close right now.

The Arena Elo of 1298 puts it 16 points ahead of Gemini 2.5 Pro (1282) and 36 points ahead of o3 (1262). Those are meaningful gaps.
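One way to read those gaps is to convert them into expected head-to-head preference rates. The sketch below assumes the standard Elo logistic curve with a 400-point scale; whether the Arena scores map onto it exactly is an assumption on my part.

```python
def expected_win_rate(rating_gap):
    """Expected score of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


ARENA = {"Claude Opus 4.5": 1298, "Gemini 2.5 Pro": 1282, "o3": 1262}

for rival in ("Gemini 2.5 Pro", "o3"):
    gap = ARENA["Claude Opus 4.5"] - ARENA[rival]
    print(f"vs {rival}: +{gap} Elo, expected preference rate "
          f"{expected_win_rate(gap):.1%}")
```

Under that assumption, a 16-point lead corresponds to being preferred in roughly 52% of head-to-head comparisons, and a 36-point lead to roughly 55%.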

I'll do a full deep-dive comparison in a few weeks. For now, my spreadsheet has a new #1 in every coding column.

-- dataku
