Claude Opus 4 is here. My first benchmark impressions.
Anthropic's new flagship model. Extended thinking, tool use, and code generation all feel meaningfully better. I ran my standard 300-prompt evaluation. Early data: it's the best model I've tested on coding tasks. Full analysis next week.
Anthropic just released Claude Opus 4. Their new flagship. Extended thinking built in, improved tool use, and what they call "the most capable model we've ever built."
I ran my standard 300-prompt evaluation over the weekend. These are first impressions, not a definitive review. But the early numbers are worth sharing.
Quick benchmark comparison
| Benchmark | Claude Opus 4 | Claude 3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|-----------|--------------|-------------------|--------|---------------|
| MMLU | 91.2% | 89.4% | 88.7% | 89.0% |
| HumanEval | 97.1% | 95.2% | 90.2% | 95.1% |
| SWE-bench Verified | 58.7% | 52.4% | 33.2% | 41.3% |
| GPQA Diamond | 76.3% | 62.1% | 53.6% | 74.1% |
| MATH (500) | 96.8% | 89.4% (thinking) | 76.6% | 95.2% |
| LiveCodeBench | 73.4% | 70.1% (thinking) | 55.8% | 68.2% |
| IFEval | 90.1% | 88.3% | 85.4% | 86.2% |
| Chatbot Arena Elo | 1288 | 1278 | 1268 | 1282 |
Sources: Anthropic Claude Opus 4 announcement, LMSYS Chatbot Arena, OpenAI, Google DeepMind.
The coding numbers jump out immediately. HumanEval at 97.1% is the highest I've recorded from any model. SWE-bench Verified at 58.7% is a 6.3-point jump from Claude 3.7 Sonnet and about 1.8x GPT-4o's score.
GPQA Diamond at 76.3% is also a new high. For graduate-level science questions, Opus 4 just set the record.
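For anyone who wants to check the gaps quoted above, here is a minimal sketch that recomputes point differences and ratios from the table. The `scores` dict and the `gap` helper are illustrative names of my own, and the numbers are simply copied from the table rows.

```python
# Scores (percent) copied from the benchmark table; only three rows shown.
scores = {
    "HumanEval": {
        "Claude Opus 4": 97.1, "Claude 3.7 Sonnet": 95.2,
        "GPT-4o": 90.2, "Gemini 2.5 Pro": 95.1,
    },
    "SWE-bench Verified": {
        "Claude Opus 4": 58.7, "Claude 3.7 Sonnet": 52.4,
        "GPT-4o": 33.2, "Gemini 2.5 Pro": 41.3,
    },
    "GPQA Diamond": {
        "Claude Opus 4": 76.3, "Claude 3.7 Sonnet": 62.1,
        "GPT-4o": 53.6, "Gemini 2.5 Pro": 74.1,
    },
}


def gap(benchmark, model, baseline):
    """Return (point difference, ratio) of `model` over `baseline`."""
    a = scores[benchmark][model]
    b = scores[benchmark][baseline]
    return round(a - b, 1), round(a / b, 2)


for baseline in ("Claude 3.7 Sonnet", "GPT-4o", "Gemini 2.5 Pro"):
    points, ratio = gap("SWE-bench Verified", "Claude Opus 4", baseline)
    print(f"SWE-bench Verified vs {baseline}: +{points} pts, {ratio}x")
```

Point differences and ratios tell different stories near the top of a scale: the 1.9-point HumanEval gap over Claude 3.7 Sonnet is small in absolute terms, while the SWE-bench gap over GPT-4o is large on both measures.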
My 300-prompt evaluation
| Category (50 prompts each) | Claude Opus 4 | Claude 3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|----------------------------|--------------|-------------------|--------|---------------|
| Coding (Python) | 94% | 88% | 82% | 86% |
| Coding (general) | 90% | 82% | 80% | 84% |
| Analysis/reasoning | 88% | 84% | 78% | 86% |
| Creative writing | 88% | 86% | 76% | 78% |
| Factual Q&A | 86% | 84% | 88% | 86% |
| Instruction following | 90% | 86% | 82% | 84% |
Opus 4 leads in every category except factual Q&A (where GPT-4o still edges ahead by 2 points).
Coding at 94% (Python) is wild. I'm running out of test cases that trip it up. I need harder problems.
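With 50 prompts per category, every prompt is worth 2 percentage points, so it helps to look at raw counts too. The sketch below converts the table back into counts and finds the leader per category. It assumes each percentage is the share of prompts scored as a pass; the names `RESULTS`, `passed`, `leader`, and `overall` are illustrative only.

```python
PROMPTS_PER_CATEGORY = 50
MODELS = ["Claude Opus 4", "Claude 3.7 Sonnet", "GPT-4o", "Gemini 2.5 Pro"]

# Percentages copied from the table, in the same column order as MODELS.
RESULTS = {
    "Coding (Python)": [94, 88, 82, 86],
    "Coding (general)": [90, 82, 80, 84],
    "Analysis/reasoning": [88, 84, 78, 86],
    "Creative writing": [88, 86, 76, 78],
    "Factual Q&A": [86, 84, 88, 86],
    "Instruction following": [90, 86, 82, 84],
}


def passed(percent):
    """Convert a category percentage into a prompt count (assumes pass/fail scoring)."""
    return round(percent * PROMPTS_PER_CATEGORY / 100)


def leader(category):
    """Return the model with the highest score in a category."""
    row = RESULTS[category]
    return MODELS[row.index(max(row))]


def overall(model):
    """Return (prompts passed, percent) across all six categories."""
    i = MODELS.index(model)
    total = sum(passed(row[i]) for row in RESULTS.values())
    return total, round(100 * total / (PROMPTS_PER_CATEGORY * len(RESULTS)), 1)


for category in RESULTS:
    print(f"{category}: {leader(category)}")
```

Seen this way, the 2-point factual Q&A gap is a single prompt (44 vs. 43 out of 50), and 94% on Python coding means 47 of 50.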
The extended thinking improvement
Opus 4 has extended thinking built in (like Claude 3.7 Sonnet), but it feels more... deliberate. The thinking is more structured and less prone to going in circles.
| Metric | Opus 4 (thinking) | Claude 3.7 Sonnet (thinking) |
|--------|--------------------|------------------------------|
| Avg thinking tokens | 4,800 | 5,800 |
| Thinking-to-answer ratio | 2.1:1 | 2.8:1 |
| Circular reasoning rate | 3% | 8% |
Opus 4 uses about 17% fewer thinking tokens (4,800 vs. 5,800 on average) and wanders less on the way to the answer. The circular reasoning rate (how often the model repeats the same reasoning step) dropped from 8% to 3%.
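The arithmetic behind those thinking metrics can be sketched as follows. One assumption to flag loudly: the code treats the thinking-to-answer ratio as average thinking tokens divided by average answer tokens, which the table does not spell out. The names `THINKING`, `relative_drop`, and `implied_answer_tokens` are illustrative.

```python
# Values copied from the thinking table; circular rate is in percent.
THINKING = {
    "Opus 4": {"avg_thinking_tokens": 4800, "ratio": 2.1, "circular_pct": 3},
    "Claude 3.7 Sonnet": {"avg_thinking_tokens": 5800, "ratio": 2.8, "circular_pct": 8},
}


def relative_drop(old, new):
    """Percent reduction going from `old` to `new`."""
    return round(100 * (old - new) / old, 1)


def implied_answer_tokens(model):
    """Answer length implied by the ratio (ASSUMES ratio = thinking / answer tokens)."""
    m = THINKING[model]
    return round(m["avg_thinking_tokens"] / m["ratio"])


sonnet, opus = THINKING["Claude 3.7 Sonnet"], THINKING["Opus 4"]
print(relative_drop(sonnet["avg_thinking_tokens"], opus["avg_thinking_tokens"]))
print(relative_drop(sonnet["circular_pct"], opus["circular_pct"]))
print(implied_answer_tokens("Opus 4"), implied_answer_tokens("Claude 3.7 Sonnet"))
```

Under that reading of the ratio, the lower thinking-to-answer figure comes from both shorter thinking and slightly longer answers, so the two rows of the table are not independent.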
Tool use quality
I tested tool use across 20 tasks:
| Metric | Opus 4 | Claude 3.7 Sonnet |
|--------|--------|--------------------|
| Correct tool selection | 94% | 87% |
| First-try task completion | 72% | 61% |
| Total loops needed (avg) | 2.3 | 3.1 |
| Tool call errors | 4% | 9% |
Tool selection accuracy up from 87% to 94%. First-try completion from 61% to 72%. Fewer loops, fewer errors.
For agent-based workflows, this is significant. Fewer loops means fewer tokens means lower cost per task.
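As a rough illustration of the loops-to-tokens link, the sketch below scales a per-loop token count by the average loop counts from the table. The 4,000 tokens per loop is a made-up placeholder, not something I measured, and the code assumes every loop costs about the same number of tokens for both models.

```python
# Average loops per task, copied from the tool use table.
AVG_LOOPS = {"Opus 4": 2.3, "Claude 3.7 Sonnet": 3.1}


def tokens_per_task(model, tokens_per_loop):
    """Estimated tokens per task, ASSUMING a constant token count per loop."""
    return round(AVG_LOOPS[model] * tokens_per_loop)


def loop_saving_percent():
    """Percent fewer loops for Opus 4 relative to Claude 3.7 Sonnet."""
    return round(100 * (1 - AVG_LOOPS["Opus 4"] / AVG_LOOPS["Claude 3.7 Sonnet"]), 1)


HYPOTHETICAL_TOKENS_PER_LOOP = 4000  # placeholder value for illustration only
for model in AVG_LOOPS:
    print(model, tokens_per_task(model, HYPOTHETICAL_TOKENS_PER_LOOP))
print(f"{loop_saving_percent()}% fewer loops")
```

The saving is in tokens, so it translates directly into dollars only when comparing at the same per-token price.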
Pricing
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|---------|----------|
| Claude Opus 4 | $15.00 | $75.00 |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
| GPT-4o | $2.50 | $10.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Sources: Anthropic, OpenAI, Google.
At $15/$75, Opus 4 costs 5x as much as Claude 3.7 Sonnet on both input and output. This is the same ratio as Claude 3 Opus to Claude 3.5 Sonnet. Anthropic's pricing strategy hasn't changed: a premium flagship and a cost-effective workhorse.
For most production applications, Claude 3.7 Sonnet remains the better value. But for complex coding, research analysis, and tasks where the 6-point SWE-bench advantage matters, Opus 4 is the new ceiling.
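To put the value question in dollars, here is a minimal cost sketch using the prices from the table. The request size (10,000 input tokens, 2,000 output tokens) is a hypothetical example, and `PRICES` and `request_cost` are illustrative names rather than anything from a vendor SDK.

```python
# USD per million tokens (input, output), copied from the pricing table.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at list prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request size, for illustration only.
INPUT_TOKENS, OUTPUT_TOKENS = 10_000, 2_000
for model in PRICES:
    print(f"{model}: ${request_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):.4f}")
```

Because both Opus 4 prices are exactly 5x the Sonnet prices, the 5x ratio holds for any mix of input and output tokens. Opus 4 only closes the gap per task if it needs far fewer tokens to finish the job.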
First impressions summary
| Category | Verdict |
|----------|---------|
| Coding | Best I've tested. Period. |
| Reasoning | Matches or beats o1/R1 with better efficiency |
| Creative writing | Slight improvement over 3.7 Sonnet |
| Tool use | Significant improvement, fewer loops |
| Cost | Premium pricing, justified for complex tasks |
| Overall | New #1 on Chatbot Arena, reclaims the throne from Gemini |
Anthropic just took the Arena #1 back from Google. After 3 weeks.
I expected Opus 4 to be good. I didn't expect 97.1% on HumanEval. That number needs more verification, and I'll do a deeper analysis next week. But if it holds up, this is the most capable model released in 2025 so far.
Full deep-dive comparison coming soon. Right now I need to update approximately 47 cells in my spreadsheet.
-- dataku