Claude Opus 4 is here. My first benchmark impressions.

Anthropic just released Claude Opus 4. Their new flagship. Extended thinking built in, improved tool use, and what they call "the most capable model we've ever built."

I ran my standard 300-prompt evaluation over the weekend. These are first impressions, not a definitive review. But the early numbers are worth sharing.

Quick benchmark comparison

| Benchmark | Claude Opus 4 | Claude 3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|-----------|--------------|-------------------|--------|---------------|
| MMLU | 91.2% | 89.4% | 88.7% | 89.0% |
| HumanEval | 97.1% | 95.2% | 90.2% | 95.1% |
| SWE-bench Verified | 58.7% | 52.4% | 33.2% | 41.3% |
| GPQA Diamond | 76.3% | 62.1% | 53.6% | 74.1% |
| MATH (500) | 96.8% | 89.4% (thinking) | 76.6% | 95.2% |
| LiveCodeBench | 73.4% | 70.1% (thinking) | 55.8% | 68.2% |
| IFEval | 90.1% | 88.3% | 85.4% | 86.2% |
| Chatbot Arena Elo | 1288 | 1278 | 1268 | 1282 |

Sources: Anthropic Claude Opus 4 announcement, LMSYS Chatbot Arena, OpenAI, Google DeepMind.

The coding numbers jump out immediately. HumanEval at 97.1% is the highest I've recorded from any model. SWE-bench Verified at 58.7% is a 6.3-point jump from Claude 3.7 Sonnet and about 1.8x GPT-4o's score.
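For anyone who wants to check those gaps, here is a minimal sketch that recomputes them from the SWE-bench Verified row of the table. The helper name `gap_vs` is made up for illustration; the scores are copied from the table, nothing is re-measured.

```python
# SWE-bench Verified scores (percent), copied from the comparison table.
swe_bench = {
    "Claude Opus 4": 58.7,
    "Claude 3.7 Sonnet": 52.4,
    "GPT-4o": 33.2,
    "Gemini 2.5 Pro": 41.3,
}


def gap_vs(baseline, scores, model="Claude Opus 4"):
    """Return (gap in percentage points, score ratio) of `model` over `baseline`."""
    points = round(scores[model] - scores[baseline], 1)
    ratio = round(scores[model] / scores[baseline], 2)
    return points, ratio


for name in swe_bench:
    if name != "Claude Opus 4":
        points, ratio = gap_vs(name, swe_bench)
        print(f"vs {name}: +{points} points, {ratio}x")
```

Reporting both the point gap and the ratio is deliberate: a small point gap over a low baseline can look large as a ratio, and the other way around.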

GPQA Diamond at 76.3% is also a new high in my tracking. On graduate-level science questions, Opus 4 tops the table, narrowly ahead of Gemini 2.5 Pro's 74.1%.

My 300-prompt evaluation

| Category (50 prompts each) | Claude Opus 4 | Claude 3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|----------------------------|--------------|-------------------|--------|---------------|
| Coding (Python) | 94% | 88% | 82% | 86% |
| Coding (general) | 90% | 82% | 80% | 84% |
| Analysis/reasoning | 88% | 84% | 78% | 86% |
| Creative writing | 88% | 86% | 76% | 78% |
| Factual Q&A | 86% | 84% | 88% | 86% |
| Instruction following | 90% | 86% | 82% | 84% |

Opus 4 leads in every category except factual Q&A (where GPT-4o still edges ahead by 2 points).
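To roll the six categories into one number per model, a rough sketch follows. It assumes each prompt is scored pass/fail and every category counts equally (50 prompts each), which is a simplification and not necessarily how the scores above were produced. The function names are illustrative.

```python
PROMPTS_PER_CATEGORY = 50

# Pass rates (percent) per category, in table order:
# Python coding, general coding, analysis, creative, factual Q&A, instructions.
pass_rates = {
    "Claude Opus 4": [94, 90, 88, 88, 86, 90],
    "Claude 3.7 Sonnet": [88, 82, 84, 86, 84, 86],
    "GPT-4o": [82, 80, 78, 76, 88, 82],
    "Gemini 2.5 Pro": [86, 84, 86, 78, 86, 84],
}


def passed_counts(rates):
    """Convert percentage pass rates into prompt counts (assumes pass/fail scoring)."""
    return [round(rate * PROMPTS_PER_CATEGORY / 100) for rate in rates]


def overall_pass_rate(rates):
    """Overall pass rate in percent across all prompts, equally weighted."""
    total_prompts = PROMPTS_PER_CATEGORY * len(rates)
    return sum(passed_counts(rates)) / total_prompts * 100


for model, rates in pass_rates.items():
    print(f"{model}: {sum(passed_counts(rates))}/300 ({overall_pass_rate(rates):.1f}%)")
```

With 50 prompts per category, each prompt is worth 2 points, so a 2-point difference in a single category comes down to one prompt.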

Coding at 94% (Python) is wild. I'm running out of test cases that trip it up. I need harder problems.

The extended thinking improvement

Opus 4 has extended thinking built in (like Claude 3.7 Sonnet), but it feels more... deliberate. The thinking is more structured and less prone to going in circles.

| Metric | Opus 4 (thinking) | Claude 3.7 Sonnet (thinking) |
|--------|--------------------|------------------------------|
| Avg thinking tokens | 4,800 | 5,800 |
| Thinking-to-answer ratio | 2.1:1 | 2.8:1 |
| Circular reasoning rate | 3% | 8% |

Opus 4 thinks less and gets to the answer faster: about 1,000 fewer thinking tokens on average. The circular reasoning rate (where the model repeats the same reasoning step) dropped from 8% to 3%.
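Put in relative terms, the three rows of the thinking table work out as in the sketch below. The `relative_change` helper is illustrative, and the inputs are the averages from the table rather than per-prompt data.

```python
def relative_change(old, new):
    """Percent change from `old` to `new` (negative means a reduction)."""
    return (new - old) / old * 100


# (Claude 3.7 Sonnet value, Opus 4 value), copied from the thinking table.
thinking_metrics = {
    "avg thinking tokens": (5800, 4800),
    "thinking-to-answer ratio": (2.8, 2.1),
    "circular reasoning rate (%)": (8, 3),
}

for metric, (sonnet, opus) in thinking_metrics.items():
    print(f"{metric}: {relative_change(sonnet, opus):+.1f}%")
```

The circular reasoning drop is the largest of the three in relative terms, even though it is only 5 points in absolute terms.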

Tool use quality

I tested tool use across 20 tasks:

| Metric | Opus 4 | Claude 3.7 Sonnet |
|--------|--------|--------------------|
| Correct tool selection | 94% | 87% |
| First-try task completion | 72% | 61% |
| Total loops needed (avg) | 2.3 | 3.1 |
| Tool call errors | 4% | 9% |

Tool selection accuracy up from 87% to 94%. First-try completion from 61% to 72%. Fewer loops, fewer errors.

For agent-based workflows, this is significant. Fewer loops means fewer tokens means lower cost per task.
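A back-of-the-envelope sketch of that loops-to-cost chain is below. The per-loop token counts are hypothetical placeholders (they were not measured in this test), and the prices are the list prices from the pricing section of this post. Swap in your own token counts before drawing conclusions.

```python
# List prices in USD per million tokens (input, output), from the pricing table.
PRICES = {
    "Claude Opus 4": (15.00, 75.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
}

# Average loops per task, from the tool use table.
AVG_LOOPS = {"Claude Opus 4": 2.3, "Claude 3.7 Sonnet": 3.1}

# HYPOTHETICAL token usage per loop; not measured in this post.
INPUT_TOKENS_PER_LOOP = 2_000
OUTPUT_TOKENS_PER_LOOP = 500


def cost_per_task(model, loops=None):
    """Estimated USD cost of one task, assuming every loop uses the same token counts."""
    input_price, output_price = PRICES[model]
    n_loops = AVG_LOOPS[model] if loops is None else loops
    cost_per_loop = (
        INPUT_TOKENS_PER_LOOP * input_price + OUTPUT_TOKENS_PER_LOOP * output_price
    ) / 1_000_000
    return n_loops * cost_per_loop


for model in PRICES:
    print(f"{model}: ${cost_per_task(model):.4f} per task")

# Same model, same price: what 2.3 loops saves compared with 3.1 loops.
saving = 1 - cost_per_task("Claude Opus 4", 2.3) / cost_per_task("Claude Opus 4", 3.1)
print(f"Saving from fewer loops at a fixed price: {saving:.0%}")
```

Under these assumptions cost scales linearly with loop count, so going from 3.1 to 2.3 loops trims roughly a quarter off a task at a fixed per-token price. The per-token price is a separate factor and has to be weighed on its own.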

Pricing

| Model | Input/M | Output/M |
|-------|---------|----------|
| Claude Opus 4 | $15.00 | $75.00 |
| Claude 3.7 Sonnet | $3.00 | $15.00 |
| GPT-4o | $2.50 | $10.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |

Sources: Anthropic, OpenAI, Google.

At $15/$75, Opus 4 costs 5x what Claude 3.7 Sonnet does, on both input and output. This is the same ratio as Claude 3 Opus to Claude 3.5 Sonnet. Anthropic's pricing strategy hasn't changed: a premium flagship and a cost-effective workhorse.

For most production applications, Claude 3.7 Sonnet remains the better value. But for complex coding, research analysis, and tasks where the 6-point SWE-bench advantage matters, Opus 4 is the new ceiling.

First impressions summary

| Category | Verdict |
|----------|---------|
| Coding | Best I've tested. Period. |
| Reasoning | Matches or beats o1/R1 with better efficiency |
| Creative writing | Slight improvement over 3.7 Sonnet |
| Tool use | Significant improvement, fewer loops |
| Cost | Premium pricing, justified for complex tasks |
| Overall | New #1 on Chatbot Arena, reclaims the throne from Gemini |

Anthropic just took the Arena #1 back from Google. After 3 weeks.

I expected Opus 4 to be good. I didn't expect 97.1% on HumanEval. That number needs more verification, and I'll do a deeper analysis next week. But if it holds up, this is the most capable model released in 2025 so far.

Full deep-dive comparison coming soon. Right now I need to update approximately 47 cells in my spreadsheet.

-- dataku