Gemini 2.5 Ultra: Google's best model vs the field
Google finally released Ultra-tier Gemini 2.5. I compared it against Claude Opus 4.5, GPT-4o, and DeepSeek R2 across 300 prompts. Gemini Ultra wins on multimodal tasks and long context. Claude wins on coding. The frontier is genuinely multi-polar now.
I've been waiting for this one.
The four-way comparison
| Benchmark | Gemini 2.5 Ultra | Claude Opus 4.5 | GPT-4o (Feb) | DeepSeek R2 |
|-----------|-----------------|-----------------|-------------|-------------|
| MMLU | 92.1% | 92.4% | 89.2% | 90.8% |
| HumanEval | 96.4% | 98.2% | 91.8% | 95.8% |
| SWE-bench Verified | 52.8% | 64.2% | 36.1% | 54.6% |
| GPQA Diamond | 80.2% | 79.8% | 56.4% | 78.4% |
| MATH (500) | 98.2% | 98.4% | 78.1% | 98.1% |
| LiveCodeBench | 71.2% | 78.6% | 58.4% | 72.4% |
| Chatbot Arena (Elo) | 1294 | 1298 | 1272 | 1268 |
| Vision (ChartQA) | 94.2% | 89.4% | 83.8% | N/A |
| Long context (1M retrieval) | 91.4% | 78.2% | N/A | N/A |
Sources: Google DeepMind Gemini 2.5 Ultra announcement, Anthropic, OpenAI, DeepSeek, LMSYS Chatbot Arena.
Where Gemini Ultra wins
Multimodal/vision (94.2% on ChartQA): The best chart and image understanding I've ever tested. It reads data from images with near-human accuracy. Claude at 89.4% is good, but Gemini leads by 4.8 points.
Long context (91.4% retrieval at 1M tokens): Google's 2M-token context window isn't just big. It's accurate: 91.4% needle-in-a-haystack retrieval at 1 million tokens. Claude drops to 78.2% at equivalent lengths. GPT-4o and DeepSeek R2 have 128K context windows, so neither can run this test at all (hence the N/A entries).
GPQA Diamond (80.2%): First model to break 80% on graduate-level science questions. Claude Opus 4.5 sits 0.4 points behind at 79.8%, so this is effectively a tie that Gemini just edges.
Where Claude Opus 4.5 still wins
Coding: HumanEval 98.2% vs 96.4%. SWE-bench Verified 64.2% vs 52.8%. LiveCodeBench 78.6% vs 71.2%. The gap is small on HumanEval (1.8 points) but grows to 7.4 points on LiveCodeBench and 11.4 points on SWE-bench Verified. For developers, Claude remains the clear choice.
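The per-benchmark coding gaps can be checked directly from the benchmark table. This is a minimal sketch; the dictionary layout and the `coding_gaps` helper are my own illustration, and the scores are simply copied from the table.

```python
# Scores (percent) copied from the benchmark table; structure is illustrative.
coding_scores = {
    "HumanEval": {"claude": 98.2, "gemini": 96.4},
    "SWE-bench Verified": {"claude": 64.2, "gemini": 52.8},
    "LiveCodeBench": {"claude": 78.6, "gemini": 71.2},
}


def coding_gaps(scores):
    """Return the Claude-minus-Gemini gap, in percentage points, per benchmark."""
    return {
        name: round(pair["claude"] - pair["gemini"], 1)
        for name, pair in scores.items()
    }


gaps = coding_gaps(coding_scores)
for name, gap in gaps.items():
    print(f"{name}: Claude leads by {gap} points")
```

The spread matters more than any single number: the lead is under 2 points on HumanEval and over 11 on SWE-bench Verified.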
Overall Arena Elo (1298 vs 1294): A tiny gap (4 points, essentially a statistical tie). But Claude has held the #1 spot consistently, suggesting a slight overall preference from human voters.
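To put a 4-point gap in perspective, here is a rough sketch using the standard Elo expected-score formula. It assumes the Arena ratings behave like classic Elo on a 400-point logistic scale, which is my assumption rather than something stated in the leaderboard data above; the function name is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


claude_elo, gemini_elo = 1298, 1294  # ratings from the benchmark table
p_claude = elo_expected_score(claude_elo, gemini_elo)
print(f"Expected Claude win rate vs Gemini: {p_claude:.3f}")  # about 0.506
```

Under that assumption, Claude would be preferred in roughly 50.6% of head-to-head votes, which is why a 4-point lead reads as a statistical tie.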
Instruction following: In my testing, Claude follows complex, multi-step instructions more precisely.
My 300-prompt evaluation
| Category | Gemini Ultra | Claude Opus 4.5 | GPT-4o | DeepSeek R2 |
|----------|------------|-----------------|--------|-------------|
| Coding | 86% | 94% | 82% | 86% |
| Vision/multimodal | 92% | 86% | 82% | N/A |
| Long document analysis | 90% | 80% | 72% | 74% |
| Analysis/reasoning | 88% | 90% | 82% | 84% |
| Creative writing | 80% | 90% | 78% | 72% |
| Factual Q&A | 90% | 88% | 90% | 86% |

Gemini wins on vision and long documents, and ties GPT-4o on factual Q&A at 90%. Claude wins on coding, creative writing, and analysis/reasoning. GPT-4o wins on... nothing outright in this comparison; its only first place is that shared factual Q&A result.
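Picking category leaders from the 300-prompt table is easy to get slightly wrong when scores tie, so here is a small sketch that reports every model sharing the top score. The data is copied from the table; the `category_leaders` helper and the dictionary layout are illustrative, and N/A entries are simply left out.

```python
# Percent scores from the 300-prompt table; N/A entries are omitted.
results = {
    "Coding": {"Gemini Ultra": 86, "Claude Opus 4.5": 94, "GPT-4o": 82, "DeepSeek R2": 86},
    "Vision/multimodal": {"Gemini Ultra": 92, "Claude Opus 4.5": 86, "GPT-4o": 82},
    "Long document analysis": {"Gemini Ultra": 90, "Claude Opus 4.5": 80, "GPT-4o": 72, "DeepSeek R2": 74},
    "Analysis/reasoning": {"Gemini Ultra": 88, "Claude Opus 4.5": 90, "GPT-4o": 82, "DeepSeek R2": 84},
    "Creative writing": {"Gemini Ultra": 80, "Claude Opus 4.5": 90, "GPT-4o": 78, "DeepSeek R2": 72},
    "Factual Q&A": {"Gemini Ultra": 90, "Claude Opus 4.5": 88, "GPT-4o": 90, "DeepSeek R2": 86},
}


def category_leaders(table):
    """Map each category to the sorted list of models sharing the top score."""
    leaders = {}
    for category, scores in table.items():
        best = max(scores.values())
        leaders[category] = sorted(m for m, s in scores.items() if s == best)
    return leaders


leaders = category_leaders(results)
for category, models in leaders.items():
    print(f"{category}: {', '.join(models)}")
```

Returning a list rather than a single name keeps the factual Q&A tie visible instead of silently awarding it to whichever model comes first.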
Pricing
| Model | Input ($/M tokens) | Output ($/M tokens) | Context window |
|-------|---------|----------|---------------|
| Gemini 2.5 Ultra | $5.00 | $20.00 | 2M |
| Claude Opus 4.5 | $15.00 | $75.00 | 200K |
| GPT-4o | $2.00 | $8.00 | 128K |
| DeepSeek R2 | $0.20 | $0.80 | 128K |
Sources: Google, Anthropic, OpenAI, DeepSeek.
Gemini Ultra at $5/$20 is priced between Sonnet-tier and Opus-tier. It's 3x cheaper than Claude Opus 4.5 per input token and 3.75x cheaper per output token. Given the quality is competitive (4 Arena Elo points apart), the value proposition is strong.
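For a concrete feel of what those prices mean, here is a sketch that costs out one workload across all four models. The prices come from the pricing table; the workload size (2M input tokens, 500K output tokens) and the `job_cost` helper are hypothetical, so plug in your own numbers.

```python
# USD per million tokens as (input, output), copied from the pricing table.
PRICES = {
    "Gemini 2.5 Ultra": (5.00, 20.00),
    "Claude Opus 4.5": (15.00, 75.00),
    "GPT-4o": (2.00, 8.00),
    "DeepSeek R2": (0.20, 0.80),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a job with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens, 500K output tokens.
costs = {model: job_cost(model, 2_000_000, 500_000) for model in PRICES}
for model, cost in costs.items():
    print(f"{model}: ${cost:.2f}")

ratio = costs["Claude Opus 4.5"] / costs["Gemini 2.5 Ultra"]
print(f"Claude Opus 4.5 costs {ratio:.3f}x Gemini 2.5 Ultra on this workload")
```

On this mix the job comes to $20.00 on Gemini and $67.50 on Claude. The blended ratio always lands between the 3x input ratio and the 3.75x output ratio, depending on how output-heavy the job is.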
The multi-polar frontier
| Model | What it's best at |
|-------|------------------|
| Claude Opus 4.5 | Coding, creative writing, instruction following |
| Gemini 2.5 Ultra | Vision, long context, science Q&A |
| DeepSeek R2 | Math reasoning, cost efficiency |
| GPT-4o | Speed, general-purpose at moderate cost |
Four models, four different strengths, four different price points.
The frontier in February 2026 is genuinely multi-polar. No single model dominates across all categories. The "best model" question now requires the follow-up: "best at what?"
My comparison spreadsheet has 4 columns highlighted in green (category leaders) and they're split across 3 providers. That's never happened before.
-- dataku