Model Comparisons · February 24, 2026 · 4 min read

Gemini 2.5 Ultra: Google's best model vs the field

Google finally released Ultra-tier Gemini 2.5. I compared it against Claude Opus 4.5, GPT-4o, and DeepSeek R2 across 300 prompts. Gemini Ultra wins on multimodal tasks and long context. Claude wins on coding. The frontier is genuinely multi-polar now.

Google finally shipped the Ultra tier of Gemini 2.5. I've been waiting for this one.

The four-way comparison

| Benchmark | Gemini 2.5 Ultra | Claude Opus 4.5 | GPT-4o (Feb) | DeepSeek R2 |
|-----------|-----------------|-----------------|-------------|-------------|
| MMLU | 92.1% | 92.4% | 89.2% | 90.8% |
| HumanEval | 96.4% | 98.2% | 91.8% | 95.8% |
| SWE-bench Verified | 52.8% | 64.2% | 36.1% | 54.6% |
| GPQA Diamond | 80.2% | 79.8% | 56.4% | 78.4% |
| MATH (500) | 98.2% | 98.4% | 78.1% | 98.1% |
| LiveCodeBench | 71.2% | 78.6% | 58.4% | 72.4% |
| Chatbot Arena (Elo) | 1294 | 1298 | 1272 | 1268 |
| Vision (ChartQA) | 94.2% | 89.4% | 83.8% | N/A |
| Long context (1M retrieval) | 91.4% | 78.2% | N/A | N/A |

Sources: Google DeepMind Gemini 2.5 Ultra announcement, Anthropic, OpenAI, DeepSeek, LMSYS Chatbot Arena.

Where Gemini Ultra wins

Multimodal/vision (94.2% on ChartQA): The best chart and image understanding I've ever tested. It reads data from images with near-human accuracy. Claude at 89.4% is good, but Gemini's 94.2% is a clear lead.

Long context (91.4% retrieval at 1M tokens): Google's 2M token context window isn't just big. It's accurate. 91.4% needle-in-a-haystack retrieval at 1 million tokens. Claude drops to 78.2% at equivalent lengths. Nobody else even offers this test.

GPQA Diamond (80.2%): First model to break 80% on graduate-level science questions. Claude Opus 4.5 is effectively tied at 79.8%, so Gemini's edge here is only 0.4 points.

Where Claude Opus 4.5 still wins

Coding: HumanEval 98.2% vs 96.4%. SWE-bench Verified 64.2% vs 52.8%. LiveCodeBench 78.6% vs 71.2%. The gap is only 1.8 points on HumanEval, but 7.4 on LiveCodeBench and 11.4 on SWE-bench Verified. For developers, Claude remains the clear choice.
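The per-benchmark coding gaps can be recomputed from the table with a few lines of Python. This is just a sketch of the arithmetic; the dictionary layout and the `coding_gaps` helper are my own naming, and the scores are copied from the four-way table above.

```python
# Scores (percent) copied from the four-way benchmark table.
coding_scores = {
    "HumanEval": {"claude": 98.2, "gemini": 96.4},
    "SWE-bench Verified": {"claude": 64.2, "gemini": 52.8},
    "LiveCodeBench": {"claude": 78.6, "gemini": 71.2},
}


def coding_gaps(scores):
    """Return the Claude-minus-Gemini gap in percentage points per benchmark."""
    return {
        name: round(pair["claude"] - pair["gemini"], 1)
        for name, pair in scores.items()
    }


gaps = coding_gaps(coding_scores)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
```

The spread matters more than any single number: the lead is small on HumanEval and much larger on the two benchmarks built from harder, more realistic tasks.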

Overall Arena Elo (1298 vs 1294): A tiny gap (4 points, essentially a statistical tie). But Claude has held the #1 spot consistently, suggesting a slight overall preference from human voters.
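To put a 4-point Elo gap in perspective, here is a rough sketch that converts a rating difference into an expected head-to-head win rate. It assumes the standard Elo logistic formula on a 400-point scale; the leaderboard's exact fitting method and confidence intervals are not part of my data, so treat this as an approximation.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard Elo logistic formula.

    Assumes base 10 and a 400-point scale (the usual Elo convention).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Arena ratings from the benchmark table: Claude 1298, Gemini 1294.
claude_vs_gemini = elo_expected_score(1298, 1294)
print(f"Expected Claude win rate vs Gemini: {claude_vs_gemini:.3f}")  # 0.506
```

Under that assumption, a 4-point lead means Claude would be preferred in roughly 50.6% of direct matchups, which is why I call it a statistical tie.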

Instruction following: In my testing, Claude follows complex, multi-step instructions more precisely.

My 300-prompt evaluation

| Category | Gemini Ultra | Claude Opus 4.5 | GPT-4o | DeepSeek R2 |
|----------|------------|-----------------|--------|-------------|
| Coding | 86% | 94% | 82% | 86% |
| Vision/multimodal | 92% | 86% | 82% | N/A |
| Long document analysis | 90% | 80% | 72% | 74% |
| Analysis/reasoning | 88% | 90% | 82% | 84% |
| Creative writing | 80% | 90% | 78% | 72% |
| Factual Q&A | 90% | 88% | 90% | 86% |

Gemini wins on vision and long documents. Claude wins on coding, creative writing, and general reasoning. GPT-4o ties Gemini on factual Q&A at 90% but wins nothing outright in this comparison.
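For anyone who wants to check the category leaders, this small sketch picks the top scorer in each row of the table and keeps ties. The `results` dictionary and the `category_leaders` helper are illustrative names of mine; the numbers come straight from the table, with DeepSeek R2 left out of the vision row because it was not tested there.

```python
# Pass rates (percent) from the 300-prompt evaluation table.
results = {
    "Coding": {"Gemini Ultra": 86, "Claude Opus 4.5": 94, "GPT-4o": 82, "DeepSeek R2": 86},
    "Vision/multimodal": {"Gemini Ultra": 92, "Claude Opus 4.5": 86, "GPT-4o": 82},  # DeepSeek R2: N/A
    "Long document analysis": {"Gemini Ultra": 90, "Claude Opus 4.5": 80, "GPT-4o": 72, "DeepSeek R2": 74},
    "Analysis/reasoning": {"Gemini Ultra": 88, "Claude Opus 4.5": 90, "GPT-4o": 82, "DeepSeek R2": 84},
    "Creative writing": {"Gemini Ultra": 80, "Claude Opus 4.5": 90, "GPT-4o": 78, "DeepSeek R2": 72},
    "Factual Q&A": {"Gemini Ultra": 90, "Claude Opus 4.5": 88, "GPT-4o": 90, "DeepSeek R2": 86},
}


def category_leaders(table):
    """Map each category to the model(s) with the highest score, keeping ties."""
    leaders = {}
    for category, scores in table.items():
        best = max(scores.values())
        leaders[category] = sorted(m for m, s in scores.items() if s == best)
    return leaders


leaders = category_leaders(results)
for category, models in leaders.items():
    print(f"{category}: {', '.join(models)}")
```

Keeping ties as a list rather than picking one winner is deliberate: a shared first place on factual Q&A is a different result from an outright win.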

Pricing

| Model | Input ($/M tokens) | Output ($/M tokens) | Context window |
|-------|---------|----------|---------------|
| Gemini 2.5 Ultra | $5.00 | $20.00 | 2M |
| Claude Opus 4.5 | $15.00 | $75.00 | 200K |
| GPT-4o | $2.00 | $8.00 | 128K |
| DeepSeek R2 | $0.20 | $0.80 | 128K |

Sources: Google, Anthropic, OpenAI, DeepSeek.

Gemini Ultra at $5/$20 is priced between Sonnet-tier and Opus-tier. It's 3x cheaper than Claude Opus 4.5 on input tokens and 3.75x cheaper on output tokens. Given the quality is competitive (4 Arena Elo points apart), the value proposition is strong.
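Here is a quick sketch of what the price difference means for an actual bill. The prices come from the table above; the workload (2M input tokens and 500K output tokens) is a made-up example, and `job_cost` is an illustrative helper, not any provider's billing API.

```python
# (input, output) prices in USD per million tokens, from the pricing table.
PRICES = {
    "Gemini 2.5 Ultra": (5.00, 20.00),
    "Claude Opus 4.5": (15.00, 75.00),
    "GPT-4o": (2.00, 8.00),
    "DeepSeek R2": (0.20, 0.80),
}


def job_cost(model, input_tokens, output_tokens):
    """Cost in USD for a job with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Price ratios between Claude Opus 4.5 and Gemini 2.5 Ultra.
input_ratio = PRICES["Claude Opus 4.5"][0] / PRICES["Gemini 2.5 Ultra"][0]
output_ratio = PRICES["Claude Opus 4.5"][1] / PRICES["Gemini 2.5 Ultra"][1]
print(f"Input ratio: {input_ratio}x, output ratio: {output_ratio}x")

# Hypothetical workload: 2M input tokens, 500K output tokens.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 2_000_000, 500_000):.2f}")
```

For that example workload the bill is $20.00 on Gemini 2.5 Ultra versus $67.50 on Claude Opus 4.5. The blended ratio depends on the input/output mix, so output-heavy jobs see a gap closer to 3.75x.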

The multi-polar frontier

| Model | What it's best at |
|-------|------------------|
| Claude Opus 4.5 | Coding, creative writing, instruction following |
| Gemini 2.5 Ultra | Vision, long context, science Q&A |
| DeepSeek R2 | Math reasoning, cost efficiency |
| GPT-4o | Speed, general-purpose at moderate cost |

Four models, four different strengths, four different price points.

The frontier in February 2026 is genuinely multi-polar. No single model dominates across all categories. The "best model" question now requires the follow-up: "best at what?"

My comparison spreadsheet has 4 columns highlighted in green (category leaders) and they're split across 3 providers. That's never happened before.


-- dataku
