Gemini 2.5 Pro and "thinking" models: Google's answer to o1
Google added extended thinking to Gemini. I tested it against o1-preview and DeepSeek R1 on math and coding problems. Gemini 2.5 Pro wins on 4 of 6 benchmarks. Google is back in the reasoning race.
Google just entered the reasoning model race. And they didn't tiptoe in.
Gemini 2.5 Pro comes with an extended thinking mode that, like o1 and DeepSeek R1, uses chain-of-thought reasoning at inference time. The model "thinks" through problems step by step before giving a final answer.
I tested it against the two established reasoning models. The results are more competitive than I expected.
Head-to-head: reasoning benchmarks
| Benchmark | Gemini 2.5 Pro (thinking) | o1-preview | DeepSeek R1 |
|-----------|--------------------------|------------|-------------|
| MATH (500) | 95.2% | 96.4% | 97.3% |
| GPQA Diamond | 74.1% | 73.3% | 71.5% |
| AIME 2024 | 72.0% | 74.4% | 79.8% |
| HumanEval | 95.1% | 93.7% | 92.6% |
| LiveCodeBench | 68.2% | 63.4% | 65.9% |
| Natural2Code | 91.4% | 87.2% | 85.3% |
Sources: Google DeepMind Gemini 2.5 technical report, OpenAI o1 system card, DeepSeek R1 paper.
Gemini 2.5 Pro wins on 4 of 6 benchmarks. Measured against the runner-up in each: GPQA Diamond (74.1% vs o1-preview's 73.3%), HumanEval (95.1% vs o1-preview's 93.7%), LiveCodeBench (68.2% vs DeepSeek R1's 65.9%), and Natural2Code (91.4% vs o1-preview's 87.2%).
DeepSeek R1 still leads on MATH (97.3%) and AIME (79.8%). o1-preview doesn't lead on any of these six.
I did NOT expect Google to jump this far ahead on coding benchmarks. HumanEval at 95.1% is the highest I've recorded from any reasoning model.
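For anyone who wants to check the "4 of 6" tally, here is a minimal sketch that recomputes the leader and the margin over the runner-up for each benchmark. The numbers are copied from the table above; the function names `leaders` and `win_counts` are my own and purely illustrative.

```python
# Scores (percent) transcribed from the benchmark table above.
MODELS = ("Gemini 2.5 Pro", "o1-preview", "DeepSeek R1")
SCORES = {
    "MATH (500)": (95.2, 96.4, 97.3),
    "GPQA Diamond": (74.1, 73.3, 71.5),
    "AIME 2024": (72.0, 74.4, 79.8),
    "HumanEval": (95.1, 93.7, 92.6),
    "LiveCodeBench": (68.2, 63.4, 65.9),
    "Natural2Code": (91.4, 87.2, 85.3),
}


def leaders(scores=SCORES):
    """Map each benchmark to (winning model, margin over the runner-up in points)."""
    result = {}
    for bench, row in scores.items():
        ranked = sorted(zip(MODELS, row), key=lambda pair: pair[1], reverse=True)
        (winner, top), (_, second) = ranked[0], ranked[1]
        result[bench] = (winner, round(top - second, 1))
    return result


def win_counts(scores=SCORES):
    """Count how many benchmarks each model leads, including models with zero wins."""
    counts = {model: 0 for model in MODELS}
    for winner, _ in leaders(scores).values():
        counts[winner] += 1
    return counts


if __name__ == "__main__":
    for bench, (winner, margin) in leaders().items():
        print(f"{bench}: {winner} (+{margin} pts)")
    print(win_counts())
```

Ranking by margin rather than just by winner makes it easy to see which leads are thin (GPQA Diamond) and which are not (Natural2Code).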
My 200-problem evaluation
I ran all three through my standard reasoning test set:
| Category (50 problems each) | Gemini 2.5 Pro | o1-preview | DeepSeek R1 |
|-----------------------------|---------------|------------|-------------|
| Competition math | 80% | 82% | 84% |
| Graduate science | 72% | 70% | 68% |
| Coding (LeetCode Hard) | 78% | 74% | 76% |
| Logical reasoning | 76% | 74% | 72% |
| Overall | 76.5% | 75.0% | 75.0% |
Gemini leads by 1.5 points on my evaluation: 153 of 200 problems solved versus 150 each for o1-preview and DeepSeek R1. A small margin, but it's ahead. The math category is the only one where it trails both competitors.
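The "Overall" row is just the unweighted mean of the four categories, since each has 50 problems. A short sketch that reproduces it from the category percentages above, and converts them back to solved-problem counts (the helper names are hypothetical, not from any library):

```python
# Category accuracy (percent) from the 200-problem table above.
PROBLEMS_PER_CATEGORY = 50
CATEGORY_SCORES = {
    "Gemini 2.5 Pro": {"math": 80, "science": 72, "coding": 78, "logic": 76},
    "o1-preview": {"math": 82, "science": 70, "coding": 74, "logic": 74},
    "DeepSeek R1": {"math": 84, "science": 68, "coding": 76, "logic": 72},
}


def overall(model):
    """Unweighted mean accuracy across categories (valid because categories are equal-sized)."""
    scores = CATEGORY_SCORES[model].values()
    return sum(scores) / len(scores)


def solved(model):
    """Total problems solved, recovered from the per-category percentages."""
    return sum(pct * PROBLEMS_PER_CATEGORY // 100 for pct in CATEGORY_SCORES[model].values())


if __name__ == "__main__":
    for model in CATEGORY_SCORES:
        print(f"{model}: {overall(model):.1f}% ({solved(model)} of 200)")
```

Seen as raw counts, the 1.5-point lead is a three-problem gap, which is worth keeping in mind before reading too much into it.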
The cost comparison
| Metric | Gemini 2.5 Pro | o1-preview | DeepSeek R1 |
|--------|---------------|------------|-------------|
| Input (per M tokens) | $1.25 | $15.00 | $0.55 |
| Output (per M tokens) | $10.00 | $60.00 | $2.19 |
| Thinking tokens (per M) | $1.25 | Hidden (included in output) | $2.19 |
| Avg cost per hard problem | ~$0.08 | ~$0.87 | ~$0.022 |
Sources: Google, OpenAI, DeepSeek pricing pages.
Google priced Gemini 2.5 Pro at $1.25/M input. That's 12x cheaper than o1-preview's $15/M. But 2.3x more expensive than DeepSeek R1's $0.55/M.
The average hard reasoning problem costs about $0.08 on Gemini 2.5 Pro. That's roughly 11x cheaper than o1-preview ($0.87) but 3.6x more than DeepSeek R1 ($0.022).
DeepSeek R1 remains the clear cost leader. But Gemini 2.5 Pro offers a middle ground: better benchmarks than R1 on coding and science, at a price that's much more accessible than o1.
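The multipliers in this section are plain ratios of the table entries. Here is a small sketch for recomputing them; the prices are copied from the cost table above, and the `PRICES` layout and `ratio` helper are my own illustrative choices.

```python
# USD figures from the cost table above: per-million-token prices
# and the approximate average cost of one hard problem.
PRICES = {
    "Gemini 2.5 Pro": {"input": 1.25, "output": 10.00, "per_problem": 0.08},
    "o1-preview": {"input": 15.00, "output": 60.00, "per_problem": 0.87},
    "DeepSeek R1": {"input": 0.55, "output": 2.19, "per_problem": 0.022},
}


def ratio(expensive, cheap, metric):
    """How many times more `expensive` costs than `cheap` on the given metric."""
    return PRICES[expensive][metric] / PRICES[cheap][metric]


if __name__ == "__main__":
    pairs = [
        ("o1-preview", "Gemini 2.5 Pro"),
        ("Gemini 2.5 Pro", "DeepSeek R1"),
        ("o1-preview", "DeepSeek R1"),
    ]
    for expensive, cheap in pairs:
        for metric in ("input", "output", "per_problem"):
            print(f"{expensive} vs {cheap} ({metric}): {ratio(expensive, cheap, metric):.1f}x")
```

Note that the per-problem ratios differ from the per-token ratios because the models spend different numbers of tokens on the same problem.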
The thinking token transparency
Google followed DeepSeek's lead here: thinking tokens are visible. You can see the model's reasoning process. This is different from o1, where the thinking tokens are hidden.
| Feature | Gemini 2.5 Pro | o1-preview | DeepSeek R1 |
|---------|---------------|------------|-------------|
| Thinking visible | Yes | No (summarized) | Yes |
| Thinking toggle | Yes (on/off) | No (always on) | Model-dependent |
| Max thinking tokens | 24,576 | Unknown | ~32K |
Sources: Google AI Studio, OpenAI docs, DeepSeek API docs.
The toggle is useful. For simple questions, you can turn thinking off and get fast, cheap responses. For hard problems, turn it on and pay for the extra reasoning. o1 doesn't give you that choice.
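To put a rough number on that trade-off, here is a sketch of a per-request cost estimate with thinking on and off. It uses the Gemini 2.5 Pro prices and the 24,576-token cap from the tables above. The token counts in the example are made-up placeholders, not measurements, and `request_cost` is a hypothetical helper, not part of any SDK.

```python
# Gemini 2.5 Pro figures from the tables above (USD per million tokens).
INPUT_PRICE = 1.25
OUTPUT_PRICE = 10.00
THINKING_PRICE = 1.25
MAX_THINKING_TOKENS = 24_576


def request_cost(input_tokens, output_tokens, thinking_tokens=0):
    """Estimated USD cost of one request; thinking_tokens=0 models thinking turned off."""
    if not 0 <= thinking_tokens <= MAX_THINKING_TOKENS:
        raise ValueError(f"thinking_tokens must be between 0 and {MAX_THINKING_TOKENS}")
    total = (
        input_tokens * INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
        + thinking_tokens * THINKING_PRICE
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request sizes, chosen only to illustrate the calculation.
    off = request_cost(input_tokens=1_000, output_tokens=500)
    on = request_cost(input_tokens=1_000, output_tokens=500, thinking_tokens=8_000)
    print(f"thinking off: ${off:.5f}")
    print(f"thinking on:  ${on:.5f}")
```

Swap in your own token counts from real responses; the point is only that the thinking budget is a separate line item you can zero out for easy questions.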
Where this leaves the reasoning race
| Model | Best at | Worst at | Cost tier |
|-------|---------|----------|----------|
| DeepSeek R1 | Math, cost efficiency | Science Q&A | Cheapest |
| Gemini 2.5 Pro | Coding, science Q&A | Pure math | Mid-range |
| o1-preview | Consistency | Cost efficiency | Expensive |
Three reasoning models, three different strengths. The "best" depends entirely on what you're using it for and how much you're willing to spend.
Six months ago, o1 was the only reasoning model. Now there are three competitive options, and the cheapest one costs about 40x less per hard problem than the original ($0.022 vs $0.87). The reasoning model market is speed-running the same commoditization curve that chat models went through in 2023-2024.
My spreadsheet for reasoning models has 14 columns now. I started with 3.
If you found this interesting, you might also like:
- GPT-3 vs GPT-J: the first real open source challenger, in data
- Google's PaLM has 540 billion parameters. Let me put that number in context.
- ChatGPT vs GPT-3: same model family, wildly different results. The data.
- Claude vs GPT-4: my first head-to-head data comparison
- Mistral Large vs GPT-4 vs Claude 3 Opus: the three-way benchmark
-- dataku