Gemini 2.5 Pro just took #1 on Chatbot Arena. The data behind the shift.
For the first time, a Google model sits at the top of the LMSYS leaderboard. I analyzed the vote patterns. Gemini 2.5 Pro dominates in coding and math. Claude still leads in creative tasks. The throne is now contestable.
For the first time in the history of the LMSYS Chatbot Arena, a Google model sits at #1.
Gemini 2.5 Pro just took the top spot. I've been watching the vote counts climb for two weeks and today it officially flipped.
The new leaderboard
| Rank | Model | Overall Elo | Change |
|------|-------|-------------|--------|
| 1 | Gemini 2.5 Pro | 1282 | +12 (was #3) |
| 2 | Claude 3.7 Sonnet | 1278 | -2 (was #1) |
| 3 | GPT-4o (latest) | 1268 | +1 |
| 4 | Grok 3 | 1264 | +2 |
| 5 | DeepSeek V3 | 1258 | +3 |
| 6 | DeepSeek R1 | 1255 | New entry |
| 7 | Llama 4 Maverick | 1248 | New entry |
Sources: LMSYS Chatbot Arena, April 2025 snapshot.
Gemini 2.5 Pro at 1282. Claude 3.7 Sonnet at 1278. A 4-point gap. Narrow, but consistent across multiple days of voting.
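To put a 4-point gap in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head win rate. It assumes the standard logistic Elo formula with a 400-point scale. The function name `expected_win_rate` is my own, and the Arena's actual rating fit may differ in detail.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model.

    Assumes the standard logistic Elo formula (base 10, 400-point scale).
    This is an illustrative assumption, not the Arena's exact pipeline.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))


# Gap taken from the leaderboard above:
# 1282 (Gemini 2.5 Pro) vs 1278 (Claude 3.7 Sonnet)
gap = 1282 - 1278
print(f"{gap}-point gap -> {expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, a 4-point lead works out to roughly a 50.6% expected win rate in a direct matchup, which is barely better than a coin flip.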
Category breakdown
The overall number hides the real story. Let me break it down by category:
| Category | #1 model | #2 model | Gap |
|----------|---------|---------|-----|
| Coding | Gemini 2.5 Pro (1296) | Claude 3.7 Sonnet (1290) | +6 |
| Math | Gemini 2.5 Pro (1298) | DeepSeek R1 (1292) | +6 |
| Creative writing | Claude 3.7 Sonnet (1288) | Gemini 2.5 Pro (1271) | +17 |
| Instruction following | Claude 3.7 Sonnet (1284) | Gemini 2.5 Pro (1280) | +4 |
| Factual Q&A | Gemini 2.5 Pro (1290) | GPT-4o (1284) | +6 |
| Long context | Gemini 2.5 Pro (1302) | Claude 3.7 Sonnet (1268) | +34 |
Sources: LMSYS Chatbot Arena category leaderboards.
Gemini 2.5 Pro leads in coding, math, factual Q&A, and long context. Claude 3.7 Sonnet leads in creative writing and instruction following.
That long context gap is massive: +34 Elo. Google's 1M+ token context window gives it a structural advantage that other models can't match yet.
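For anyone who wants to re-derive the gaps, here is a small sketch that recomputes them from the category table and ranks them. The ratings are copied from the table above. The `categories` layout and the `category_gaps` helper are my own illustrative choices, not anything the Arena publishes.

```python
from collections import Counter

# Ratings copied from the category table above:
# category -> (leader, leader Elo, runner-up, runner-up Elo)
categories = {
    "Coding": ("Gemini 2.5 Pro", 1296, "Claude 3.7 Sonnet", 1290),
    "Math": ("Gemini 2.5 Pro", 1298, "DeepSeek R1", 1292),
    "Creative writing": ("Claude 3.7 Sonnet", 1288, "Gemini 2.5 Pro", 1271),
    "Instruction following": ("Claude 3.7 Sonnet", 1284, "Gemini 2.5 Pro", 1280),
    "Factual Q&A": ("Gemini 2.5 Pro", 1290, "GPT-4o", 1284),
    "Long context": ("Gemini 2.5 Pro", 1302, "Claude 3.7 Sonnet", 1268),
}


def category_gaps(table):
    """Return (category, leader, gap) tuples sorted by gap, largest first."""
    rows = [
        (name, leader, top - second)
        for name, (leader, top, _runner_up, second) in table.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# How many categories each model leads
leads = Counter(leader for leader, *_ in categories.values())

for name, leader, gap in category_gaps(categories):
    print(f"{name:<22} {leader:<18} +{gap}")
print(dict(leads))
```

Sorted this way, long context (+34) and creative writing (+17) stand apart, while every other category is decided by 6 points or fewer.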
How the #1 changed hands
| Period | #1 model | Duration |
|--------|---------|----------|
| Mar 2023 - Aug 2023 | GPT-4 | ~5 months |
| Aug 2023 - Feb 2024 | GPT-4 Turbo | ~6 months |
| Feb 2024 - Jun 2024 | Claude 3 Opus | ~4 months |
| Jun 2024 - Mar 2025 | Claude 3.5 Sonnet / 3.7 Sonnet | ~9 months |
| Apr 2025 - present | Gemini 2.5 Pro | ? |
Sources: LMSYS Chatbot Arena historical data.
The Sonnet line held the #1 spot for about 9 months (3.5 and 3.7 combined). Google's ascent breaks the longest streak in Arena history.
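The durations can be checked with a few lines of month arithmetic. This is a rough sketch: the table gives only month-level dates, so the counts are approximate, and the `reigns` layout and helper names are my own.

```python
from collections import defaultdict

# Completed reigns from the table above:
# (provider, model, start (year, month), end (year, month))
reigns = [
    ("OpenAI", "GPT-4", (2023, 3), (2023, 8)),
    ("OpenAI", "GPT-4 Turbo", (2023, 8), (2024, 2)),
    ("Anthropic", "Claude 3 Opus", (2024, 2), (2024, 6)),
    ("Anthropic", "Claude 3.5 Sonnet / 3.7 Sonnet", (2024, 6), (2025, 3)),
]


def months_between(start, end):
    """Whole months from start to end, both given as (year, month)."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def months_by_provider(rows):
    """Total months at #1 per provider, summed over that provider's reigns."""
    totals = defaultdict(int)
    for provider, _model, start, end in rows:
        totals[provider] += months_between(start, end)
    return dict(totals)


for provider, model, start, end in reigns:
    print(f"{model:<32} {months_between(start, end)} months")
print(months_by_provider(reigns))
```

On the table's own dates, the Sonnet row is the longest single entry at 9 months. Grouped by provider, OpenAI's two reigns add up to 11 months and Anthropic's two to 13.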
What changed
Google didn't just make a slightly better model. They improved in the specific categories that drive Arena votes:
| Area of improvement | Evidence |
|-------------------|---------|
| Coding quality | +6 Elo over Claude, first time Google has led on coding |
| Math reasoning | Extended thinking now competitive with dedicated reasoning models |
| Long context usage | Arena voters increasingly test with long documents |
| Response speed | Gemini 2.5 Pro is fast, which matters in blind comparisons |
Speed is an underappreciated factor in Arena. When voters compare two responses side by side, the faster model gets a slight psychological advantage. Gemini 2.5 Pro is noticeably faster than Claude 3.7 Sonnet in my experience.
Is this permanent?
Probably not. The Sonnet line behind Claude 3.7 has been at the top for about 9 months, which is a long time by model standards. Anthropic presumably has something newer in the pipeline. OpenAI hasn't released its next-gen general model (o3 is reasoning-only, GPT-4.5 is a research preview).
The Arena leaderboard is now genuinely competitive among three providers:
| Provider | Best model | Elo | Gap to #1 |
|----------|-----------|-----|-----------|
| Google | Gemini 2.5 Pro | 1282 | 0 |
| Anthropic | Claude 3.7 Sonnet | 1278 | -4 |
| OpenAI | GPT-4o | 1268 | -14 |
Google: leading. Anthropic: within striking distance. OpenAI: falling behind on the general leaderboard (though o3 leads on reasoning-specific tasks).
I've been checking the Arena daily since 2023. This is the first time I'd describe the top as "genuinely contested." Not one dominant model. Not two close competitors. Three providers within 14 Elo points. Any major release could shuffle the order.
My morning Arena check just got more interesting.
-- dataku