Benchmark Analysis · July 14, 2025 · 4 min read

The frontier model gap just closed. Five models within 20 Elo points.

For the first time, the top five models on Chatbot Arena are within 20 Elo points of each other: Claude Opus 4, Gemini 2.5 Pro, Claude 3.7 Sonnet, GPT-4o, and Grok 3. I analyzed what "virtually tied" means for model selection.

Look at the top of the LMSYS Chatbot Arena leaderboard right now:

| Rank | Model | Elo |
|------|-------|-----|
| 1 | Claude Opus 4 | 1288 |
| 2 | Gemini 2.5 Pro | 1282 |
| 3 | Claude 3.7 Sonnet | 1278 |
| 4 | GPT-4o | 1271 |
| 5 | Grok 3 | 1268 |

Sources: LMSYS Chatbot Arena, July 2025 snapshot.

Twenty Elo points separate #1 from #5. For reference, a 20-point Elo gap in chess gives the higher-rated player an expected score of about 53%. Essentially a coin flip with a slight edge.

This is the tightest the frontier has ever been.

Historical gap comparison

| Date | #1 model (Elo) | #5 model (Elo) | Gap (Elo) |
|------|----------------|----------------|-----------|
| Jun 2023 | GPT-4 (1250) | Claude 1.3 (1180) | 70 |
| Dec 2023 | GPT-4 Turbo (1260) | Llama 2 70B (1160) | 100 |
| Jun 2024 | Claude 3.5 Sonnet (1269) | Llama 3.1 70B (1210) | 59 |
| Dec 2024 | Claude 3.5 Sonnet (1273) | DeepSeek V3 (1249) | 24 |
| Jul 2025 | Claude Opus 4 (1288) | Grok 3 (1268) | 20 |

Sources: LMSYS Chatbot Arena historical data.

The gap went from 100 Elo points (December 2023) to 20 (July 2025). The frontier is converging.
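The gap column is just the #1 rating minus the #5 rating. Here is a minimal Python sketch that recomputes it from the snapshots listed above. The `SNAPSHOTS` list and `top5_gaps` function are my own illustrative names, not part of any Arena tooling.

```python
# (date, #1 model, #1 Elo, #5 model, #5 Elo), copied from the table above.
SNAPSHOTS = [
    ("Jun 2023", "GPT-4", 1250, "Claude 1.3", 1180),
    ("Dec 2023", "GPT-4 Turbo", 1260, "Llama 2 70B", 1160),
    ("Jun 2024", "Claude 3.5 Sonnet", 1269, "Llama 3.1 70B", 1210),
    ("Dec 2024", "Claude 3.5 Sonnet", 1273, "DeepSeek V3", 1249),
    ("Jul 2025", "Claude Opus 4", 1288, "Grok 3", 1268),
]


def top5_gaps(snapshots):
    """Return (date, gap) pairs, where gap = #1 Elo minus #5 Elo."""
    return [(date, top - fifth) for date, _, top, _, fifth in snapshots]


if __name__ == "__main__":
    for date, gap in top5_gaps(SNAPSHOTS):
        print(f"{date}: {gap} Elo between #1 and #5")
```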

What "20 Elo points" actually means

In Chatbot Arena's blind comparison format, a 20-Elo difference translates to:

| Elo difference | Win probability for higher-rated model |
|----------------|----------------------------------------|
| 0 points | 50.0% |
| 10 points | 51.4% |
| 20 points | 52.9% |
| 50 points | 57.1% |
| 100 points | 64.0% |

At 20 points, the top model wins about 52.9% of blind comparisons against the #5 model. In practice, users can't reliably tell them apart.
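Percentages like these come from the standard Elo expected-score formula, E = 1 / (1 + 10^(-d/400)), where d is the rating difference. A minimal Python sketch follows. It assumes Arena ratings sit on the usual 400-point logistic scale and that a tie counts as half a win. The function name is mine, not something from the leaderboard's code.

```python
def elo_win_probability(elo_diff: float) -> float:
    """Expected score of the higher-rated model for a given Elo gap.

    Assumes the standard 400-point logistic Elo scale, where a tie
    counts as half a win.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    for diff in (0, 10, 20, 50, 100):
        print(f"{diff:>3} points -> {elo_win_probability(diff):.1%}")
```

Plugging in your own gap is a quick sanity check before paying a premium for a few leaderboard points.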

Specialization matters more than overall rank

The overall Elo hides the real story. Each model has domains where it's clearly best:

| Domain | Best model | Elo in domain |
|--------|------------|---------------|
| Coding | Claude Opus 4 | 1302 |
| Creative writing | Claude 3.7 Sonnet | 1290 |
| Math (reasoning) | Gemini 2.5 Pro | 1298 |
| Multimodal | GPT-4o | 1285 |
| Long context | Gemini 2.5 Pro | 1310 |
| Speed | GPT-4o | N/A (fastest TTFT) |
| Cost efficiency | DeepSeek V3 | N/A ($0.27/M input) |

When overall rankings are this close, the right model depends entirely on your specific use case. There is no "best model" anymore. There are "best models for X."
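One way to act on "best models for X" is a small routing table keyed by task domain. The sketch below hard-codes the domain leaders from the table above. The names `DOMAIN_LEADERS`, `DEFAULT_MODEL`, and `pick_model` are hypothetical, the strings are display names rather than real API model identifiers, and the fallback is an arbitrary choice on my part.

```python
# Domain leaders as listed in the table above (July 2025 snapshot).
# Values are display names, not provider API identifiers.
DOMAIN_LEADERS = {
    "coding": "Claude Opus 4",
    "creative writing": "Claude 3.7 Sonnet",
    "math": "Gemini 2.5 Pro",
    "multimodal": "GPT-4o",
    "long context": "Gemini 2.5 Pro",
    "speed": "GPT-4o",
    "cost": "DeepSeek V3",
}

# Assumption: fall back to the overall #1 when the domain is not listed.
DEFAULT_MODEL = "Claude Opus 4"


def pick_model(task_domain: str) -> str:
    """Return the listed leader for a domain, or the default if unknown."""
    return DOMAIN_LEADERS.get(task_domain.strip().lower(), DEFAULT_MODEL)


if __name__ == "__main__":
    for domain in ("Coding", "math", "long context", "translation"):
        print(f"{domain}: {pick_model(domain)}")
```

In a real setup you would likely weigh cost and latency alongside the domain leader rather than route on domain alone.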

The implication for developers

| Old world (gap = 100 Elo) | New world (gap = 20 Elo) |
|---------------------------|--------------------------|
| Pick the best model, pay whatever it costs | Pick based on cost, speed, and domain strength |
| One provider dominates | Multi-provider strategies make sense |
| Switching models is risky | Switching is low-risk (quality is similar) |
| Model choice is a technical decision | Model choice is an economic decision |

When all frontier models are roughly equivalent in quality, the differentiators become price, speed, reliability, and API features. The "AI model" is becoming a commodity. The value shifts to the application layer.

This convergence was inevitable, but I expected it to take until 2027. We're here in mid-2025.

My model comparison spreadsheet used to have clear winners. Now it has footnotes. "Best on coding. But Gemini is better on math. But GPT-4o is faster. But DeepSeek is cheaper." Every cell has an asterisk.

The era of "just use GPT-4" is over. Welcome to the era of "it depends."



-- dataku
