Benchmark Analysis · February 3, 2025 · 4 min read

Claude 3.5 Sonnet is still #1 on Chatbot Arena. For how long?

Six months at the top of the LMSYS leaderboard. I pulled the vote data and looked at the categories where Claude 3.5 Sonnet wins most decisively: coding (Elo 1290), creative writing (1285), and instruction following (1280).

Six months.

Anthropic's Claude 3.5 Sonnet has been sitting at the top of the LMSYS Chatbot Arena since it launched, and nothing has knocked it off. Not GPT-4o. Not Gemini 2.0. Not DeepSeek V3.

I pulled the vote data and dug into why.

The Elo standings (as of February 2025)

| Rank | Model | Overall Elo | Coding | Creative Writing | Instruction Following |
|------|-------|-------------|--------|------------------|-----------------------|
| 1 | Claude 3.5 Sonnet (new) | 1269 | 1290 | 1285 | 1280 |
| 2 | GPT-4o (Nov) | 1261 | 1275 | 1258 | 1267 |
| 3 | Gemini 2.0 Flash | 1254 | 1260 | 1248 | 1255 |
| 4 | DeepSeek V3 | 1249 | 1252 | 1240 | 1244 |
| 5 | Grok 2 | 1243 | 1240 | 1246 | 1238 |
| 6 | Llama 3.1 405B | 1230 | 1225 | 1228 | 1235 |
| 7 | Mistral Large 2 | 1222 | 1218 | 1220 | 1226 |

Source: LMSYS Chatbot Arena leaderboard, February 2025 snapshot.

Claude 3.5 Sonnet leads by 8 Elo points overall. That's a small margin in absolute terms (under the standard Elo formula, roughly a 51% expected win rate against GPT-4o), but it's been consistent for six months. In practice, the model wins a slight majority of blind head-to-head comparisons against every other model on the board.
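To put those point gaps in perspective, here is a minimal sketch that converts an Elo difference into an expected head-to-head win rate. It assumes the textbook Elo logistic curve with a scale of 400 and ignores ties; the Arena's actual rating fit may differ, and the function name `expected_win_rate` is my own, not anything from the leaderboard's code.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated model.

    Assumption: standard Elo logistic curve (base 10, scale 400), no ties.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gaps taken from the tables in this post: overall, coding, creative writing.
for gap in (8, 15, 27):
    print(f"{gap:+d} Elo -> {expected_win_rate(gap):.1%} expected win rate")
```

Under these assumptions, even the 27-point creative writing gap works out to a win rate in the low-to-mid 50s, which is why consistency over time matters more than the size of the lead.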

Where Claude dominates

The category breakdowns tell the real story. Claude's biggest leads are in:

| Category | Claude 3.5 Sonnet Elo | Gap to #2 | #2 model |
|----------|-----------------------|-----------|----------|
| Coding | 1290 | +15 | GPT-4o |
| Creative writing | 1285 | +27 | GPT-4o |
| Instruction following | 1280 | +13 | GPT-4o |
| Math | 1255 | +3 | GPT-4o |
| Knowledge/factual | 1260 | -2 | GPT-4o (leads) |

Source: LMSYS Chatbot Arena category leaderboards.

Coding: +15 Elo over GPT-4o. Creative writing: +27 Elo (that's a big gap). Instruction following: +13 Elo.

The one area where GPT-4o edges ahead is factual knowledge/retrieval, by a narrow 2 points. On everything else, Claude leads.
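For anyone who wants to reproduce the gap column, here is a rough sketch of the arithmetic. It uses only the three category scores listed in the standings table for the top three models; the `standings` dictionary and the `gap_to_runner_up` helper are illustrative names of mine, not part of any LMSYS tooling.

```python
# Category Elo scores copied from the standings table (top three models only).
standings = {
    "Claude 3.5 Sonnet (new)": {"Coding": 1290, "Creative writing": 1285, "Instruction following": 1280},
    "GPT-4o (Nov)": {"Coding": 1275, "Creative writing": 1258, "Instruction following": 1267},
    "Gemini 2.0 Flash": {"Coding": 1260, "Creative writing": 1248, "Instruction following": 1255},
}


def gap_to_runner_up(scores: dict, category: str) -> tuple:
    """Return (leader, runner_up, point_gap) for one category."""
    ranked = sorted(scores, key=lambda model: scores[model][category], reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    return leader, runner_up, scores[leader][category] - scores[runner_up][category]


for category in ("Coding", "Creative writing", "Instruction following"):
    leader, runner_up, gap = gap_to_runner_up(standings, category)
    print(f"{category}: {leader} leads {runner_up} by {gap:+d}")
```

Math and knowledge/factual are left out because the standings table does not list per-model scores for those categories.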

The durability question

Here's what fascinates me about this data. Most #1 models hold the top spot for 2-4 months before getting displaced. GPT-4 held it for about 5 months (March to August 2023). GPT-4 Turbo held it for roughly 3 months.

Claude 3.5 Sonnet is at 6 months and counting.

| Model | Time at #1 | Displaced by |
|-------|------------|--------------|
| GPT-4 | ~5 months | GPT-4 Turbo |
| GPT-4 Turbo | ~3 months | Claude 3 Opus |
| Claude 3 Opus | ~3 months | Claude 3.5 Sonnet |
| Claude 3.5 Sonnet | 6+ months | ? |

Sources: LMSYS Chatbot Arena historical data, UC Berkeley LMSYS research.

The threat board is getting crowded, though. DeepSeek R1 just launched with reasoning capabilities that exceed Claude's on math-specific tasks. Gemini 2.5 Pro is rumored. OpenAI hasn't released a major general model since GPT-4o.

My prediction

I think Claude 3.5 Sonnet's reign ends in March or April 2025. Either Anthropic replaces it with something newer (Claude 3.6? Claude 3.7?), or a competitor makes a jump.

But six months at #1, across hundreds of thousands of human votes, in the most competitive model market we've ever had? That's impressive by any metric.

My morning ritual of checking the Arena leaderboard has become predictable. Same model at the top. Every day.

I almost miss the chaos.


-- dataku
