LMSYS Chatbot Arena has 200K votes. It might be the best benchmark we have.
LMSYS's crowdsourced Elo ratings are based on 200K+ human votes in blind model comparisons. I analyzed the vote distributions and demographic patterns. The data is noisy, but it's the closest thing to 'what real users think.'
I keep coming back to the LMSYS Chatbot Arena. Every time I compare it to traditional benchmarks, it tells a slightly different story. And I'm starting to think its story is more honest.
The Arena, run by researchers at UC Berkeley and the LMSYS group, is simple. Two anonymous models respond to the same prompt. A human picks which response is better. Votes accumulate into Elo ratings (the chess ranking system). No fixed test set. No multiple choice. Just humans judging outputs blind.
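For a feel of how votes turn into ratings, here is a minimal sketch of a textbook online Elo update. The K-factor of 4, the starting rating of 1000, and the model names are illustrative assumptions, not LMSYS's confirmed settings or pipeline.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def apply_vote(ratings: dict, model_a: str, model_b: str, outcome: float,
               k: float = 4.0, initial: float = 1000.0) -> None:
    """Update ratings in place. outcome: 1.0 = A wins, 0.0 = B wins, 0.5 = tie.

    k and initial are assumed values chosen for illustration.
    """
    r_a = ratings.get(model_a, initial)
    r_b = ratings.get(model_b, initial)
    e_a = expected_score(r_a, r_b)
    delta = k * (outcome - e_a)  # surprise-weighted adjustment
    ratings[model_a] = r_a + delta
    ratings[model_b] = r_b - delta  # zero-sum: B loses what A gains


# Hypothetical votes: (model shown as A, model shown as B, outcome for A)
votes = [
    ("model_x", "model_y", 1.0),
    ("model_y", "model_z", 0.5),
    ("model_x", "model_z", 1.0),
]

ratings = {}
for model_a, model_b, outcome in votes:
    apply_vote(ratings, model_a, model_b, outcome)

for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected result does, which is why a model's rating settles once its win rate matches what the curve predicts.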
As of early September 2023, they've collected over 200,000 votes. Let me dig into what that data actually looks like.
Current Elo rankings (September 2023)
| Rank | Model | Elo rating | Confidence interval (+/-) |
|------|-------|-----------|--------------------------|
| 1 | GPT-4 | 1256 | 5 |
| 2 | Claude v1.3 | 1155 | 8 |
| 3 | GPT-3.5-turbo | 1130 | 6 |
| 4 | Claude Instant | 1108 | 9 |
| 5 | Vicuna-33B | 1075 | 10 |
| 6 | Llama 2 70B-chat | 1065 | 9 |
| 7 | WizardLM-30B | 1048 | 12 |
| 8 | Vicuna-13B | 1042 | 10 |
| 9 | Llama 2 13B-chat | 1025 | 11 |
| 10 | MPT-30B-chat | 1010 | 14 |
Source: LMSYS Chatbot Arena leaderboard, accessed September 2023.
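GPT-4 sits 100+ Elo points above everything else. In chess terms, that makes it a clear favorite rather than a different league: a 100-point gap means the stronger player is expected to score about 64%. Ranks #2 through #4 (Claude v1.3, GPT-3.5-turbo, Claude Instant) are packed within about 50 points of each other. Then there's a gap to the open-source models.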
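One rough way to read the confidence intervals is to check which neighboring ranks have overlapping ranges. The sketch below uses the numbers from the leaderboard table and treats each "+/-" value as a simple symmetric interval, which is an assumption on my part. Overlap is a quick heuristic, not a formal significance test.

```python
# (model, Elo rating, +/- interval) copied from the leaderboard table above
leaderboard = [
    ("GPT-4", 1256, 5),
    ("Claude v1.3", 1155, 8),
    ("GPT-3.5-turbo", 1130, 6),
    ("Claude Instant", 1108, 9),
    ("Vicuna-33B", 1075, 10),
    ("Llama 2 70B-chat", 1065, 9),
    ("WizardLM-30B", 1048, 12),
    ("Vicuna-13B", 1042, 10),
    ("Llama 2 13B-chat", 1025, 11),
    ("MPT-30B-chat", 1010, 14),
]


def intervals_overlap(higher, lower):
    """True if the lower-ranked model's upper bound reaches the
    higher-ranked model's lower bound."""
    _, elo_hi, ci_hi = higher
    _, elo_lo, ci_lo = lower
    return elo_lo + ci_lo >= elo_hi - ci_hi


# Compare each model only with the one ranked directly below it
overlapping = [
    (a[0], b[0])
    for a, b in zip(leaderboard, leaderboard[1:])
    if intervals_overlap(a, b)
]

for upper, lower in overlapping:
    print(f"{upper} and {lower}: intervals overlap")
```

On these numbers, none of the top four overlap with their neighbors, while every adjacent pair from rank 5 down does. So the ordering among the open-source models is much less certain than the ordering at the top.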
What Elo actually means in practice
Elo rating differences translate to expected win rates:
| Elo gap | Expected win rate for higher-rated model |
|---------|----------------------------------------|
| 0 | 50% |
| 50 | 57% |
| 100 | 64% |
| 150 | 70% |
| 200 | 76% |
| 250 | 81% |
So GPT-4 (1256) vs GPT-3.5-turbo (1130) = 126 Elo gap = ~67% expected win rate for GPT-4. That feels about right from my own testing. GPT-4 is better, but GPT-3.5 still wins about a third of the time.
Claude v1.3 (1155) vs GPT-3.5-turbo (1130) = 25 Elo gap = ~54% win rate for Claude. Basically a coin flip with a slight edge to Claude. That also matches my experience.
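The conversion between Elo gap and win rate is the standard Elo logistic formula, and it runs in both directions. A minimal sketch, assuming the usual base-10, 400-point scale and ignoring ties (the function names are mine):

```python
import math


def win_probability(elo_gap: float) -> float:
    """Expected win rate for the higher-rated model, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Inverse: the rating gap implied by an observed head-to-head win rate."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Reproduce the gap table
for gap in (0, 50, 100, 150, 200, 250):
    print(f"{gap:>3} points -> {win_probability(gap):.0%}")

# The two matchups discussed above
print(f"GPT-4 vs GPT-3.5-turbo: {win_probability(1256 - 1130):.1%}")
print(f"Claude v1.3 vs GPT-3.5-turbo: {win_probability(1155 - 1130):.1%}")

# Going the other way: winning two games out of three implies this gap
print(f"Gap for a 2/3 win rate: {elo_gap_for_win_rate(2 / 3):.0f} points")
```

The inverse is handy when you have your own head-to-head test results and want to see whether they line up with the leaderboard gap.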
Why this works better than traditional benchmarks
I've been thinking about this comparison a lot:
| Feature | Traditional benchmarks (MMLU etc.) | LMSYS Chatbot Arena |
|---------|-----------------------------------|--------------------|
| Test data | Fixed, public, known in advance | Changing, user-generated |
| Gaming potential | High (train on test set) | Low (can't predict prompts) |
| Evaluator | Automatic (pattern matching) | Human judgment |
| What it measures | Specific skills (knowledge, reasoning) | Overall perceived quality |
| Sample bias | None (full test set) | Heavy (self-selected internet users) |
| Cost | Free | Free for users, hosting costs for LMSYS |
| Reproducibility | Perfect | Approximate (depends on voter pool) |
Traditional benchmarks are reproducible and precise. The Arena is messy and approximate. But traditional benchmarks can be gamed (as I wrote about regarding the Hugging Face leaderboard). The Arena is much harder to game, because the "test" is whatever a random human types.
The problems with the Arena
It's not perfect. Three issues I see in the data:
1. Voter demographics are skewed.
The Arena's users are disproportionately:
- Male (estimated 80%+)
- Technical (developers, researchers)
- English-speaking
- Already familiar with AI
This means the Elo ratings reflect "what tech-savvy English speakers prefer," not "what the general public prefers." For a customer support chatbot serving non-technical users, the Arena rankings might not apply.
2. Prompt distribution is uneven.
| Prompt category (my estimate from public samples) | Estimated % of votes |
|--------------------------------------------------|---------------------|
| Coding/technical | 30-35% |
| Creative writing | 15-20% |
| General knowledge | 15-20% |
| Reasoning/math | 10-15% |
| Roleplay/fun | 10-15% |
| Other | 5-10% |
Source: My analysis of publicly shared Arena prompts and LMSYS research paper.
Coding is overrepresented (relative to general population use) because the voter base skews technical. This benefits GPT-4, which excels at code. Claude might rank higher if the prompt distribution had more creative writing and long-form tasks.
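To see how much the prompt mix can move a head-to-head result, here is a sketch that reweights per-category win rates. The Arena weights are the midpoints of my estimated ranges above. The per-category win rates and the alternative mix are made-up placeholders, not measured data, so only the mechanics matter here.

```python
# Midpoints of the estimated prompt-category shares (they sum to 1.0)
arena_mix = {
    "coding": 0.325,
    "creative": 0.175,
    "knowledge": 0.175,
    "reasoning": 0.125,
    "roleplay": 0.125,
    "other": 0.075,
}

# HYPOTHETICAL win rates of model A over model B per category (not real data)
win_rate_a = {
    "coding": 0.70,
    "creative": 0.50,
    "knowledge": 0.60,
    "reasoning": 0.70,
    "roleplay": 0.50,
    "other": 0.60,
}

# HYPOTHETICAL alternative mix with less coding and more creative writing
general_mix = {
    "coding": 0.10,
    "creative": 0.35,
    "knowledge": 0.25,
    "reasoning": 0.10,
    "roleplay": 0.10,
    "other": 0.10,
}


def overall_win_rate(win_rates: dict, mix: dict) -> float:
    """Weighted average of per-category win rates; weights are normalized."""
    total = sum(mix.values())
    return sum(win_rates[cat] * weight for cat, weight in mix.items()) / total


print(f"Arena mix:   {overall_win_rate(win_rate_a, arena_mix):.1%}")
print(f"General mix: {overall_win_rate(win_rate_a, general_mix):.1%}")
```

With these placeholder numbers, model A's overall edge shrinks once coding prompts carry less weight, which is the mechanism behind the Claude point above.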
3. Position bias.
LMSYS's own research found a slight bias toward the model presented on the left side of the screen. They've attempted to correct for this statistically, but it introduces noise. With 200K votes, the position bias effect on Elo ratings is small (maybe 5-10 points), but it's not zero.
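If you had the raw votes with display positions, a first check for position bias could look like the sketch below. It estimates the left-side win rate with a normal-approximation 95% interval. The vote counts are hypothetical placeholders, and this is not the correction LMSYS applies.

```python
import math


def left_win_rate_ci(left_wins: int, right_wins: int, z: float = 1.96):
    """Left-side win rate with a normal-approximation (Wald) interval.

    Ties are assumed to be excluded before counting.
    """
    n = left_wins + right_wins
    p = left_wins / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, p - half_width, p + half_width


# HYPOTHETICAL counts for illustration only
rate, low, high = left_win_rate_ci(left_wins=5200, right_wins=4800)
print(f"Left-side win rate: {rate:.1%} (95% interval {low:.1%} to {high:.1%})")

# With random left/right assignment, an interval that excludes 50%
# suggests voters favor one side regardless of which model is there.
print("Position bias detected" if low > 0.5 or high < 0.5 else "No clear bias")
```

This only works because models are assigned to sides at random. If assignment were not random, a left-side edge could just mean stronger models happened to appear on the left.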
Arena vs MMLU: where they disagree
Here's the interesting comparison. How do Arena rankings compare to MMLU scores?
| Model | Arena Elo rank | MMLU rank | Disagree? |
|-------|---------------|-----------|-----------|
| GPT-4 | #1 | #1 | No |
| Claude v1.3 | #2 | #4 | Yes |
| GPT-3.5-turbo | #3 | #2 | Yes |
| Llama 2 70B | #6 | #3 | Yes |
Claude ranks #2 in the Arena but lower on MMLU. Llama 2 70B ranks #3 on MMLU but #6 in the Arena. What's happening?
MMLU measures knowledge. The Arena measures overall user satisfaction. Users care about more than factual knowledge: they care about helpfulness, tone, formatting, and how well the model follows instructions. Claude's conversational quality pushes it up in the Arena. Llama 2's strong knowledge scores don't fully translate to user preference.
This disagreement is exactly why we need both types of evaluation. Benchmarks tell you what a model can do. The Arena tells you what users prefer. Those are different things.
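To put a number on the disagreement, here is a sketch that computes a Spearman rank correlation for the four models in the comparison table. Re-ranking the models 1 to 4 within this subset is my own simplification, and four data points are far too few to draw firm conclusions.

```python
# Ranks from the Arena vs MMLU table above
models = ["GPT-4", "Claude v1.3", "GPT-3.5-turbo", "Llama 2 70B"]
arena_rank = [1, 2, 3, 6]
mmlu_rank = [1, 4, 2, 3]


def rerank(ranks):
    """Convert ranks with gaps (e.g. 1, 2, 3, 6) to 1..n within the subset."""
    ordered = sorted(ranks)
    return [ordered.index(r) + 1 for r in ranks]


def spearman_rho(ranks_x, ranks_y):
    """Spearman correlation for untied ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(ranks_x)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))


arena = rerank(arena_rank)
mmlu = rerank(mmlu_rank)

for name, a, m in zip(models, arena, mmlu):
    print(f"{name}: Arena #{a}, MMLU #{m}, shift {m - a:+d}")

rho = spearman_rho(arena, mmlu)
print(f"Spearman rho: {rho:.2f}")
```

A value of 1.0 would mean both rankings agree exactly. Tracking this number over time, on a larger set of models, is one way to see whether benchmarks and user preference are converging or drifting apart.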
My take
The Chatbot Arena isn't replacing MMLU or HellaSwag. But it should be weighted equally or more in model comparisons. It's the closest thing we have to "what real users think," and that's ultimately what matters for product decisions.
I'm now tracking Arena Elo changes monthly alongside benchmark scores. When they agree, I'm confident. When they disagree, I dig deeper. That's the right approach.
200K votes is a solid sample. By year-end, I expect it to pass 500K. At that scale, the noise decreases, the demographic biases become measurable and correctable, and the Elo ratings become genuinely authoritative.
My morning data ritual now starts with checking the Arena leaderboard before anything else. It's become that useful.
-- dataku