Benchmark Analysis · May 27, 2024 · 6 min read

The LMSYS Elo gap between open and closed source models just shrank to 50 points

In January 2023, the Elo gap between the best open source model and GPT-4 was 200+ points. It's now about 50. I charted the convergence curve. At this rate, parity arrives around October 2024.

Every month I update one chart: the best open source model's Elo against the best closed source model's Elo on LMSYS Chatbot Arena. The two lines have been converging for 16 months.

This month, the gap hit 50 points. I want to show you what that means.

The convergence timeline

| Date | Best closed source model | Closed Elo | Best open source model | Open Elo | Gap |
|------|--------------------------|------------|------------------------|----------|-----|
| Jan 2023 | GPT-4 (early) | ~1260 | Vicuna-13B | ~1050 | ~210 |
| Apr 2023 | GPT-4 | ~1260 | Vicuna-33B | ~1090 | ~170 |
| Jul 2023 | GPT-4 | ~1255 | Llama 2 70B Chat | ~1115 | ~140 |
| Oct 2023 | GPT-4 Turbo | ~1262 | Mistral 7B Instruct | ~1135 | ~127 |
| Jan 2024 | GPT-4 Turbo | ~1260 | Mixtral 8x7B | ~1168 | ~92 |
| Mar 2024 | Claude 3 Opus | ~1270 | Llama 3 70B | ~1205 | ~65 |
| May 2024 | GPT-4o | ~1285 | Llama 3 70B Instruct | ~1234 | ~51 |

Sources: LMSYS Chatbot Arena leaderboard, monthly snapshots, my tracking data. Elo numbers approximate based on public leaderboard.

From a 210-point gap to a 51-point gap in 16 months. The convergence is steady. About 10 points per month.
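For anyone who wants to check the "about 10 points per month" figure, here is a minimal sketch that recomputes it from the approximate gaps in the table. The month offsets (counted from January 2023) and the function names are my own illustration, not part of any LMSYS tooling.

```python
# Gap figures copied from the timeline table (approximate Elo points).
# The first element of each pair is months elapsed since January 2023.
gaps = [
    (0, 210),   # Jan 2023
    (3, 170),   # Apr 2023
    (6, 140),   # Jul 2023
    (9, 127),   # Oct 2023
    (12, 92),   # Jan 2024
    (14, 65),   # Mar 2024
    (16, 51),   # May 2024
]


def endpoint_rate(points):
    """Average points closed per month between the first and last snapshot."""
    (m0, g0), (m1, g1) = points[0], points[-1]
    return (g0 - g1) / (m1 - m0)


def least_squares_rate(points):
    """Slope of an ordinary least-squares line through all snapshots,
    sign-flipped so that a shrinking gap gives a positive rate."""
    n = len(points)
    mean_m = sum(m for m, _ in points) / n
    mean_g = sum(g for _, g in points) / n
    cov = sum((m - mean_m) * (g - mean_g) for m, g in points)
    var = sum((m - mean_m) ** 2 for m, _ in points)
    return -cov / var


print(f"endpoint rate:      {endpoint_rate(gaps):.1f} points/month")
print(f"least-squares rate: {least_squares_rate(gaps):.1f} points/month")
```

Both estimates land a little under 10 points per month, so rounding to 10 is reasonable for a back-of-the-envelope trend.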

What Elo gaps mean in practice

Most people see "50 points" and don't know if that's a lot. Let me translate:

| Elo gap | Expected win rate (higher-rated model) | Practical meaning |
|---------|----------------------------------------|-------------------|
| 200 points | 76% | Higher model clearly better in most conversations |
| 150 points | 70% | Noticeable quality gap, but lower model has good moments |
| 100 points | 64% | Gap exists but takes multiple conversations to feel it |
| 50 points | 57% | Coin flip territory. Many users can't tell the difference. |
| 25 points | 54% | Statistically distinguishable, practically identical |
| 0 points | 50% | Parity |

Source: Elo rating mathematics, standard conversion formula.
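The win rates in the table follow from the standard logistic Elo formula with a 400-point scale. A minimal sketch, assuming ties are ignored and the Arena ratings behave like ordinary Elo ratings (the function name is mine):

```python
def expected_win_rate(elo_gap: float) -> float:
    """Probability that the higher-rated model wins a single comparison,
    using the standard Elo expectation with a 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Reproduce the rows of the table above.
for gap in (200, 150, 100, 50, 25, 0):
    print(f"{gap:>3} points -> {expected_win_rate(gap):.0%}")
```

Plugging in the current 51-point gap gives essentially the same answer as the 50-point row.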

At 50 points, the best closed source model (GPT-4o) would win a head-to-head comparison against the best open source model (Llama 3 70B) about 57% of the time. In other words: if you showed a random user two responses, one from each model, they'd pick GPT-4o only slightly more often than chance.

For many use cases, a 57% vs 43% win rate isn't worth paying many times more for the closed source API (roughly 17-38x per output token at the prices listed further down).

The extrapolation (with caveats)

If the gap keeps closing at ~10 points per month:

| Projected date | Projected gap | What that means |
|----------------|---------------|-----------------|
| Jun 2024 | ~40 points | Hard to distinguish in blind tests |
| Aug 2024 | ~20 points | Practically identical for most users |
| Oct 2024 | ~0 points | Parity |

I want to be careful here. Linear extrapolation of non-linear processes is a classic mistake. The last 50 points might be much harder to close than the first 150. Closed source models are also improving, so the target is moving.

But the trend direction is unambiguous. Open source is catching up, and the rate of convergence hasn't slowed yet.
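To make the extrapolation and its main caveat concrete, here is a small sketch of the linear projection from the May 2024 gap. The 5-points-per-month case is a hypothetical slowdown I picked for illustration, not something the data shows.

```python
def projected_gap(gap: float, rate_per_month: float, months: int) -> float:
    """Gap remaining after `months` of linear convergence, floored at zero."""
    return max(0.0, gap - rate_per_month * months)


def months_to_parity(gap: float, rate_per_month: float) -> float:
    """Months until a linearly shrinking gap reaches zero."""
    if rate_per_month <= 0:
        raise ValueError("the gap never closes at a non-positive rate")
    return gap / rate_per_month


current_gap = 51  # May 2024, from the timeline table

# Linear projection at ~10 points/month (months counted from May 2024).
for months, label in [(1, "Jun 2024"), (3, "Aug 2024"), (5, "Oct 2024")]:
    print(f"{label}: ~{projected_gap(current_gap, 10, months):.0f} points")

# Sensitivity check: an assumed slowdown to 5 points/month (hypothetical).
print(f"at 10/month: {months_to_parity(current_gap, 10):.1f} months to parity")
print(f"at  5/month: {months_to_parity(current_gap, 5):.1f} months to parity")
```

If the closing rate halves, the time to parity doubles, which is why the projected date is the least reliable number in this post.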

What's driving the convergence

Three factors, each contributing roughly equally:

1. Better open source base models. Llama 3 70B is dramatically better than Llama 2 70B (which was better than LLaMA 65B). Each generation closes 30-50 Elo points.

2. Better fine-tuning. The community has gotten extremely good at instruction-tuning and RLHF on open models. A well-tuned Llama 3 70B is much better than the raw base model.

3. Closed source ceiling effects. The top closed source Elo has stayed in a narrow band (~1255-1285) for over a year. The frontier models keep getting better, but the improvements are smaller. GPT-4o gained about 25 Elo over GPT-4 Turbo. Meanwhile, open source jumped 66 points (Mixtral to Llama 3 70B).

| Time period | Closed source Elo change | Open source Elo change |
|-------------|--------------------------|------------------------|
| Jan-Jul 2023 | -5 (GPT-4 flat) | +65 (Vicuna to Llama 2) |
| Jul 2023-Jan 2024 | +5 (GPT-4 to GPT-4 Turbo) | +53 (Llama 2 to Mixtral) |
| Jan-May 2024 | +25 (GPT-4 Turbo to GPT-4o) | +66 (Mixtral to Llama 3 70B) |

Source: LMSYS data, my tracking.

Open source gained 184 Elo points total (1050 to 1234). Closed source gained 25 (1260 to 1285). That's a roughly 7.4x faster improvement rate for open source. The convergence isn't because closed source stopped improving. It's because open source is improving much faster.

The economic implications

If you accept that the quality gap is now small enough to be irrelevant for most use cases, the cost comparison becomes the whole story:

| Model | Elo (approx) | $/M output tokens | Elo per dollar |
|-------|--------------|-------------------|----------------|
| GPT-4o | ~1285 | $15.00 | 85.7 |
| Claude 3 Opus | ~1270 | $75.00 | 16.9 |
| Claude 3 Sonnet | ~1225 | $15.00 | 81.7 |
| Llama 3 70B (hosted) | ~1234 | $0.90 | 1,371.1 |
| Llama 3 70B (self-hosted) | ~1234 | ~$0.40 | 3,085.0 |
| Mistral Large | ~1218 | $24.00 | 50.8 |

Source: LMSYS leaderboard, provider pricing, May 2024.

The Elo-per-dollar of self-hosted Llama 3 70B is 36x better than GPT-4o. Even hosted through Together AI at $0.90/M tokens, it's 16x better than GPT-4o.
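The ratios are simple division, and a short sketch makes them easy to rerun when prices change. The figures are the approximate Elo and output-token prices from the table. The ~$0.40 self-hosted cost is my estimate and will vary with hardware and utilization.

```python
# (approximate Elo, $ per million output tokens), copied from the table above.
models = {
    "GPT-4o": (1285, 15.00),
    "Claude 3 Opus": (1270, 75.00),
    "Claude 3 Sonnet": (1225, 15.00),
    "Llama 3 70B (hosted)": (1234, 0.90),
    "Llama 3 70B (self-hosted)": (1234, 0.40),  # rough estimate, not a list price
    "Mistral Large": (1218, 24.00),
}


def elo_per_dollar(elo: float, price_per_million: float) -> float:
    """Elo points per dollar of output-token spend (per million tokens)."""
    return elo / price_per_million


baseline = elo_per_dollar(*models["GPT-4o"])

for name, (elo, price) in models.items():
    value = elo_per_dollar(elo, price)
    print(f"{name:<28}{value:>9.1f}{value / baseline:>7.1f}x vs GPT-4o")
```

Swapping in your own provider's pricing only requires editing the `models` dictionary.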

When the quality is within 51 Elo points (57% win rate for the more expensive model) and the price is roughly 17-38x higher, the economic argument for closed source APIs narrows to: convenience, support, and that last 7-point edge in win rate.

My prediction

I predicted in December 2023 that an open source model would match GPT-4 on LMSYS by the end of 2024. Based on the current trajectory, I think it'll happen around October 2024, two to three months earlier than I predicted.

The Llama 3 405B model is rumored to be in training. If it performs as expected from scaling (matching or exceeding GPT-4o quality), the Elo gap will close to zero.

The question then becomes: what do closed source models offer that justifies their price premium when an open model of equal quality exists? I don't have a good answer yet. But I suspect the answer involves features (tool use, multimodal, system prompt control) rather than raw quality.

My chart will keep updating. The line is still converging. The data doesn't care about business models.


-- dataku
