Model Comparisons | February 26, 2024 | 6 min read

Mistral Large vs GPT-4 vs Claude 3 Opus: the three-way benchmark

Mistral finally has a frontier model. I ran all three through my standard 300-prompt evaluation. Mistral Large is competitive but not quite there. The interesting part is where it wins: European languages.

Mistral AI just dropped their first real frontier model. Not a scrappy 7B underdog. Not a clever MoE trick. A proper, big, closed-source model positioned against GPT-4 and Claude 3 Opus.

This is the fight I've been waiting to benchmark.

The contenders

| | Mistral Large | GPT-4 Turbo | Claude 3 Opus |
|---|------|------------|---------|
| Provider | Mistral AI | OpenAI | Anthropic |
| Release | Feb 2024 | Nov 2023 | Mar 2024 |
| Parameters | Unknown | Unknown | Unknown |
| Context window | 32K | 128K | 200K |
| Input price ($/M tokens) | $8.00 | $10.00 | $15.00 |
| Output price ($/M tokens) | $24.00 | $30.00 | $75.00 |

Sources: Official pricing pages for each provider, February/March 2024.

Three frontier models from three different continents (Europe, US West Coast, US West Coast again). All closed-source. All priced in the $8-75 per million tokens range.

My standard 300-prompt evaluation

I've been running the same evaluation since mid-2023. 300 prompts across 6 categories, 50 per category. Each response is rated 1-5 by me (blind to which model produced it). Yes, it takes forever. Yes, I do it anyway.
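The post doesn't say what tooling sits behind the blind rating, so the following is only a minimal sketch of one way blinding could be set up, not the actual harness. The function names `blind` and `unblind`, the "Response N" labels, and the placeholder answers are all hypothetical.

```python
import random

def blind(responses, seed=None):
    """Hide model names behind neutral labels in a shuffled order.

    `responses` maps model name -> response text. Returns `shown`
    (label -> text, safe to display to the rater) and `key`
    (label -> model name, kept hidden until rating is done).
    """
    rng = random.Random(seed)
    models = list(responses)
    rng.shuffle(models)
    labels = [f"Response {i + 1}" for i in range(len(models))]
    key = dict(zip(labels, models))
    shown = {label: responses[model] for label, model in key.items()}
    return shown, key

def unblind(ratings, key):
    """Map ratings given per label back to the model that produced them."""
    return {key[label]: rating for label, rating in ratings.items()}

# Placeholder texts, not real model output.
responses = {
    "Mistral Large": "placeholder answer 1",
    "GPT-4 Turbo": "placeholder answer 2",
    "Claude 3 Opus": "placeholder answer 3",
}
shown, key = blind(responses, seed=42)
ratings = {label: 3 for label in shown}  # stand-in for the 1-5 ratings
by_model = unblind(ratings, key)
print(by_model)
```

The point of keeping `key` separate is that the rater only ever sees `shown`, so scores can't drift toward a favorite model.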

| Category | Mistral Large | GPT-4 Turbo | Claude 3 Opus | Winner |
|----------|--------------|-------------|---------------|--------|
| Factual Q&A (50) | 3.72 | 4.12 | 4.08 | GPT-4 Turbo |
| Code generation (50) | 3.61 | 4.24 | 4.18 | GPT-4 Turbo |
| Creative writing (50) | 3.84 | 3.92 | 4.31 | Claude 3 Opus |
| Summarization (50) | 3.89 | 4.06 | 4.21 | Claude 3 Opus |
| Reasoning/logic (50) | 3.54 | 4.15 | 4.09 | GPT-4 Turbo |
| Instruction following (50) | 3.91 | 4.18 | 4.26 | Claude 3 Opus |
| Overall average | 3.75 | 4.11 | 4.19 | Claude 3 Opus |

Source: My evaluation, 300 prompts, blind rating, February-March 2024.

Claude 3 Opus edges out GPT-4 Turbo at 4.19 vs 4.11. But look at Mistral Large: 3.75 overall. That's competitive, but there's a clear gap. About 0.36 points behind GPT-4 Turbo and 0.44 behind Claude 3 Opus.
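For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the overall averages and the gaps from the six category scores in the table. It assumes the overall score is the unweighted mean of the category means, which holds because every category has 50 prompts.

```python
from statistics import mean

# Category scores in table order: factual Q&A, code, creative writing,
# summarization, reasoning/logic, instruction following.
scores = {
    "Mistral Large": [3.72, 3.61, 3.84, 3.89, 3.54, 3.91],
    "GPT-4 Turbo":   [4.12, 4.24, 3.92, 4.06, 4.15, 4.18],
    "Claude 3 Opus": [4.08, 4.18, 4.31, 4.21, 4.09, 4.26],
}

# Equal-sized categories, so a plain mean of the category means is fine.
overall = {model: round(mean(vals), 2) for model, vals in scores.items()}

# Gap between Mistral Large and each competitor, on the rounded averages.
gap_to_gpt4 = round(overall["GPT-4 Turbo"] - overall["Mistral Large"], 2)
gap_to_opus = round(overall["Claude 3 Opus"] - overall["Mistral Large"], 2)

print(overall)
print(gap_to_gpt4, gap_to_opus)
```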

For a European startup competing against companies with 10-100x their resources, 3.75 is impressive. But it's not at parity. Not yet.

Where Mistral Large actually wins

Here's where it gets interesting. I added a seventh category that I don't normally test: European languages. 50 prompts each in French, German, and Spanish.

| Language task | Mistral Large | GPT-4 Turbo | Claude 3 Opus |
|-------------|--------------|-------------|---------------|
| French comprehension | 4.28 | 4.02 | 3.94 |
| French generation | 4.34 | 3.88 | 3.91 |
| German comprehension | 4.12 | 4.06 | 3.82 |
| German generation | 4.08 | 3.92 | 3.78 |
| Spanish comprehension | 4.18 | 4.14 | 4.02 |
| Spanish generation | 4.22 | 3.96 | 3.88 |
| European avg | 4.20 | 4.00 | 3.89 |

Source: My evaluation, 50 prompts per language, blind rating.

Mistral Large is the best model for European languages by a meaningful margin. 4.20 vs GPT-4 Turbo's 4.00 vs Claude 3 Opus's 3.89. On French generation specifically, the gap is even wider: 4.34 vs 3.88 for GPT-4 Turbo and 3.91 for Claude 3 Opus.
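The averages hide how uneven the lead is across tasks. As a rough sketch using only the numbers in the table, this computes Mistral Large's margin over the better of the two competitors on each task, plus the European average per model. The variable names are mine.

```python
from statistics import mean

tasks = [
    "French comprehension", "French generation",
    "German comprehension", "German generation",
    "Spanish comprehension", "Spanish generation",
]
mistral = [4.28, 4.34, 4.12, 4.08, 4.18, 4.22]
gpt4    = [4.02, 3.88, 4.06, 3.92, 4.14, 3.96]
opus    = [3.94, 3.91, 3.82, 3.78, 4.02, 3.88]

# Margin of Mistral Large over the best competitor on each task.
margins = {
    task: round(m - max(g, o), 2)
    for task, m, g, o in zip(tasks, mistral, gpt4, opus)
}

european_avg = {
    "Mistral Large": round(mean(mistral), 2),
    "GPT-4 Turbo": round(mean(gpt4), 2),
    "Claude 3 Opus": round(mean(opus), 2),
}

for task, margin in margins.items():
    print(f"{task}: +{margin:.2f}")
print(european_avg)
```

On these numbers the lead is largest on generation tasks and thin on German and Spanish comprehension, where GPT-4 Turbo is within a few hundredths of a point.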

This makes sense. Mistral AI is a French company. Their training data almost certainly emphasizes European languages more heavily than the US-based competitors.

For European businesses, this is a real differentiator. If 40% of your users interact in French, German, or Spanish, Mistral Large isn't a compromise for those users. It's the best choice for them.

The pricing picture

| Model | $/M output tokens | My overall score | Score per dollar |
|-------|-------------------|-----------------|-----------------|
| Mistral Large | $24.00 | 3.75 | 0.156 |
| GPT-4 Turbo | $30.00 | 4.11 | 0.137 |
| Claude 3 Opus | $75.00 | 4.19 | 0.056 |
| Mixtral 8x7B | $0.60 | 3.28* | 5.467 |

Sources: Official pricing, my evaluation scores. *Mixtral 8x7B score from my December 2023 evaluation using same methodology.

On pure score-per-dollar, Mistral Large actually beats GPT-4 Turbo: 0.156 vs 0.137. You get slightly less quality per prompt, but more quality per dollar.

Claude 3 Opus is the most expensive by far, and on a value basis, it's 2.8x worse than Mistral Large. The only reason to pick Opus is if you need the absolute best output quality and cost is secondary.

And Mixtral 8x7B remains the value king at 5.467 score per dollar. But the quality gap to the frontier models is substantial (3.28 vs 4.19).
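The "score per dollar" column is just the overall score divided by the output price per million tokens. A minimal sketch of that calculation, using the prices and scores from the table (input-token pricing is ignored here, as it is in the table):

```python
# (output price in $ per million tokens, overall score from the evaluation)
models = {
    "Mistral Large": (24.00, 3.75),
    "GPT-4 Turbo":   (30.00, 4.11),
    "Claude 3 Opus": (75.00, 4.19),
    "Mixtral 8x7B":  (0.60, 3.28),
}

# Score per dollar of output tokens, rounded the way the table shows it.
score_per_dollar = {
    name: round(score / price, 3) for name, (price, score) in models.items()
}

# How many times better Mistral Large's value is than Claude 3 Opus's.
mistral_vs_opus = round(
    score_per_dollar["Mistral Large"] / score_per_dollar["Claude 3 Opus"], 1
)

for name, value in sorted(score_per_dollar.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.3f}")
print(mistral_vs_opus)
```

Keep in mind the metric is linear in price, so a very cheap model dominates it almost regardless of quality. It is a rough value signal, not a ranking.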

The LMSYS Chatbot Arena data

My evaluation is just one person's ratings. Let's check the crowd:

| Model | LMSYS Elo (Feb 2024) | Rank |
|-------|------|------|
| GPT-4 Turbo | 1256 | #1 |
| Claude 3 Opus | 1249 | #2 |
| Mistral Large | 1218 | #5 |
| Gemini Pro | 1208 | #7 |
| Mixtral 8x7B | 1162 | #12 |

Source: LMSYS Chatbot Arena, February 2024. Elo ratings based on blind user comparisons.

The LMSYS data tells a similar story. Mistral Large at #5 is respectable but clearly behind the top 2. The 38-point Elo gap between Mistral Large (1218) and GPT-4 Turbo (1256) is significant. Under the standard Elo formula, it means GPT-4 Turbo would be expected to win a head-to-head comparison about 55% of the time.
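The 55% figure comes from the standard Elo expected-score formula (base 10, scale 400). Here is a small sketch of that conversion. It assumes the Arena ratings are on the usual Elo scale and treats the expected score as a win rate, which glosses over ties.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Ratings from the table above.
gpt4_turbo = 1256
mistral_large = 1218

p_gpt4_wins = elo_expected_score(gpt4_turbo, mistral_large)
print(f"GPT-4 Turbo vs Mistral Large: {p_gpt4_wins:.1%}")
```

Equal ratings give exactly 50%, and every 400 points of difference multiplies the odds by 10, so a 38-point gap is a tilt rather than a blowout.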

My take

Mistral AI built a frontier model in under a year (the company was founded in April 2023) with a team of roughly 60 people. That is genuinely remarkable.

But "remarkable for a small team" and "best model" are different things. Mistral Large is a solid #5 globally. For European languages, it's arguably #1. For everything else, GPT-4 Turbo and Claude 3 Opus are better.

The path for Mistral is clear: keep improving the base model while maintaining the European language edge. If they can close the 0.36-point gap on my evaluation (from 3.75 to 4.11) while keeping the European language lead, they have a genuinely differentiated product.

I'm rooting for them. A three-way frontier model race is better for everyone than a two-way one.


-- dataku
