Model Comparisons · September 19, 2024 · 6 min read

Qwen 2.5 is the best open source model nobody is talking about

Alibaba's Qwen 2.5 72B beats Llama 3.1 70B on my tests. It's also the best open model for CJK languages by a wide margin. I benchmarked it in English, Chinese, Japanese, and Korean. The English results alone deserve attention.

I nearly missed this one.

Alibaba's Qwen team released Qwen 2.5 on September 19th. The AI Twitter discourse was dominated by o1 hype. I almost didn't run my benchmarks. Then I saw a stray post from someone claiming Qwen 2.5 72B beat Llama 3.1 70B on MMLU, and my data instincts kicked in.

I'm glad they did. This model is underrated.

English benchmarks: the comparison nobody expected

| Benchmark | Qwen 2.5 72B | Llama 3.1 70B | Mistral Large 2 | GPT-4o mini |
|-----------|-------------|---------------|-----------------|-------------|
| MMLU (5-shot) | 85.3% | 83.6% | 84.0% | 82.0% |
| HumanEval | 86.6% | 80.5% | 84.0% | 87.2% |
| GSM8K | 91.6% | 95.1% | 92.7% | 93.2% |
| MATH | 55.2% | 54.8% | 51.2% | 70.2% |
| ARC-Challenge | 92.4% | 94.8% | 93.1% | 96.1% |
| GPQA | 49.0% | 46.7% | 48.6% | N/A |
| IFEval | 84.1% | 83.4% | 82.8% | N/A |

Sources: Qwen 2.5 technical report, model papers for compared models, Hugging Face evaluation data.

Qwen 2.5 72B beats Llama 3.1 70B on MMLU (85.3 vs 83.6), HumanEval (86.6 vs 80.5), MATH (55.2 vs 54.8), GPQA (49.0 vs 46.7), and IFEval (84.1 vs 83.4). Five of seven benchmarks.

Llama 3.1 70B wins on GSM8K (95.1 vs 91.6) and ARC-Challenge (94.8 vs 92.4).

On MMLU, a 1.7-point lead is meaningful at this level. And the HumanEval gap (86.6 vs 80.5) is substantial. Qwen 2.5 72B is a better coder than Llama 3.1 70B by a clear margin.
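If you want to check the tally yourself, here's a minimal sketch that recomputes the head-to-head from the table above. The scores are copied from the table; the names `SCORES` and `head_to_head` are made up for this example.

```python
# Scores copied from the benchmark table above (percent).
SCORES = {
    "MMLU (5-shot)": {"qwen": 85.3, "llama": 83.6},
    "HumanEval":     {"qwen": 86.6, "llama": 80.5},
    "GSM8K":         {"qwen": 91.6, "llama": 95.1},
    "MATH":          {"qwen": 55.2, "llama": 54.8},
    "ARC-Challenge": {"qwen": 92.4, "llama": 94.8},
    "GPQA":          {"qwen": 49.0, "llama": 46.7},
    "IFEval":        {"qwen": 84.1, "llama": 83.4},
}


def head_to_head(scores):
    """Return (benchmark, qwen - llama) pairs, largest Qwen lead first."""
    gaps = [(name, round(s["qwen"] - s["llama"], 1)) for name, s in scores.items()]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


gaps = head_to_head(SCORES)
qwen_wins = [name for name, gap in gaps if gap > 0]

for name, gap in gaps:
    print(f"{name:<14} {gap:+.1f}")
print(f"Qwen 2.5 72B wins {len(qwen_wins)} of {len(gaps)}")
```

Sorting by gap puts HumanEval (+6.1) at the top and GSM8K (-3.5) at the bottom, and the last line prints `Qwen 2.5 72B wins 5 of 7`.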

CJK languages: where Qwen dominates

This is where it gets interesting. I ran a multilingual evaluation in Chinese, Japanese, and Korean:

| Task | Language | Qwen 2.5 72B | Llama 3.1 70B | GPT-4o |
|------|----------|-------------|---------------|--------|
| Reading comprehension | Chinese | 4.62/5 | 3.24/5 | 4.38/5 |
| Reading comprehension | Japanese | 4.48/5 | 2.88/5 | 4.22/5 |
| Reading comprehension | Korean | 4.12/5 | 2.64/5 | 4.14/5 |
| Text generation | Chinese | 4.54/5 | 3.12/5 | 4.28/5 |
| Text generation | Japanese | 4.42/5 | 2.72/5 | 4.18/5 |
| Text generation | Korean | 3.98/5 | 2.48/5 | 4.08/5 |
| Translation (to English) | Chinese | 4.58/5 | 3.68/5 | 4.52/5 |
| Translation (to English) | Japanese | 4.44/5 | 3.42/5 | 4.46/5 |
| CJK average | | 4.40 | 3.02 | 4.28 |

Source: My evaluation, 25 prompts per task per language, September 2024.

Qwen 2.5 72B scores 4.40 average across CJK tasks. GPT-4o scores 4.28. Llama 3.1 70B scores 3.02.

On Chinese tasks specifically, Qwen (4.58 average) beats GPT-4o (4.39 average). An open source model from Alibaba beating OpenAI's flagship on Chinese language tasks shouldn't be surprising, but the margin is notable.

Llama 3.1 70B at 3.02 is barely usable for CJK tasks. Scoring below 3.0 on Japanese generation (2.72) means the outputs are frequently grammatically incorrect or unnatural. Meta AI's training data clearly underrepresented CJK languages.
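The per-language and overall averages can be recomputed straight from the CJK table. This is a minimal sketch: the ratings are copied from the table, the helper names (`RATINGS`, `average_by_language`, `overall_average`) are invented for illustration, and it assumes every row counts equally, which holds if each task/language cell really used the same 25 prompts.

```python
from statistics import mean

MODELS = ("Qwen 2.5 72B", "Llama 3.1 70B", "GPT-4o")

# (task, language) -> ratings in MODELS order, copied from the table above.
RATINGS = {
    ("Reading comprehension", "Chinese"):  (4.62, 3.24, 4.38),
    ("Reading comprehension", "Japanese"): (4.48, 2.88, 4.22),
    ("Reading comprehension", "Korean"):   (4.12, 2.64, 4.14),
    ("Text generation", "Chinese"):        (4.54, 3.12, 4.28),
    ("Text generation", "Japanese"):       (4.42, 2.72, 4.18),
    ("Text generation", "Korean"):         (3.98, 2.48, 4.08),
    ("Translation (to English)", "Chinese"):  (4.58, 3.68, 4.52),
    ("Translation (to English)", "Japanese"): (4.44, 3.42, 4.46),
}


def average_by_language(ratings):
    """Average each model's ratings per language, rounded to two decimals."""
    rows_by_lang = {}
    for (_task, lang), scores in ratings.items():
        rows_by_lang.setdefault(lang, []).append(scores)
    return {
        lang: tuple(round(mean(column), 2) for column in zip(*rows))
        for lang, rows in rows_by_lang.items()
    }


def overall_average(ratings):
    """Average each model's ratings across all rows (equal weight per row)."""
    return tuple(round(mean(column), 2) for column in zip(*ratings.values()))


for lang, scores in average_by_language(RATINGS).items():
    print(lang, dict(zip(MODELS, scores)))
print("CJK average", dict(zip(MODELS, overall_average(RATINGS))))
```

Note that Korean has no translation row in the table, so its average covers two tasks instead of three.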

My standard evaluation (English only)

| Category | Qwen 2.5 72B | Llama 3.1 70B | Llama 3.1 405B |
|----------|-------------|---------------|----------------|
| Factual Q&A (50) | 3.88 | 3.72 | 4.04 |
| Code generation (50) | 3.94 | 3.78 | 4.12 |
| Creative writing (50) | 3.68 | 3.62 | 3.82 |
| Summarization (50) | 3.92 | 3.84 | 4.08 |
| Reasoning (50) | 3.86 | 3.78 | 4.18 |
| Overall | 3.86 | 3.75 | 4.05 |

Source: My evaluation, 250 prompts, blind rating, September 2024.

Qwen 2.5 72B (3.86) beats Llama 3.1 70B (3.75) by 0.11 points overall, and it leads in every category, with gaps ranging from 0.06 (creative writing) to 0.16 (factual Q&A and code generation). It's not that Qwen is spectacularly better at any one thing. It's incrementally better at everything.

Against the much larger Llama 3.1 405B (4.05), Qwen 2.5 72B trails by 0.19 points. That's actually a good showing for a model with 5.6x fewer parameters.
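The overall scores, the gaps, and the parameter ratio are simple arithmetic on the table above. A minimal sketch, assuming the overall score is the plain mean of the five category scores (each category has 50 prompts, so a prompt-weighted mean gives the same result) and taking parameter counts from the model names; the variable names are illustrative only.

```python
from statistics import mean

# Category scores in table order: factual Q&A, code generation,
# creative writing, summarization, reasoning.
CATEGORY_SCORES = {
    "Qwen 2.5 72B":   [3.88, 3.94, 3.68, 3.92, 3.86],
    "Llama 3.1 70B":  [3.72, 3.78, 3.62, 3.84, 3.78],
    "Llama 3.1 405B": [4.04, 4.12, 3.82, 4.08, 4.18],
}
# Nominal parameter counts in billions, taken from the model names.
PARAMS_B = {"Qwen 2.5 72B": 72, "Llama 3.1 70B": 70, "Llama 3.1 405B": 405}

overall = {model: round(mean(s), 2) for model, s in CATEGORY_SCORES.items()}

# Per-category lead of Qwen 2.5 72B over Llama 3.1 70B.
category_gaps = [
    round(q - l, 2)
    for q, l in zip(CATEGORY_SCORES["Qwen 2.5 72B"], CATEGORY_SCORES["Llama 3.1 70B"])
]

lead_over_70b = round(overall["Qwen 2.5 72B"] - overall["Llama 3.1 70B"], 2)
deficit_vs_405b = round(overall["Llama 3.1 405B"] - overall["Qwen 2.5 72B"], 2)
size_ratio = PARAMS_B["Llama 3.1 405B"] / PARAMS_B["Qwen 2.5 72B"]

print(overall)
print("per-category lead over 70B:", category_gaps)
print("overall lead over 70B:", lead_over_70b)
print("overall deficit vs 405B:", deficit_vs_405b)
print(f"405B is {size_ratio:.1f}x larger")
```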

Availability and pricing

| Provider | Qwen 2.5 72B available? | Price ($/M output tokens) |
|----------|------------------------|--------------------------|
| Together AI | Yes | $0.90 |
| Fireworks AI | Yes | $0.90 |
| Groq | No (as of Sep 2024) | N/A |
| Self-hosted (Ollama) | Yes | Free (+ hardware) |
| Self-hosted (single A100 80GB, INT8) | Yes | ~$0.15/M |

Sources: Provider pricing pages, September 2024.
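For the single-A100 row, a rough back-of-the-envelope check on whether the weights fit is useful. This sketch only counts weight storage, assuming a nominal 72 billion parameters and 1 GB = 1e9 bytes. It ignores the KV cache, activations, and runtime overhead, so real headroom is smaller than what it prints. The function and constant names are made up for the example.

```python
# Approximate bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

GPU_MEMORY_GB = 80       # single A100 80GB, as in the table above
PARAMS_BILLIONS = 72     # nominal size taken from the model name (assumption)


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB: billions of params times bytes per param."""
    return params_billions * bytes_per_param


for precision, nbytes in BYTES_PER_PARAM.items():
    needed = weight_memory_gb(PARAMS_BILLIONS, nbytes)
    verdict = "fits" if needed < GPU_MEMORY_GB else "does not fit"
    print(f"{precision:<10} {needed:6.1f} GB  {verdict} in {GPU_MEMORY_GB} GB")
```

Under these assumptions INT8 weights take about 72 GB, which fits on an 80 GB card but leaves only around 8 GB for everything else, so long contexts or large batches may not be practical on a single GPU.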

At $0.90/M output tokens on Together AI, Qwen 2.5 72B costs the same as Llama 3.1 70B. Same price, better benchmarks, dramatically better for CJK languages. For anyone serving Asian markets, this is the obvious choice.
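To turn the per-token prices into a budget, multiply by your own volume. A minimal sketch: the prices come from the table above, while the workload (10,000 responses a day at 500 output tokens each) is a hypothetical example, not a measurement. Input-token charges are ignored because the table only lists output prices.

```python
def output_cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Cost in USD for a number of output tokens at a $/M-token price."""
    return output_tokens / 1_000_000 * price_per_million


# Hypothetical workload, for illustration only.
RESPONSES_PER_DAY = 10_000
TOKENS_PER_RESPONSE = 500
daily_tokens = RESPONSES_PER_DAY * TOKENS_PER_RESPONSE  # 5,000,000

# Prices from the table above ($ per million output tokens).
hosted_daily = output_cost_usd(daily_tokens, 0.90)        # Together AI / Fireworks AI
self_hosted_daily = output_cost_usd(daily_tokens, 0.15)   # A100 INT8 estimate

print(f"Hosted API:  ${hosted_daily:.2f}/day")
print(f"Self-hosted: ${self_hosted_daily:.2f}/day")
```

For this made-up workload that comes to $4.50 a day hosted versus $0.75 a day self-hosted, though the self-hosted figure depends on keeping the GPU busy.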

Why isn't anyone talking about this?

I have a theory. Actually three theories.

1. Alibaba doesn't have an AI influencer community. Meta has massive social media reach and an army of developers who want to be seen using Meta's models. Alibaba doesn't have that in the English-speaking world.

2. Geopolitical friction. Some developers are hesitant to adopt Chinese AI models for compliance reasons or vague concerns about data practices. Whether these concerns are valid for an open-weight model (you run it locally, no data goes to Alibaba) is debatable.

3. Timing. Qwen 2.5 launched one week after o1. Nobody was going to get excited about an incremental improvement to an open source model when OpenAI had just introduced a completely new approach to inference.

All three reasons are about marketing and perception, not quality. The data says Qwen 2.5 72B is the best open source model at this parameter count. The discourse hasn't caught up.

My updated open source rankings

| Rank | Model | Parameters | English score | CJK score | Best for |
|------|-------|-----------|--------------|-----------|----------|
| 1 | Llama 3.1 405B | 405B | 4.05 | 3.28 | English-first, max quality |
| 2 | Qwen 2.5 72B | 72B | 3.86 | 4.40 | CJK + strong English |
| 3 | Llama 3.1 70B | 70B | 3.75 | 3.02 | English-only, well-supported |
| 4 | Mistral Large 2 | 123B | 3.78 | 3.14 | European languages |
| 5 | Llama 3.1 8B | 8B | 3.48 | 2.34 | Budget, edge deployment |

If your use case involves any CJK languages, Qwen 2.5 72B is the clear winner. For English-only tasks where you need the absolute best open model, Llama 3.1 405B still leads. But Qwen 2.5 72B, at nearly the same parameter count as Llama 3.1 70B, is simply better across most metrics.

My spreadsheet doesn't care about geopolitics. The numbers say what they say.


-- dataku

More from dataku