Qwen3 and the Chinese model wave: benchmarking 5 models from China
Qwen3, DeepSeek V3, Yi-Lightning, Baichuan 4, and MiniMax-01. I benchmarked all five against Claude 3.7 Sonnet and GPT-4o. Chinese models now occupy 4 of the top 10 spots on Chatbot Arena. The geographic distribution of AI talent is shifting.
Labs in China released more frontier-class models in Q1 2025 than labs in any other country except the US.
I benchmarked five of them against the Western leaders. The results tell a story about where AI capability is heading geographically.
The five models
| Model | Organization | Parameters | Architecture | Open weight? |
|-------|-------------|-----------|-------------|-------------|
| Qwen3 235B | Alibaba/Qwen | 235B (22B active) | MoE | Yes |
| DeepSeek V3 | DeepSeek | 671B (37B active) | MoE | Yes |
| Yi-Lightning | 01.AI | Unknown | Dense | No |
| Baichuan 4 | Baichuan | Unknown | Unknown | No |
| MiniMax-01 | MiniMax | 456B (45.9B active) | MoE | Yes |
Benchmark comparison
| Benchmark | Qwen3 235B | DeepSeek V3 | Yi-Lightning | Baichuan 4 | MiniMax-01 | Claude 3.7 Sonnet | GPT-4o |
|-----------|-----------|-------------|-------------|-----------|-----------|-------------------|--------|
| MMLU | 88.4% | 87.1% | 85.8% | 82.3% | 86.2% | 89.4% | 88.7% |
| HumanEval | 88.2% | 82.6% | 79.4% | 74.1% | 80.8% | 95.2% | 90.2% |
| MATH | 81.3% | 61.6% | 58.2% | 52.4% | 63.7% | 78.3% | 76.6% |
| GPQA | 62.8% | 59.1% | 51.3% | 47.6% | 55.2% | 62.1% | 53.6% |
| SWE-bench | 38.4% | 42.0% | N/A | N/A | 32.1% | 52.4% | 33.2% |
| IFEval | 85.7% | 86.2% | 82.4% | 79.8% | 83.6% | 88.3% | 85.4% |
Sources: Model technical reports, LMSYS Chatbot Arena, Hugging Face, provider API testing.
Qwen3 is the strongest Chinese model on aggregate. On MMLU (88.4%) it's within 1 point of Claude 3.7 Sonnet. On GPQA (62.8%) it actually edges Claude (62.1%).
DeepSeek V3 leads on SWE-bench among the Chinese models (42.0%), and its IFEval score (86.2%) is competitive with the best.
Yi-Lightning and Baichuan 4 are a tier below. Not bad, but not frontier.
MiniMax-01 at 86.2% MMLU and 63.7% MATH is competitive. This lab gets less attention than DeepSeek or Qwen but the numbers are solid.
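The post does not say how "on aggregate" is computed, so here is one simple reading: an unweighted mean across the benchmarks where every model has a score. That choice (and dropping SWE-bench because of the N/A entries) is my assumption for this sketch, not a standard metric.

```python
from statistics import mean

# Scores copied from the benchmark table above (percent).
# SWE-bench is left out because two models have no reported score.
BENCHMARKS = ("MMLU", "HumanEval", "MATH", "GPQA", "IFEval")
SCORES = {
    "Qwen3 235B":        (88.4, 88.2, 81.3, 62.8, 85.7),
    "DeepSeek V3":       (87.1, 82.6, 61.6, 59.1, 86.2),
    "Yi-Lightning":      (85.8, 79.4, 58.2, 51.3, 82.4),
    "Baichuan 4":        (82.3, 74.1, 52.4, 47.6, 79.8),
    "MiniMax-01":        (86.2, 80.8, 63.7, 55.2, 83.6),
    "Claude 3.7 Sonnet": (89.4, 95.2, 78.3, 62.1, 88.3),
    "GPT-4o":            (88.7, 90.2, 76.6, 53.6, 85.4),
}


def aggregate(scores):
    """Return (model, mean score) pairs sorted from highest to lowest mean."""
    means = {model: mean(values) for model, values in scores.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


for model, avg in aggregate(SCORES):
    print(f"{model:18s} {avg:5.1f}")
```

An unweighted mean lets MATH and GPQA, where the spread between models is widest, dominate the ordering. A different weighting could reorder the middle of the table.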
CJK language performance
Where Chinese models really shine: non-English tasks.
| Test (Chinese) | Qwen3 | DeepSeek V3 | Claude 3.7 Sonnet | GPT-4o |
|----------------|-------|-------------|-------------------|--------|
| C-Eval (Chinese exam) | 92.3% | 90.1% | 82.4% | 83.7% |
| CMMLU | 91.8% | 89.4% | 80.1% | 81.9% |
| Chinese coding (custom) | 88% | 86% | 78% | 80% |

| Test (Japanese) | Qwen3 | DeepSeek V3 | Claude 3.7 Sonnet | GPT-4o |
|----------------|-------|-------------|-------------------|--------|
| JLPT N1 (custom) | 89% | 85% | 82% | 84% |
| Japanese QA | 86% | 84% | 79% | 81% |
Sources: C-Eval benchmark, CMMLU, my custom Chinese/Japanese test sets.
On Chinese-language tasks, Qwen3 and DeepSeek V3 beat both Claude and GPT-4o by roughly 6 to 12 percentage points. This isn't surprising (they're trained on more Chinese data), but the gap is larger than I expected.
For anyone building products for Chinese-speaking markets, these models are clearly the better choice. The Japanese results point the same way with smaller margins (1 to 7 points). I did not test Korean, so treat that market as an open question.
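The per-test gaps can be recomputed directly from the two tables above. This is a small sketch that pairs each Chinese model with each US model and prints the difference in percentage points. The dictionary layout and function name are mine.

```python
# Scores copied from the Chinese and Japanese tables above (percent).
RESULTS = {
    "C-Eval":                  {"Qwen3": 92.3, "DeepSeek V3": 90.1, "Claude 3.7 Sonnet": 82.4, "GPT-4o": 83.7},
    "CMMLU":                   {"Qwen3": 91.8, "DeepSeek V3": 89.4, "Claude 3.7 Sonnet": 80.1, "GPT-4o": 81.9},
    "Chinese coding (custom)": {"Qwen3": 88.0, "DeepSeek V3": 86.0, "Claude 3.7 Sonnet": 78.0, "GPT-4o": 80.0},
    "JLPT N1 (custom)":        {"Qwen3": 89.0, "DeepSeek V3": 85.0, "Claude 3.7 Sonnet": 82.0, "GPT-4o": 84.0},
    "Japanese QA":             {"Qwen3": 86.0, "DeepSeek V3": 84.0, "Claude 3.7 Sonnet": 79.0, "GPT-4o": 81.0},
}
CHINESE_MODELS = ("Qwen3", "DeepSeek V3")
US_MODELS = ("Claude 3.7 Sonnet", "GPT-4o")


def gaps(results):
    """Yield (test, chinese_model, us_model, gap) with gap in percentage points."""
    for test, row in results.items():
        for cn in CHINESE_MODELS:
            for us in US_MODELS:
                yield test, cn, us, round(row[cn] - row[us], 1)


for test, cn, us, gap in gaps(RESULTS):
    print(f"{test:24s} {cn:12s} vs {us:18s} {gap:+5.1f} pts")
```

Three of the five tests are custom sets, so the absolute gaps depend on how those sets were built. The direction of the gap is the more robust signal.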
Chatbot Arena positions
| Rank (top 10) | Model | Origin | Elo |
|------|-------|--------|-----|
| 1 | Claude Opus 4 | US (Anthropic) | 1288 |
| 2 | Gemini 2.5 Pro | US (Google) | 1282 |
| 3 | Claude 3.7 Sonnet | US (Anthropic) | 1278 |
| 4 | GPT-4o | US (OpenAI) | 1268 |
| 5 | Grok 3 | US (xAI) | 1264 |
| 6 | DeepSeek V3 | China | 1258 |
| 7 | Qwen3 235B | China | 1256 |
| 8 | DeepSeek R1 | China | 1255 |
| 9 | Llama 4 Maverick | US (Meta) | 1248 |
| 10 | MiniMax-01 | China | 1242 |
Sources: LMSYS Chatbot Arena, April 2025.
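Four Chinese models in the top 10, three of them from the five I benchmarked. A year ago, there were zero.
The geographic distribution is shifting. US labs still hold the top 5 spots, but the gap between #5 and #8 is only 9 Elo points. On four of the six benchmarks above (MATH, GPQA, SWE-bench, and IFEval), at least one Chinese model beats at least one of the two US models I tested. Only MMLU and HumanEval remain a clean US sweep.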
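To put a 9-point gap in perspective, here is a sketch using the standard Elo expected-score formula with the usual 400-point scale. I am assuming the Arena ratings can be read on that scale. The leaderboard's actual fitting procedure may differ, and this ignores ties.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the Arena table above.
grok_3, deepseek_r1 = 1264, 1255        # ranks 5 and 8
claude_opus_4, minimax_01 = 1288, 1242  # ranks 1 and 10

print(f"#5 vs #8:  {expected_score(grok_3, deepseek_r1):.3f}")
print(f"#1 vs #10: {expected_score(claude_opus_4, minimax_01):.3f}")
```

Under these assumptions the #5 model would be preferred over the #8 model in about 51% of head-to-head votes, which is close to a coin flip. Even #1 against #10 comes out near 57%.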
The cost advantage
| Model | Input (per M tokens) | Output (per M tokens) | Origin |
|-------|---------------|-----------------|--------|
| DeepSeek V3 | $0.27 | $1.10 | China |
| Qwen3 (via API) | $0.40 | $1.60 | China |
| MiniMax-01 | $0.30 | $1.20 | China |
| GPT-4o | $2.50 | $10.00 | US |
| Claude 3.7 Sonnet | $3.00 | $15.00 | US |

Chinese models are roughly 6-14x cheaper than their US counterparts, depending on the pairing: about 6-9x against GPT-4o and about 7.5-14x against Claude 3.7 Sonnet. The cost advantage comes from lower compute costs (H800s are cheaper than H100s in China) and more efficient architectures (MoE is standard practice).
For cost-sensitive applications that don't require peak English performance, the Chinese model market offers genuine value.
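Per-token ratios are easier to judge against a concrete bill. The sketch below prices a hypothetical monthly workload of 500M input tokens and 100M output tokens at the list prices from the table. The workload size is an assumption I picked for illustration, so swap in your own token counts.

```python
# Prices copied from the table above, in USD per million tokens: (input, output).
PRICES = {
    "DeepSeek V3":       (0.27, 1.10),
    "Qwen3 (via API)":   (0.40, 1.60),
    "MiniMax-01":        (0.30, 1.20),
    "GPT-4o":            (2.50, 10.00),
    "Claude 3.7 Sonnet": (3.00, 15.00),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for the given number of input and output tokens."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly workload: 500M input tokens, 100M output tokens.
IN_TOKENS, OUT_TOKENS = 500_000_000, 100_000_000

baseline = workload_cost("DeepSeek V3", IN_TOKENS, OUT_TOKENS)
for model in PRICES:
    cost = workload_cost(model, IN_TOKENS, OUT_TOKENS)
    print(f"{model:18s} ${cost:>9,.2f}  ({cost / baseline:4.1f}x DeepSeek V3)")
```

The effective ratio depends on your input/output mix, because output tokens are priced several times higher than input tokens for every model in the table.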
My observation
The narrative of "US leads AI, China follows" is increasingly inaccurate. On pure benchmarks, US models still hold the overall top spots. But on efficiency, cost, CJK languages, and rate of improvement, Chinese labs are either leading or competitive.
The AI world is becoming genuinely multi-polar. I track models from 12 countries now. Two years ago, it was basically US + UK (DeepMind).
My "country of origin" column in the model tracking spreadsheet used to be boring. It's not boring anymore.
-- dataku