Benchmark Analysis · September 29, 2025 · 5 min read

My monthly benchmark dashboard: September 2025 update

Monthly update to my running comparison of 15 models across 8 benchmarks. Big movers: Gemini 2.5 Pro gained 8 points on MMLU-Pro. Claude Opus 4 still leads on HumanEval. New entrant: Mistral Large 3.

Monthly dashboard time. Here's where 15 models stand on 8 benchmarks as of September 2025.

The full dashboard

| Model | MMLU | MMLU-Pro | HumanEval | MATH | GPQA | SWE-bench V | Chatbot Arena | LiveCodeBench |
|-------|------|----------|-----------|------|------|-------------|---------------|---------------|
| Claude Opus 4 | 91.2 | 78.4 | 97.1 | 96.8 | 76.3 | 58.7 | 1288 | 73.4 |
| Gemini 2.5 Pro | 89.0 | 80.2 | 95.1 | 97.1 | 74.1 | 41.3 | 1282 | 68.2 |
| Claude 3.7 Sonnet | 89.4 | 74.8 | 95.2 | 96.2 | 74.8 | 52.4 | 1278 | 70.1 |
| GPT-4o | 88.7 | 72.4 | 90.2 | 76.6 | 53.6 | 33.2 | 1271 | 55.8 |
| Grok 3 | 89.2 | 71.8 | 89.7 | 82.4 | 61.3 | 39.8 | 1268 | 52.1 |
| DeepSeek R1 | 90.8 | 76.2 | 92.6 | 97.3 | 71.5 | 49.2 | 1255 | 65.9 |
| Qwen3 235B | 88.4 | 73.1 | 88.2 | 81.3 | 62.8 | 38.4 | 1256 | 51.4 |
| Llama 4 Maverick | 85.5 | 68.4 | 82.3 | 77.9 | 61.8 | 37.1 | 1248 | 48.5 |
| Mistral Large 3 | 87.6 | 71.2 | 86.8 | 74.2 | 58.1 | 34.2 | 1238 | 46.8 |
| o3 | 91.4 | 79.8 | 94.8 | 97.0 | 75.1 | 51.8 | 1262 | 68.4 |
| o3-mini | 86.2 | 70.4 | 88.4 | 91.2 | 65.3 | 38.6 | 1245 | 54.2 |
| DeepSeek V3 | 87.1 | 68.8 | 82.6 | 61.6 | 59.1 | 42.0 | 1258 | 42.8 |
| GPT-4.5 | 91.8 | 78.1 | 93.1 | 81.6 | 65.0 | 42.8 | 1268 | 58.3 |
| Gemini 2.5 Flash | 85.8 | 66.2 | 84.1 | 78.4 | 54.2 | 28.4 | 1235 | 42.6 |
| Claude 4 Sonnet | 89.8 | 75.2 | 94.8 | 88.2 | 68.4 | 50.1 | 1275 | 66.8 |

Sources: LMSYS Chatbot Arena, Anthropic, OpenAI, Google, Mistral AI, Artificial Analysis, SWE-bench, my evaluations. Bold = new entry or significant change.

Movers this month

| Model | Benchmark | Change | Notes |
|-------|-----------|--------|-------|
| Gemini 2.5 Pro | MMLU-Pro | +8.0 | Major jump, now leads this benchmark |
| Mistral Large 3 | All | New entry | Mistral's flagship update |
| Claude 4 Sonnet | SWE-bench V | +2.3 | Quiet update improved coding |
| Grok 3 | Chatbot Arena | +4 | Climbing slowly |

Gemini 2.5 Pro's 8-point jump on MMLU-Pro is the biggest move this month. Google appears to have fine-tuned for this harder benchmark variant, and it paid off. MMLU-Pro is the metric I'm paying more attention to now that standard MMLU is saturated.

Mistral Large 3 debuts with solid but not spectacular numbers: 87.6% on MMLU and 86.8% on HumanEval. That puts it in the "good, not frontier" tier, competitive with Qwen3 235B and Llama 4 Maverick.
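For anyone who wants to reproduce a movers list from two monthly snapshots, here is a minimal sketch. The `current` values come from the dashboard; the `previous` values are an assumption, back-computed as current minus the reported change rather than taken from last month's published table. The function name `movers` is my own label for illustration.

```python
# Current scores are from the September dashboard.
# Previous scores are ASSUMED: back-computed as (current - reported change),
# not copied from a published August table.
current = {
    ("Gemini 2.5 Pro", "MMLU-Pro"): 80.2,
    ("Claude 4 Sonnet", "SWE-bench V"): 50.1,
    ("Grok 3", "Chatbot Arena"): 1268,
    ("Mistral Large 3", "MMLU"): 87.6,
}
previous = {
    ("Gemini 2.5 Pro", "MMLU-Pro"): 72.2,
    ("Claude 4 Sonnet", "SWE-bench V"): 47.8,
    ("Grok 3", "Chatbot Arena"): 1264,
    # Mistral Large 3 has no previous entry, so it counts as new.
}


def movers(current, previous):
    """Return (model, benchmark, change) rows; change is None for new entries."""
    rows = []
    for (model, bench), now in current.items():
        before = previous.get((model, bench))
        change = None if before is None else round(now - before, 1)
        rows.append((model, bench, change))
    # Largest absolute change first, new entries last.
    # Note: Arena Elo points and percentage points are different units,
    # so this ordering is only a rough guide.
    rows.sort(key=lambda r: (r[2] is None, -abs(r[2] or 0)))
    return rows


for model, bench, change in movers(current, previous):
    label = "New entry" if change is None else f"{change:+}"
    print(f"{model:16} {bench:14} {label}")
```

The sort mixes Elo points with percentage points, so in practice it makes sense to rank movers within each benchmark rather than across them.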

Category leaders

| Category | Leader | Score |
|----------|--------|-------|
| General knowledge (MMLU) | GPT-4.5 | 91.8% |
| Hard knowledge (MMLU-Pro) | Gemini 2.5 Pro | 80.2% |
| Code generation (HumanEval) | Claude Opus 4 | 97.1% |
| Math (MATH 500) | DeepSeek R1 | 97.3% |
| Science (GPQA Diamond) | Claude Opus 4 | 76.3% |
| Bug fixing (SWE-bench V) | Claude Opus 4 | 58.7% |
| Human preference (Arena) | Claude Opus 4 | 1288 |
| Real coding (LiveCodeBench) | Claude Opus 4 | 73.4% |

Claude Opus 4 leads 5 of 8 categories. But it's not the cheapest option in any of them. The "best vs most cost-efficient" tension continues.
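The leader table is just a per-column maximum over the dashboard, so it is easy to recheck. A minimal sketch, with scores copied from the full dashboard above; the names `SCORES`, `BENCHMARKS`, and `category_leaders` are my own for illustration.

```python
from collections import Counter

BENCHMARKS = ["MMLU", "MMLU-Pro", "HumanEval", "MATH", "GPQA",
              "SWE-bench V", "Chatbot Arena", "LiveCodeBench"]

# One row per model, in the same column order as BENCHMARKS.
SCORES = {
    "Claude Opus 4":     [91.2, 78.4, 97.1, 96.8, 76.3, 58.7, 1288, 73.4],
    "Gemini 2.5 Pro":    [89.0, 80.2, 95.1, 97.1, 74.1, 41.3, 1282, 68.2],
    "Claude 3.7 Sonnet": [89.4, 74.8, 95.2, 96.2, 74.8, 52.4, 1278, 70.1],
    "GPT-4o":            [88.7, 72.4, 90.2, 76.6, 53.6, 33.2, 1271, 55.8],
    "Grok 3":            [89.2, 71.8, 89.7, 82.4, 61.3, 39.8, 1268, 52.1],
    "DeepSeek R1":       [90.8, 76.2, 92.6, 97.3, 71.5, 49.2, 1255, 65.9],
    "Qwen3 235B":        [88.4, 73.1, 88.2, 81.3, 62.8, 38.4, 1256, 51.4],
    "Llama 4 Maverick":  [85.5, 68.4, 82.3, 77.9, 61.8, 37.1, 1248, 48.5],
    "Mistral Large 3":   [87.6, 71.2, 86.8, 74.2, 58.1, 34.2, 1238, 46.8],
    "o3":                [91.4, 79.8, 94.8, 97.0, 75.1, 51.8, 1262, 68.4],
    "o3-mini":           [86.2, 70.4, 88.4, 91.2, 65.3, 38.6, 1245, 54.2],
    "DeepSeek V3":       [87.1, 68.8, 82.6, 61.6, 59.1, 42.0, 1258, 42.8],
    "GPT-4.5":           [91.8, 78.1, 93.1, 81.6, 65.0, 42.8, 1268, 58.3],
    "Gemini 2.5 Flash":  [85.8, 66.2, 84.1, 78.4, 54.2, 28.4, 1235, 42.6],
    "Claude 4 Sonnet":   [89.8, 75.2, 94.8, 88.2, 68.4, 50.1, 1275, 66.8],
}


def category_leaders(scores, benchmarks):
    """Map each benchmark to (leading model, score).

    Ties go to whichever model appears first in `scores`.
    """
    leaders = {}
    for col, bench in enumerate(benchmarks):
        best = max(scores, key=lambda model: scores[model][col])
        leaders[bench] = (best, scores[best][col])
    return leaders


leaders = category_leaders(SCORES, BENCHMARKS)
wins = Counter(model for model, _ in leaders.values())

for bench, (model, score) in leaders.items():
    print(f"{bench:14} {model:16} {score}")
print(wins.most_common())
```

On the September numbers this gives Claude Opus 4 five wins, with GPT-4.5, Gemini 2.5 Pro, and DeepSeek R1 taking one each.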

Month-over-month trends

| Trend | Evidence |
|-------|----------|
| Coding benchmarks getting closer | Top 3 within 2.0 points on HumanEval (top 5 within 2.3) |
| MMLU is saturated | 4 models above 90%, differences are noise |
| Reasoning models dominate math | R1, o3, Opus 4 (thinking) all above 96.5% |
| SWE-bench spread is large | 58.7% to 28.4% between best and worst tested |
| Arena is the best overall signal | Correlates with real-world usage preferences |
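The numeric evidence in the first, second, and fourth rows can be recomputed from three dashboard columns. A rough sketch, with values copied from the full dashboard; the helper names `top_n_spread` and `count_above` are mine, not part of any library.

```python
# (MMLU, HumanEval, SWE-bench V) per model, copied from the dashboard.
ROWS = {
    "Claude Opus 4":     (91.2, 97.1, 58.7),
    "Gemini 2.5 Pro":    (89.0, 95.1, 41.3),
    "Claude 3.7 Sonnet": (89.4, 95.2, 52.4),
    "GPT-4o":            (88.7, 90.2, 33.2),
    "Grok 3":            (89.2, 89.7, 39.8),
    "DeepSeek R1":       (90.8, 92.6, 49.2),
    "Qwen3 235B":        (88.4, 88.2, 38.4),
    "Llama 4 Maverick":  (85.5, 82.3, 37.1),
    "Mistral Large 3":   (87.6, 86.8, 34.2),
    "o3":                (91.4, 94.8, 51.8),
    "o3-mini":           (86.2, 88.4, 38.6),
    "DeepSeek V3":       (87.1, 82.6, 42.0),
    "GPT-4.5":           (91.8, 93.1, 42.8),
    "Gemini 2.5 Flash":  (85.8, 84.1, 28.4),
    "Claude 4 Sonnet":   (89.8, 94.8, 50.1),
}

mmlu = [row[0] for row in ROWS.values()]
humaneval = [row[1] for row in ROWS.values()]
swe_bench = [row[2] for row in ROWS.values()]


def top_n_spread(values, n):
    """Gap between the best score and the n-th best, rounded to 0.1."""
    ranked = sorted(values, reverse=True)
    return round(ranked[0] - ranked[n - 1], 1)


def count_above(values, threshold):
    """Number of scores strictly above the threshold."""
    return sum(1 for v in values if v > threshold)


print("HumanEval top-3 spread:", top_n_spread(humaneval, 3))
print("HumanEval top-5 spread:", top_n_spread(humaneval, 5))
print("MMLU models above 90:", count_above(mmlu, 90))
print("SWE-bench V best-to-worst:", round(max(swe_bench) - min(swe_bench), 1))
```

On these numbers the top three HumanEval scores sit within 2.0 points and the top five within 2.3, four models clear 90% on MMLU, and the SWE-bench V gap between best and worst is 30.3 points.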

Next update in October. I expect Claude Opus 4.5 to appear by then, which could shuffle the top of every category.



-- dataku
