Benchmark Analysis · March 24, 2026 · 5 min read

My monthly benchmark dashboard: March 2026 update

Monthly tracker updated. Claude Opus 4.5 still leads coding and narrowly tops math. Gemini 2.5 Ultra leads multimodal. Gemini 2.5 Flash leads cost-efficiency. New benchmark added: GPQA Diamond (graduate-level science questions). Full table inside.

Here's where the top 12 models stand across 9 benchmarks as of March 2026.

The full dashboard

| Model | MMLU | MMLU-Pro | HumanEval | MATH | GPQA Diamond | SWE-bench V | Arena Elo | LiveCode | ChartQA |
|-------|------|---------|-----------|------|-------------|-------------|-----------|----------|---------|
| Claude Opus 4.5 | 92.4 | 80.8 | 98.2 | 98.4 | 79.8 | 64.2 | 1298 | 78.6 | 89.4 |
| Gemini 2.5 Ultra | 92.1 | 81.4 | 96.4 | 98.2 | 80.2 | 52.8 | 1294 | 71.2 | 94.2 |
| Claude 4 Sonnet | 90.2 | 76.8 | 95.4 | 89.8 | 70.2 | 51.8 | 1278 | 68.4 | 86.2 |
| DeepSeek R2 | 91.2 | 78.4 | 95.8 | 98.1 | 78.4 | 54.6 | 1272 | 72.4 | N/A |
| o3 | 91.8 | 80.2 | 95.2 | 97.4 | 76.8 | 52.4 | 1266 | 69.8 | 84.8 |
| GPT-4o (Mar) | 89.8 | 74.2 | 92.4 | 79.8 | 58.2 | 38.4 | 1272 | 60.2 | 83.8 |
| Grok 3 | 89.6 | 72.8 | 90.8 | 83.4 | 62.4 | 41.2 | 1268 | 54.8 | 80.2 |
| Qwen3 235B | 88.8 | 74.6 | 89.4 | 82.8 | 64.2 | 40.1 | 1260 | 53.2 | 78.4 |
| Llama 4 Maverick | 86.2 | 70.4 | 84.8 | 79.2 | 62.8 | 38.8 | 1252 | 50.4 | 74.6 |
| DeepSeek V4 | 88.4 | 72.2 | 86.2 | 68.4 | 62.1 | 44.8 | 1262 | 46.8 | N/A |
| o4-mini | 87.4 | 71.8 | 88.6 | 84.2 | 61.4 | 36.2 | 1248 | 52.6 | 78.2 |
| Gemini 2.5 Flash | 86.4 | 68.2 | 85.8 | 80.4 | 56.8 | 30.8 | 1240 | 44.8 | 82.4 |

Sources: LMSYS Chatbot Arena, Anthropic, OpenAI, Google, DeepSeek, Artificial Analysis, benchmark papers and leaderboards.

Category leaders (March 2026)

| Category | Leader | Score |
|----------|--------|-------|
| General knowledge (MMLU) | Claude Opus 4.5 | 92.4% |
| Hard knowledge (MMLU-Pro) | Gemini 2.5 Ultra | 81.4% |
| Code generation (HumanEval) | Claude Opus 4.5 | 98.2% |
| Math (MATH 500) | Claude Opus 4.5 | 98.4% (narrow edge over Gemini 2.5 Ultra's 98.2% and R2's 98.1%) |
| Science (GPQA Diamond) | Gemini 2.5 Ultra | 80.2% |
| Bug fixing (SWE-bench V) | Claude Opus 4.5 | 64.2% |
| Human preference (Arena) | Claude Opus 4.5 | 1298 |
| Real coding (LiveCodeBench) | Claude Opus 4.5 | 78.6% |
| Vision (ChartQA) | Gemini 2.5 Ultra | 94.2% |

Claude Opus 4.5 leads 6 of 9 categories. Gemini 2.5 Ultra leads the other 3 (MMLU-Pro, GPQA, ChartQA).
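The leader counts can be re-derived from the dashboard by taking the maximum of each column. Here is a minimal Python sketch, assuming the scores are typed in by hand from the table above. `BENCHMARKS`, `SCORES`, and `category_leaders` are names made up for this example, and N/A entries are stored as `None` and skipped.

```python
from collections import Counter

# Column order matches the dashboard table; None marks an N/A entry.
BENCHMARKS = ["MMLU", "MMLU-Pro", "HumanEval", "MATH", "GPQA Diamond",
              "SWE-bench V", "Arena Elo", "LiveCode", "ChartQA"]

SCORES = {
    "Claude Opus 4.5":  [92.4, 80.8, 98.2, 98.4, 79.8, 64.2, 1298, 78.6, 89.4],
    "Gemini 2.5 Ultra": [92.1, 81.4, 96.4, 98.2, 80.2, 52.8, 1294, 71.2, 94.2],
    "Claude 4 Sonnet":  [90.2, 76.8, 95.4, 89.8, 70.2, 51.8, 1278, 68.4, 86.2],
    "DeepSeek R2":      [91.2, 78.4, 95.8, 98.1, 78.4, 54.6, 1272, 72.4, None],
    "o3":               [91.8, 80.2, 95.2, 97.4, 76.8, 52.4, 1266, 69.8, 84.8],
    "GPT-4o (Mar)":     [89.8, 74.2, 92.4, 79.8, 58.2, 38.4, 1272, 60.2, 83.8],
    "Grok 3":           [89.6, 72.8, 90.8, 83.4, 62.4, 41.2, 1268, 54.8, 80.2],
    "Qwen3 235B":       [88.8, 74.6, 89.4, 82.8, 64.2, 40.1, 1260, 53.2, 78.4],
    "Llama 4 Maverick": [86.2, 70.4, 84.8, 79.2, 62.8, 38.8, 1252, 50.4, 74.6],
    "DeepSeek V4":      [88.4, 72.2, 86.2, 68.4, 62.1, 44.8, 1262, 46.8, None],
    "o4-mini":          [87.4, 71.8, 88.6, 84.2, 61.4, 36.2, 1248, 52.6, 78.2],
    "Gemini 2.5 Flash": [86.4, 68.2, 85.8, 80.4, 56.8, 30.8, 1240, 44.8, 82.4],
}


def category_leaders(scores, benchmarks):
    """Return {benchmark: (model, score)} for the top score in each column."""
    leaders = {}
    for i, bench in enumerate(benchmarks):
        # Skip models with no score on this benchmark.
        candidates = [(model, row[i]) for model, row in scores.items()
                      if row[i] is not None]
        # On a tie, max() keeps the first model listed.
        leaders[bench] = max(candidates, key=lambda pair: pair[1])
    return leaders


leaders = category_leaders(SCORES, BENCHMARKS)
wins = Counter(model for model, _ in leaders.values())

for bench, (model, score) in leaders.items():
    print(f"{bench}: {model} ({score})")
print(dict(wins))
```

Run against the figures above, this should reproduce the 6-to-3 split between Claude Opus 4.5 and Gemini 2.5 Ultra.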

Movers this month

| Model | Change | Notes |
|-------|--------|-------|
| Gemini 2.5 Ultra | New entry | Took 3 category crowns |
| GPT-4o (Mar) | +2 Arena Elo | Minor update |
| DeepSeek V4 | New entry | Base model for R2 |

Cost-efficiency rankings

| Model | Avg quality | Cost per query | Cost per quality point |
|-------|-----------|----------------|----------------------|
| Gemini 2.5 Flash | 72.4% | $0.00012 | $0.00017 |
| DeepSeek R2 | 84.6% | $0.0008 | $0.00095 |
| o4-mini | 76.2% | $0.0024 | $0.0031 |
| GPT-4o | 78.4% | $0.0058 | $0.0074 |
| Claude 4 Sonnet | 82.4% | $0.0088 | $0.011 |
| Gemini 2.5 Ultra | 88.4% | $0.018 | $0.020 |
| Claude Opus 4.5 | 90.2% | $0.068 | $0.075 |

Cost-efficiency champion: Gemini 2.5 Flash at $0.00017 per quality point. That's 441x more efficient than Claude Opus 4.5.

But if you need 90%+ average quality, Claude Opus 4.5 is your only option. Compared with Gemini 2.5 Flash, you pay 441x more per quality point (about 567x more per query) for roughly 18 more quality points. That's the price of excellence.
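For anyone who wants to rerun the cost numbers, here is a minimal Python sketch. It assumes the "cost per quality point" column is cost per query divided by average quality expressed as a fraction (72.4% becomes 0.724), which is what the published figures are consistent with. `COSTS` and `cost_per_quality_point` are names made up for this example.

```python
# (avg quality %, cost per query in USD), copied from the cost table.
COSTS = {
    "Gemini 2.5 Flash": (72.4, 0.00012),
    "DeepSeek R2":      (84.6, 0.0008),
    "o4-mini":          (76.2, 0.0024),
    "Claude 4 Sonnet":  (82.4, 0.0088),
    "GPT-4o":           (78.4, 0.0058),
    "Claude Opus 4.5":  (90.2, 0.068),
    "Gemini 2.5 Ultra": (88.4, 0.018),
}


def cost_per_quality_point(quality_pct, cost_per_query):
    # Assumption: quality is used as a fraction, so 72.4% -> 0.724.
    return cost_per_query / (quality_pct / 100)


# Cheapest per quality point first.
ranking = sorted(COSTS, key=lambda m: cost_per_quality_point(*COSTS[m]))

flash = cost_per_quality_point(*COSTS["Gemini 2.5 Flash"])
opus = cost_per_quality_point(*COSTS["Claude Opus 4.5"])

# Ratio of cost per quality point, computed from unrounded values.
efficiency_ratio = opus / flash
# Ratio of raw cost per query.
query_ratio = COSTS["Claude Opus 4.5"][1] / COSTS["Gemini 2.5 Flash"][1]

for model in ranking:
    print(f"{model}: ${cost_per_quality_point(*COSTS[model]):.5f}")
print(f"per-quality-point ratio: {efficiency_ratio:.0f}x")
print(f"per-query ratio: {query_ratio:.0f}x")
```

The 441x figure comes from dividing the rounded table values ($0.075 / $0.00017). From unrounded inputs the same ratio lands closer to 455x, so treat the multiplier as approximate.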

Six-month trend

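I expected to report that the frontier kept compressing. It didn't. The gap between #1 and #5 on Arena widened from 20 Elo in July 2025 to 32 Elo in March 2026.

The numbers: Opus 4.5 is at 1298 and o3, fifth in the dashboard, is at 1266. That's 32 points. One caveat: the dashboard isn't sorted by Elo. Ranked strictly by Arena score, fifth place is 1272 (DeepSeek R2 and GPT-4o (Mar) are tied there), which makes the gap 26 points. Either way, it's wider than July's 20.

The gap widened because Opus 4.5 pulled away from the pack. The frontier didn't converge over this period. It stratified into two tiers: Opus 4.5 and Gemini Ultra at the top (1298 and 1294), then Claude 4 Sonnet at 1278 and a cluster of four models (DeepSeek R2, GPT-4o, Grok 3, o3) within 6 Elo points of each other, from 1272 down to 1266.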
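The gap and the clustering are easy to recheck by sorting on Elo rather than reading down the table. Here is a rough Python sketch using the Arena Elo column from the dashboard. `ARENA_ELO`, `gap_to_rank`, and `largest_cluster` are hypothetical helper names, and the 6-point window is just the width used in the text.

```python
ARENA_ELO = {
    "Claude Opus 4.5": 1298, "Gemini 2.5 Ultra": 1294, "Claude 4 Sonnet": 1278,
    "DeepSeek R2": 1272, "o3": 1266, "GPT-4o (Mar)": 1272, "Grok 3": 1268,
    "Qwen3 235B": 1260, "Llama 4 Maverick": 1252, "DeepSeek V4": 1262,
    "o4-mini": 1248, "Gemini 2.5 Flash": 1240,
}


def gap_to_rank(elo, rank):
    """Elo gap between the top-rated model and the model at a 1-based rank."""
    ordered = sorted(elo.values(), reverse=True)
    return ordered[0] - ordered[rank - 1]


def largest_cluster(elo, width):
    """Largest group of models whose ratings span at most `width` Elo points."""
    ordered = sorted(elo.items(), key=lambda kv: kv[1], reverse=True)
    best = []
    for i, (_, top) in enumerate(ordered):
        # Everything from this model downward that stays inside the window.
        group = [name for name, rating in ordered[i:] if top - rating <= width]
        if len(group) > len(best):
            best = group
    return best


gap_rank5 = gap_to_rank(ARENA_ELO, 5)                        # strict Elo ranking
gap_o3 = ARENA_ELO["Claude Opus 4.5"] - ARENA_ELO["o3"]      # fifth row of the table
cluster = largest_cluster(ARENA_ELO, 6)

print(gap_rank5, gap_o3)
print(cluster)
```

Both gaps come out above July's 20, so the "wider, not narrower" conclusion holds whichever definition of fifth place you use.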

I'll be watching to see if the next generation of releases compresses the gap again or if two-tier stratification becomes the norm.

Next update in April.


-- dataku