Benchmark Analysis · March 31, 2025 · 5 min read

I benchmarked 8 reasoning models on the same 100 math problems

o1, o3-mini, DeepSeek R1, Claude 3.7 Sonnet (thinking), Gemini 2.5 Pro, Grok 3, QwQ-32B, and Phi-4. Same 100 MATH problems. Same evaluation criteria. The spread is tighter than you'd expect from the marketing.

Eight reasoning models. One hundred math problems. Same evaluation criteria.

I've been wanting to do this comparison for months. The reasoning model space went from 1 player (o1) to 8+ in under six months. Everyone claims they match o1. I wanted the data.

The models tested

| Model | Provider | Active params | Type |
|-------|----------|---------------|------|
| o1 | OpenAI | Unknown | Closed, reasoning |
| o3-mini | OpenAI | Unknown | Closed, reasoning (small) |
| DeepSeek R1 | DeepSeek | 37B (MoE) | Open, reasoning |
| Claude 3.7 Sonnet (thinking) | Anthropic | Unknown | Closed, hybrid |
| Gemini 2.5 Pro (thinking) | Google | Unknown | Closed, hybrid |
| Grok 3 (thinking) | xAI | Unknown | Closed, reasoning |
| QwQ-32B | Alibaba/Qwen | 32B | Open, reasoning |
| Phi-4 (reasoning) | Microsoft Research | 14B | Open, small reasoning |

The test set

100 problems from the MATH benchmark, stratified by difficulty:

| Difficulty | Count | Description |
|------------|-------|-------------|
| Level 1-2 | 20 | High school algebra, basic geometry |
| Level 3 | 25 | Pre-calculus, intermediate algebra |
| Level 4 | 30 | Competition math, number theory |
| Level 5 | 25 | Hardest MATH problems, Olympiad-adjacent |

Results

| Model | Level 1-2 | Level 3 | Level 4 | Level 5 | Overall |
|-------|-----------|---------|---------|---------|---------|
| o1 | 100% | 96% | 90% | 76% | 90.0% |
| DeepSeek R1 | 100% | 100% | 93% | 80% | 93.0% |
| Claude 3.7 (thinking) | 100% | 96% | 90% | 72% | 89.0% |
| Gemini 2.5 Pro (thinking) | 100% | 96% | 87% | 72% | 88.0% |
| o3-mini | 100% | 92% | 83% | 64% | 84.0% |
| Grok 3 (thinking) | 100% | 92% | 80% | 60% | 82.0% |
| QwQ-32B | 95% | 88% | 77% | 52% | 77.0% |
| Phi-4 (reasoning) | 90% | 80% | 60% | 32% | 64.0% |

DeepSeek R1 takes the overall crown at 93%. o1 is second at 90%. Claude 3.7 Sonnet with thinking is close behind at 89%.

The separation happens at Level 5, the hardest problems. R1 solves 80% of them, o1 solves 76%, and Claude and Gemini each solve 72%. Everyone else is at 64% or below.

Phi-4 scoring 64% overall at 14B parameters is remarkable for its size. At Level 1-2, it hits 90%. It falls apart only on competition math: 60% at Level 4 and 32% at Level 5.
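The Overall column can be reproduced from the per-level columns. Here is a minimal sketch, assuming each reported percentage is a rounded solved/total ratio over the stratum sizes in the test set table. The names `LEVEL_COUNTS`, `RESULTS`, and `solved_count` are illustrative, not part of the original analysis.

```python
# Problems per difficulty stratum, copied from the test set table.
LEVEL_COUNTS = {"L1-2": 20, "L3": 25, "L4": 30, "L5": 25}

# Per-level accuracy in percent, copied from the results table.
RESULTS = {
    "o1":                        {"L1-2": 100, "L3": 96,  "L4": 90, "L5": 76},
    "DeepSeek R1":               {"L1-2": 100, "L3": 100, "L4": 93, "L5": 80},
    "Claude 3.7 (thinking)":     {"L1-2": 100, "L3": 96,  "L4": 90, "L5": 72},
    "Gemini 2.5 Pro (thinking)": {"L1-2": 100, "L3": 96,  "L4": 87, "L5": 72},
    "o3-mini":                   {"L1-2": 100, "L3": 92,  "L4": 83, "L5": 64},
    "Grok 3 (thinking)":         {"L1-2": 100, "L3": 92,  "L4": 80, "L5": 60},
    "QwQ-32B":                   {"L1-2": 95,  "L3": 88,  "L4": 77, "L5": 52},
    "Phi-4 (reasoning)":         {"L1-2": 90,  "L3": 80,  "L4": 60, "L5": 32},
}


def solved_count(pct_by_level):
    """Recover the number of solved problems from rounded percentages.

    Assumption: each percentage is round(100 * solved / count) for its level.
    """
    return sum(
        round(pct * LEVEL_COUNTS[level] / 100)
        for level, pct in pct_by_level.items()
    )


# With 100 problems in total, the solved count equals the overall percentage.
overall = {model: solved_count(levels) for model, levels in RESULTS.items()}

for model, solved in sorted(overall.items(), key=lambda kv: -kv[1]):
    print(f"{model:28s} {solved}/100")
```

Because the strata are unequal (Level 4 has 30 problems, Level 1-2 only 20), the overall score is a count-weighted average rather than a plain mean of the four columns.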

Cost per problem

| Model | Avg tokens per problem | Avg cost per problem | Cost per correct answer |
|-------|------------------------|----------------------|-------------------------|
| o1 | 11,200 | $0.84 | $0.93 |
| DeepSeek R1 | 7,800 | $0.019 | $0.020 |
| Claude 3.7 (thinking) | 8,400 | $0.14 | $0.16 |
| Gemini 2.5 Pro (thinking) | 9,100 | $0.11 | $0.13 |
| o3-mini | 5,600 | $0.07 | $0.083 |
| Grok 3 (thinking) | 10,200 | $0.17 | $0.21 |
| QwQ-32B | 6,200 | $0.009 | $0.012 |
| Phi-4 (reasoning) | 3,400 | ~$0.001 (self-hosted) | ~$0.002 |

Sources: Provider pricing pages, my token measurements.

DeepSeek R1 has the best accuracy (93%) at $0.019 per problem. o1 has the second-best accuracy (90%) at $0.84 per problem. That means o1 costs about 44x more per problem while scoring 3 points lower.

QwQ-32B at $0.009 per problem with 77% accuracy is the cost-efficiency champion for "good enough" applications.
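The "cost per correct answer" column and the 44x figure follow from simple arithmetic. Here is a sketch, assuming cost per correct answer is the average cost per problem divided by overall accuracy. The helper name `cost_per_correct` is illustrative.

```python
# (avg cost per problem in USD, overall accuracy), copied from the tables above.
# The Phi-4 cost is the author's approximate self-hosted figure.
MODELS = {
    "o1":                        (0.84,  0.90),
    "DeepSeek R1":               (0.019, 0.93),
    "Claude 3.7 (thinking)":     (0.14,  0.89),
    "Gemini 2.5 Pro (thinking)": (0.11,  0.88),
    "o3-mini":                   (0.07,  0.84),
    "Grok 3 (thinking)":         (0.17,  0.82),
    "QwQ-32B":                   (0.009, 0.77),
    "Phi-4 (reasoning)":         (0.001, 0.64),
}


def cost_per_correct(avg_cost, accuracy):
    """Expected spend to obtain one correct answer."""
    return avg_cost / accuracy


for name, (avg_cost, accuracy) in MODELS.items():
    print(f"{name:28s} ${cost_per_correct(avg_cost, accuracy):.3f}")

# Per-problem cost ratio between o1 and DeepSeek R1.
o1_vs_r1 = MODELS["o1"][0] / MODELS["DeepSeek R1"][0]
print(f"o1 / R1 cost ratio: {o1_vs_r1:.0f}x")
```

Dividing by accuracy penalizes models that are cheap per call but often wrong, which is why the gap between QwQ-32B and R1 narrows slightly in the last column.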

The thinking token analysis

I measured how many thinking tokens each model uses:

| Model | Avg thinking tokens | Thinking as % of total |
|-------|---------------------|------------------------|
| o1 | ~8,000 (hidden) | ~71% |
| DeepSeek R1 | 5,200 | 67% |
| Claude 3.7 (thinking) | 5,800 | 69% |
| Gemini 2.5 Pro (thinking) | 6,400 | 70% |
| o3-mini | 3,800 | 68% |

All five models in the table spend roughly two-thirds of their tokens on thinking, between 67% and 71%. The ratio is surprisingly consistent across architectures and providers.

o1 appears to think the most, but its thinking is hidden, so the ~8,000 figure is my estimate based on total tokens and cost.
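The percentage column is the average thinking tokens divided by the average total tokens from the cost table. A quick sketch of that check follows. The variable names are illustrative, and the o1 thinking count is the estimate noted above.

```python
# (avg thinking tokens, avg total tokens per problem), copied from the two tables.
TOKENS = {
    "o1":                        (8000, 11200),  # thinking count is an estimate
    "DeepSeek R1":               (5200, 7800),
    "Claude 3.7 (thinking)":     (5800, 8400),
    "Gemini 2.5 Pro (thinking)": (6400, 9100),
    "o3-mini":                   (3800, 5600),
}

# Share of each model's output spent on thinking, as a whole-number percent.
thinking_share = {
    name: round(100 * thinking / total)
    for name, (thinking, total) in TOKENS.items()
}

for name, pct in thinking_share.items():
    print(f"{name:28s} {pct}%")
```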

What I learned

| Finding | Detail |
|---------|--------|
| The spread is tighter than marketing suggests | Top 4 models are within 5 points of each other |
| DeepSeek R1 is the best overall | Highest accuracy AND lowest cost among large models |
| Level 5 problems are the differentiator | Easy problems are solved by everyone |
| Cost varies 44x for similar performance | o1 vs R1 is the starkest example |
| Small models are surprisingly viable | QwQ-32B and Phi-4 punch above their weight |

The marketing from every provider says "matches o1." My data says: yes, roughly. The top 4 are all in the 88-93% range. The differences are real but small.

If you're choosing a reasoning model for production use, the choice isn't "which one is best" (they're all close). The choice is "which one fits my cost and latency requirements."
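One way to frame that choice is as a filter: set an accuracy floor and a budget, then take the cheapest model that clears both. The sketch below uses only the accuracy and cost numbers from this post. I did not measure latency, so it is left out. The function `cheapest_model` and its parameters are hypothetical.

```python
from typing import Optional

# (overall accuracy, avg cost per problem in USD), copied from the tables above.
# The Phi-4 cost is an approximate self-hosted figure.
CANDIDATES = {
    "o1":                        (0.90, 0.84),
    "DeepSeek R1":               (0.93, 0.019),
    "Claude 3.7 (thinking)":     (0.89, 0.14),
    "Gemini 2.5 Pro (thinking)": (0.88, 0.11),
    "o3-mini":                   (0.84, 0.07),
    "Grok 3 (thinking)":         (0.82, 0.17),
    "QwQ-32B":                   (0.77, 0.009),
    "Phi-4 (reasoning)":         (0.64, 0.001),
}


def cheapest_model(min_accuracy: float,
                   max_cost: Optional[float] = None) -> Optional[str]:
    """Return the cheapest model meeting the accuracy floor and optional budget.

    Returns None when no model satisfies both constraints.
    """
    eligible = [
        (cost, name)
        for name, (accuracy, cost) in CANDIDATES.items()
        if accuracy >= min_accuracy and (max_cost is None or cost <= max_cost)
    ]
    return min(eligible)[1] if eligible else None


print(cheapest_model(min_accuracy=0.88))
print(cheapest_model(min_accuracy=0.75))
print(cheapest_model(min_accuracy=0.88, max_cost=0.01))
```

On this 100-problem set, any accuracy floor of 88% or higher that a model can meet selects DeepSeek R1, and relaxing the floor to 75% selects QwQ-32B.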

My spreadsheet now has 8 reasoning models where it used to have 1. The reasoning model market went from monopoly to commodity in six months. I don't think I've ever seen a technology category commoditize this fast.



-- dataku
