Model Comparisons · January 20, 2026 · 4 min read

DeepSeek R2: the open source reasoning model that costs pennies

DeepSeek R2 matches or beats o3 on math benchmarks at a small fraction of the inference cost. On my standard 200-problem reasoning evaluation, R2 scores 81.0% overall vs o3's 78.5%. At $0.006 vs $0.89 per hard problem, the economics aren't even close.

DeepSeek just released R2 and the pricing makes R1 look expensive.

R1 was already 40x cheaper than o1. R2 is another 3x cheaper than R1. And the benchmarks are better.

The headline comparison

| Benchmark | DeepSeek R2 | o3 | DeepSeek R1 | Claude Opus 4 (thinking) |
|-----------|-------------|----|-------------|--------------------------|
| MATH (500) | 98.1% | 97.0% | 97.3% | 96.8% |
| GPQA Diamond | 78.4% | 75.1% | 71.5% | 76.3% |
| AIME 2024 | 83.2% | 80.8% | 79.8% | 73.6% |
| HumanEval | 95.8% | 94.8% | 92.6% | 96.8% |
| LiveCodeBench | 72.4% | 68.4% | 65.9% | 70.1% |
| SWE-bench Verified | 54.6% | 51.8% | 49.2% | 58.7% |
| Codeforces Rating | 2,318 | 2,104 | 2,029 | 1,890 |

Sources: DeepSeek R2 technical report, OpenAI, LMSYS Chatbot Arena, Hugging Face.

DeepSeek R2 beats o3 on MATH (98.1% vs 97.0%), GPQA (78.4% vs 75.1%), AIME (83.2% vs 80.8%), and Codeforces (2,318 vs 2,104).

o3 doesn't lead on any of these benchmarks anymore.

Claude Opus 4 with thinking still leads on SWE-bench Verified (58.7% vs 54.6%) and on HumanEval (96.8% vs 95.8%). For real-world bug fixing, Claude remains the best. But on pure reasoning and competition problems, R2 is the new king.

The cost data

| Metric | DeepSeek R2 | o3 | DeepSeek R1 |
|--------|-------------|----|-------------|
| Input price (per 1M tokens) | $0.20 | $15.00 | $0.55 |
| Output price (per 1M tokens) | $0.80 | $60.00 | $2.19 |
| Avg tokens per hard problem | 6,200 | 14,000 | 7,800 |
| Cost per hard problem | $0.006 | $0.89 | $0.019 |
| Cost per correct answer (hard, 80%+ accuracy) | $0.007 | $1.10 | $0.024 |

Sources: DeepSeek, OpenAI pricing pages, my measurements.

$0.006 per hard reasoning problem. Less than a penny.

o3 costs $0.89 for the same problem. That's a 148x cost difference. And R2 gets more of them right.
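The per-problem figures can be roughly reproduced from the list prices. The sketch below is only an illustration, and it assumes the average token counts in the cost table are billed entirely as output tokens. The table does not split prompt and completion tokens, so `output_only_cost` is a lower bound and lands slightly under the table's numbers. The function and dictionary names are made up for this example.

```python
# Rough reproduction of the cost table's arithmetic.
# Assumption: "avg tokens per hard problem" are all billed as output tokens.
# Prompt tokens are not broken out in the table, so these are lower bounds.

OUTPUT_PRICE_PER_M = {"R2": 0.80, "o3": 60.00, "R1": 2.19}  # USD per 1M output tokens
AVG_TOKENS = {"R2": 6_200, "o3": 14_000, "R1": 7_800}       # avg tokens per hard problem
TABLE_COST = {"R2": 0.006, "o3": 0.89, "R1": 0.019}         # table's cost per hard problem


def output_only_cost(model: str) -> float:
    """Cost in USD of one hard problem, counting output tokens only."""
    return AVG_TOKENS[model] * OUTPUT_PRICE_PER_M[model] / 1_000_000


def table_cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs, using the table's figures."""
    return TABLE_COST[expensive] / TABLE_COST[cheap]


if __name__ == "__main__":
    for model in ("R2", "o3", "R1"):
        print(f"{model}: ~${output_only_cost(model):.4f} output-only, "
              f"${TABLE_COST[model]} in table")
    print(f"o3 vs R2: {table_cost_ratio('o3', 'R2'):.0f}x")
```

The output-only estimates come out around $0.005 for R2 and $0.84 for o3, with the remainder in the table presumably coming from prompt tokens.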

R2 is also more token-efficient than R1. It uses 6,200 thinking tokens on average vs R1's 7,800. DeepSeek clearly improved the thinking efficiency, not just the accuracy.
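One way to see how much of the R1-to-R2 saving comes from shorter thinking rather than lower prices is to split the cost ratio into two factors. This is a back-of-the-envelope sketch that assumes output tokens dominate the bill; the dictionary layout and function names are illustrative only.

```python
# Split the R1 -> R2 cost reduction into a price factor and a token factor.
# Assumption: cost is dominated by output tokens (prompt tokens ignored).

R1 = {"output_price_per_m": 2.19, "avg_tokens": 7_800}
R2 = {"output_price_per_m": 0.80, "avg_tokens": 6_200}


def token_reduction(old: dict, new: dict) -> float:
    """Fraction of thinking tokens saved per problem."""
    return 1 - new["avg_tokens"] / old["avg_tokens"]


def cost_factors(old: dict, new: dict) -> tuple[float, float, float]:
    """Return (price factor, token factor, combined factor) of the saving."""
    price = old["output_price_per_m"] / new["output_price_per_m"]
    tokens = old["avg_tokens"] / new["avg_tokens"]
    return price, tokens, price * tokens


if __name__ == "__main__":
    price, tokens, combined = cost_factors(R1, R2)
    print(f"tokens saved: {token_reduction(R1, R2):.1%}")
    print(f"price: {price:.2f}x, tokens: {tokens:.2f}x, combined: {combined:.2f}x")
```

Under this assumption the price cut accounts for about 2.7x and the shorter thinking chains for about 1.26x, roughly 3.4x combined, which is in the same range as the gap between $0.019 and $0.006.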

My 200-problem reasoning evaluation

| Category | DeepSeek R2 | o3 | DeepSeek R1 |
|----------|-------------|----|-------------|
| Competition math (50) | 88% | 86% | 84% |
| Graduate science (50) | 76% | 74% | 68% |
| Coding (LeetCode Hard, 50) | 82% | 78% | 76% |
| Logical reasoning (50) | 78% | 76% | 72% |
| Overall | 81.0% | 78.5% | 75.0% |

R2 leads in all four categories. The improvement from R1 to R2 (+6 points overall) is larger than the improvement from o1-preview to o3 in the same period.
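Since each category has 50 problems, the overall score is just the pooled fraction correct out of 200. A minimal sketch of that bookkeeping, using the category percentages from the table (the names and data layout are only for illustration):

```python
# Pool the four 50-problem categories into an overall score out of 200.
PROBLEMS_PER_CATEGORY = 50

# Category order: competition math, graduate science, coding, logical reasoning.
SCORES_PCT = {
    "R2": [88, 76, 82, 78],
    "o3": [86, 74, 78, 76],
    "R1": [84, 68, 76, 72],
}


def correct_count(pcts: list[int]) -> int:
    """Number of problems solved, summed over all categories."""
    return sum(p * PROBLEMS_PER_CATEGORY // 100 for p in pcts)


def overall_pct(pcts: list[int]) -> float:
    """Overall accuracy in percent across all categories."""
    total = PROBLEMS_PER_CATEGORY * len(pcts)
    return 100 * correct_count(pcts) / total


if __name__ == "__main__":
    for model, pcts in SCORES_PCT.items():
        print(f"{model}: {correct_count(pcts)}/200 = {overall_pct(pcts):.1f}%")
```

That works out to 162, 157, and 150 correct answers out of 200, so the 2.5-point gap between R2 and o3 corresponds to five problems.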

The architecture improvements

From the technical report:

| Feature | R1 | R2 |
|---------|----|----|
| Base model | DeepSeek V3 (671B, 37B active) | DeepSeek V4 (800B, 45B active) |
| RL training | GRPO | GRPO v2 (improved reward signal) |
| Thinking efficiency | Average | 20% fewer tokens per problem |
| Multi-step planning | Implicit | Explicit (structured thinking) |
| Tool use | Basic | Improved (code execution during thinking) |

Sources: DeepSeek R2 technical report.

The base model upgrade (V3 to V4) gives R2 more knowledge to reason over. The GRPO v2 training produces more efficient thinking chains. And the explicit multi-step planning reduces circular reasoning.

Open source distilled models

| Model | Parameters | MATH score | Cost per hard problem |
|-------|------------|------------|-----------------------|
| R2 (full) | 800B (45B active) | 98.1% | $0.006 |
| R2-Distill-Qwen-32B | 32B | 95.8% | ~$0.002 |
| R2-Distill-Llama-14B | 14B | 93.2% | ~$0.001 |
| R2-Distill-Qwen-7B | 7B | 88.6% | ~$0.0003 |

Sources: DeepSeek R2 technical report, Hugging Face.

A 7B distilled model scoring 88.6% on MATH. Running locally on a MacBook. For free.

A year ago, o1-preview scored 96.4% on MATH and cost $0.87 per problem. Now a model you can run on your laptop gets within 8 points of that score at essentially zero cost.

The democratization of reasoning capability happened faster than anyone predicted. Including me.


-- dataku
