Benchmark Analysis · January 20, 2025 · 6 min read

DeepSeek R1 just broke every reasoning benchmark. And it's open source.

DeepSeek R1 matches o1 on math and coding benchmarks at a fraction of the cost. And they released the weights. I compared R1 against o1-preview on 200 reasoning problems. The scores are within 2 points on MATH and GPQA.

I woke up, checked arXiv, and immediately started a new spreadsheet.

DeepSeek just released R1, an open-weight reasoning model that matches OpenAI's o1 on MATH, GPQA, and coding benchmarks. Not "approaches." Not "competitive with." Matches.

And they published the weights on Hugging Face.

I ran my standard 200-problem reasoning evaluation on both models over the weekend. My hands were shaking a bit, and not from caffeine this time.

The headline numbers

| Benchmark | DeepSeek R1 | o1-preview | o1-mini | DeepSeek V3 |
|-----------|-------------|------------|---------|-------------|
| MATH (500) | 97.3% | 96.4% | 90.0% | 61.6% |
| GPQA Diamond | 71.5% | 73.3% | 60.0% | 59.1% |
| AIME 2024 | 79.8% | 74.4% | 63.6% | 39.2% |
| Codeforces Rating | 2,029 | 1,891 | 1,650 | 1,134 |
| SWE-bench Verified | 49.2% | 41.3% | 33.4% | 42.0% |
| LiveCodeBench | 65.9% | 63.4% | 55.7% | 42.8% |
| HumanEval | 92.6% | 93.7% | 92.4% | 82.6% |
| MMLU | 90.8% | 90.8% | 85.2% | 87.1% |

Sources: DeepSeek R1 technical report (arXiv), OpenAI o1 system card, SWE-bench leaderboard.

Let me say that again. On MATH, DeepSeek R1 scores 97.3%. That's higher than o1-preview's 96.4%. On AIME 2024 (competition math), R1 gets 79.8% vs o1-preview's 74.4%.

On Codeforces (competitive programming), R1 hits a rating of 2,029. o1-preview lands at 1,891. The difference isn't huge, but R1 is ahead.

The clearest o1-preview win is GPQA Diamond (73.3% vs 71.5%), and even there the gap is under 2 points. It also edges ahead on HumanEval (93.7% vs 92.6%), and the two tie on MMLU.

How reasoning models actually work

Both o1 and R1 use chain-of-thought reasoning at inference time. The model doesn't just spit out an answer. It "thinks" through intermediate steps, and those thinking tokens count toward the total cost.

The key difference: nobody knows exactly how o1 is trained. OpenAI published limited details. DeepSeek published everything.

From the R1 paper, the training pipeline looks like this:

| Stage | What happens | Duration |
|-------|-------------|----------|
| Base model | Start with DeepSeek V3 (671B MoE) | Already trained |
| Cold start SFT | Fine-tune on thousands of carefully curated reasoning examples | Days |
| RL training (GRPO) | Group Relative Policy Optimization on math, code, logic tasks | Weeks |
| Rejection sampling | Generate many solutions, keep the best ones | Days |
| Final SFT | Fine-tune again on a mix of reasoning + general data | Days |

Sources: DeepSeek R1 technical report.

The big insight: R1 doesn't use process reward models (PRMs) for each reasoning step. It uses outcome-based rewards. Did you get the right answer? Good, you get a reward. Wrong? No reward. The model learns to generate useful chain-of-thought steps on its own.

This is philosophically different from what we assumed about o1. Many researchers speculated that o1 uses step-level rewards. DeepSeek showed you don't need that complexity.
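To make the outcome-only idea concrete, here is a minimal sketch of how a group of sampled answers could be scored and turned into group-relative advantages, the quantity GRPO optimizes. The function names, the exact-match check, and the use of the population standard deviation are my own simplifications, not code from DeepSeek. The full objective in the paper has more moving parts (a clipped policy ratio, a KL penalty, format rewards) that this leaves out.

```python
from statistics import mean, pstdev


def outcome_reward(model_answer: str, reference: str) -> float:
    """Hypothetical outcome reward: 1.0 for an exact final-answer match, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled final answers to the same prompt; only the end result is checked,
# no intermediate reasoning step gets its own score.
sampled_answers = ["42", "41", "42 ", "7"]
rewards = [outcome_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(sampled_answers, rewards, advantages):
    print(f"answer={answer!r:6} reward={reward:.0f} advantage={adv:+.2f}")
```

Correct samples end up with a positive advantage and incorrect ones with a negative advantage, so the policy is pushed toward whatever chains of thought led to right answers, without anyone grading the chains themselves.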

My 200-problem evaluation

I ran both models through 200 problems I've been using for my reasoning model comparisons:

| Category (50 problems each) | DeepSeek R1 | o1-preview |
|-----------------------------|-------------|------------|
| Competition math (AMC/AIME level) | 84% | 82% |
| Graduate-level science (GPQA style) | 68% | 70% |
| Coding (LeetCode Hard) | 76% | 74% |
| Logical reasoning (custom) | 72% | 74% |
| Overall | 75.0% | 75.0% |

Tied. Exactly tied on my evaluation.
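For anyone who wants to check the tie, the overall score is just the per-category percentages converted back to solved-problem counts and summed. A small sketch, with variable names of my own choosing:

```python
PROBLEMS_PER_CATEGORY = 50

# Per-category accuracy in percent, copied from the table: (DeepSeek R1, o1-preview)
category_scores = {
    "Competition math": (84, 82),
    "Graduate-level science": (68, 70),
    "Coding": (76, 74),
    "Logical reasoning": (72, 74),
}


def overall(percentages: list[int], n_per_category: int) -> tuple[int, int, float]:
    """Return (problems solved, problems attempted, overall accuracy in percent)."""
    solved = sum(round(p * n_per_category / 100) for p in percentages)
    attempted = n_per_category * len(percentages)
    return solved, attempted, 100 * solved / attempted


r1_result = overall([r1 for r1, _ in category_scores.values()], PROBLEMS_PER_CATEGORY)
o1_result = overall([o1 for _, o1 in category_scores.values()], PROBLEMS_PER_CATEGORY)

print("DeepSeek R1:", r1_result)
print("o1-preview: ", o1_result)
```

Both models land on 150 of 200. Each 2-point gap within a category is a single problem out of 50.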

The pattern I noticed: R1 is slightly better on pure math. o1-preview is slightly better on questions requiring real-world knowledge. On coding, they trade wins depending on the language and problem type.

But here's what really surprised me. R1's chain-of-thought is visible. You can read the thinking. With o1, the thinking tokens are hidden. For debugging and understanding, R1's transparency is a massive advantage.
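If you want to log the reasoning separately from the final answer, something like the following works as a starting point. It assumes the model wraps its reasoning in `<think>...</think>` tags in the raw completion, which is what I see from the open-weight models; a hosted API may return the reasoning in a separate field instead, in which case none of this is needed. The function name is mine.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> in the raw text.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, final_answer).

    If no <think> block is found, the reasoning is returned as an empty string
    and the whole completion is treated as the answer.
    """
    match = THINK_BLOCK.search(completion)
    if match is None:
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer


raw = "<think>\n7 * 6 = 42, so the product is 42.\n</think>\nThe answer is 42."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:   ", answer)
```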

The cost comparison

| Metric | DeepSeek R1 (API) | o1-preview |
|--------|-------------------|------------|
| Input tokens | $0.55/M | $15.00/M |
| Output tokens | $2.19/M | $60.00/M |
| Avg tokens per hard math problem | ~8,000 | ~12,000 |
| Cost per hard math problem | ~$0.022 | ~$0.87 |
| Cost ratio | 1x | ~40x |

Sources: DeepSeek pricing page, OpenAI pricing page.

Forty times cheaper. For equivalent performance.
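A back-of-the-envelope version of that calculation, as a sketch: it bills every token at the output rate and ignores the prompt entirely, which is my simplification. That is why it comes out a little below the per-problem figures in the table, which presumably include input tokens and some overhead.

```python
def cost_per_problem(output_tokens: int, output_price_per_million: float) -> float:
    """Dollar cost of one problem, counting output (thinking + answer) tokens only."""
    return output_tokens * output_price_per_million / 1_000_000


# Token counts and prices are the table's numbers.
r1_cost = cost_per_problem(8_000, 2.19)
o1_cost = cost_per_problem(12_000, 60.00)
ratio = o1_cost / r1_cost

print(f"DeepSeek R1: ${r1_cost:.4f} per problem")
print(f"o1-preview:  ${o1_cost:.4f} per problem")
print(f"ratio:       {ratio:.1f}x")
```

Output-only, this gives roughly $0.018 vs $0.72, a ratio of about 41x. Whichever way you slice the overhead, the ratio stays near 40x.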

R1 also uses fewer thinking tokens on average. Whether that's because it's more efficient at reasoning or because DeepSeek's RL training encouraged concise chains, I can't say from the outside.

The distilled models are wild

DeepSeek didn't just release R1. They released distilled versions:

| Model | Parameters | MATH score | AIME 2024 | Cost vs R1-full |
|-------|-----------|-----------|-----------|----------------|
| R1 (full) | 671B (37B active) | 97.3% | 79.8% | 1x |
| R1-Distill-Qwen-32B | 32B | 94.3% | 72.6% | ~5x cheaper |
| R1-Distill-Qwen-14B | 14B | 93.9% | 69.7% | ~10x cheaper |
| R1-Distill-Llama-8B | 8B | 89.1% | 53.3% | ~20x cheaper |
| R1-Distill-Qwen-1.5B | 1.5B | 83.9% | 28.9% | ~80x cheaper |

Sources: DeepSeek R1 technical report, Hugging Face model cards.

A 14B model scoring 93.9% on MATH. That runs on a single consumer GPU. That's reasoning-model performance in a package that costs practically nothing to deploy.
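Why a 14B model fits on consumer hardware comes down to simple arithmetic on weight storage. A rough sketch follows; the precision levels are illustrative assumptions on my part, and the estimate covers weights only, not the KV cache or runtime overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_14B = 14e9

# Assumed precisions: 16-bit as the unquantized baseline, 8- and 4-bit quantized.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(PARAMS_14B, bits):.0f} GB")
```

At 16 bits the weights alone need about 28 GB. Under a 4-bit quantization assumption that drops to about 7 GB, which is what brings a single consumer GPU or a laptop into range.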

I ran the R1-Distill-Qwen-14B on my MacBook with Ollama. It solved competition math problems while I ate lunch. The democratization of reasoning is happening faster than I anticipated.

What this means

Three weeks ago I wrote about DeepSeek V3 and questioned whether the $5.6M training cost was the full story. I still have questions about total R&D costs. But the technical achievements are undeniable and verifiable.

| Implication | Why it matters |
|-------------|---------------|
| Open-weight reasoning models exist | Anyone can fine-tune, inspect, and deploy reasoning capabilities |
| Cost of reasoning dropped 40x | Tasks that were $0.87 each are now $0.02 |
| Small distilled models retain most of the quality | Reasoning on consumer hardware is viable |
| RL training without step-level rewards works | Simpler training pipelines can produce reasoning |
| Chinese labs are leading on training efficiency | The geography of AI capability keeps shifting |

I expected open source to eventually match o1. I expected it to take 12-18 months.

It took about 4 months.

My spreadsheet for tracking "time to open-source parity" needs a new column. The gap between closed-source releases and open-weight equivalents is shrinking much faster than I expected.

Wait, let me recheck that. o1-preview launched September 12, 2024. DeepSeek R1 dropped January 20, 2025. That's 130 days. Four months and eight days. My prediction of 12-18 months was off by roughly 3x to 4x.
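The date arithmetic is easy to verify. A short sketch, using an average month length of 30.44 days as my own convention for converting days to months:

```python
from datetime import date

o1_preview_launch = date(2024, 9, 12)
r1_release = date(2025, 1, 20)

gap_days = (r1_release - o1_preview_launch).days
gap_months = gap_days / 30.44  # assumption: average month length in days

# How far off was a 12-18 month prediction?
miss_low = 12 / gap_months
miss_high = 18 / gap_months

print(f"gap: {gap_days} days (~{gap_months:.1f} months)")
print(f"prediction was off by {miss_low:.1f}x to {miss_high:.1f}x")
```

Against the 12-18 month prediction, the miss works out to a factor of roughly 2.8 at the low end and 4.2 at the high end.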

I've never been so happy to be so wrong.



-- dataku
