DeepSeek V3: a Chinese model that costs almost nothing to train

$5.6 million.

That's what DeepSeek claims it cost to train their new V3 model: a mixture-of-experts model with 671 billion total parameters that matches or beats Llama 3.1 405B and GPT-4o on multiple benchmarks.

For context, industry estimates put GPT-4's training cost at $100M+. Llama 3.1 405B probably cost Meta $30-50M in compute alone.

If DeepSeek's number is real, everything we thought about the cost barrier to frontier AI just changed.

The model architecture

| Spec | DeepSeek V3 | Llama 3.1 405B | GPT-4o (estimated) |
|------|-------------|----------------|-------------------|
| Total parameters | 671B | 405B | Unknown (est. 200-500B) |
| Active parameters | 37B | 405B | Unknown |
| Architecture | MoE (mixture of experts) | Dense transformer | Unknown (likely MoE) |
| Training tokens | 14.8T | 15T | Unknown |
| Training hardware | 2,048 H800 GPUs | 16,384 H100 GPUs | Unknown |
| Training time | ~2 months | ~3 months | Unknown |
| Estimated training cost | $5.6M | $30-50M est. | $100M+ est. |
| Context window | 128K | 128K | 128K |

Sources: DeepSeek V3 technical report (arXiv), Meta AI Llama 3.1 paper, industry estimates for GPT-4.

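The key number: 37 billion active parameters. DeepSeek V3 is a MoE model that activates only 37B of its 671B parameters per token. This means it has the knowledge capacity of a 671B model but the per-token compute cost of a ~37B model. The catch is memory: all 671B parameters still have to be loaded to serve it, so the savings show up in FLOPs, not in hardware footprint.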
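To make the active-versus-total distinction concrete, here is a toy sketch of top-k expert routing. The expert count, the k value, and the router scores are made up for illustration, and the softmax weighting is a generic choice rather than a claim about DeepSeek's actual gating function. Only the 37B and 671B figures come from the spec table.

```python
import math

TOTAL_PARAMS_B = 671   # total parameters, from the spec table
ACTIVE_PARAMS_B = 37   # parameters used per token, from the spec table


def top_k_route(scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in chosen)          # subtract for numerical stability
    exps = [math.exp(scores[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def active_fraction(active_b=ACTIVE_PARAMS_B, total_b=TOTAL_PARAMS_B):
    """Share of the model's weights that take part in one token's forward pass."""
    return active_b / total_b


if __name__ == "__main__":
    # Hypothetical router scores for 8 experts; a real router learns these.
    router_scores = [0.1, 2.3, -0.7, 1.9, 0.0, 0.4, -1.2, 0.8]
    for expert, weight in top_k_route(router_scores, k=2):
        print(f"expert {expert}: weight {weight:.3f}")
    print(f"active share of parameters: {active_fraction():.1%}")
```

Every token only pays for the experts the router picks, which is why roughly 5.5% of the weights do the work on any given token.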

The benchmark data

| Benchmark | DeepSeek V3 | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|-----------|-------------|----------------|--------|-------------------|
| MMLU | 87.1% | 87.3% | 88.7% | 88.7% |
| HumanEval | 82.6% | 89.0% | 90.2% | 93.7% |
| MATH | 61.6% | 73.8% | 76.6% | 78.3% |
| GPQA | 59.1% | 51.1% | 53.6% | 59.4% |
| GSM8K | 89.3% | 96.8% | 95.8% | 96.4% |
| ARC-Challenge | 91.4% | 96.9% | 96.7% | 96.7% |
| MGSM | 79.8% | 91.6% | 90.5% | 91.6% |
| IFEval | 86.2% | 86.0% | 85.4% | 86.9% |
| SWE-bench Verified | 42.0% | N/A | 33.2% | 49.0% |
| Codeforces Rating | 51st pct | N/A | 11th pct | N/A |

Sources: DeepSeek V3 technical report, prior model papers, SWE-bench leaderboard.

DeepSeek V3 is competitive with the frontier on most benchmarks. On MMLU (87.1% vs 87.3% for Llama 3.1 405B), it's a virtual tie. On GPQA (59.1%), it actually beats Llama 3.1 405B (51.1%) and GPT-4o (53.6%).

On SWE-bench Verified (42.0%), it beats GPT-4o (33.2%) but trails Claude 3.5 Sonnet (49.0%).

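The weaknesses: GSM8K (89.3% vs 96.8% for Llama 3.1 405B) and ARC-Challenge (91.4% vs 96.9%) show clear gaps, and the table has similar deficits on MATH (61.6% vs 73.8%), MGSM (79.8% vs 91.6%), and HumanEval (82.6% vs 89.0%). It's not uniformly frontier. But on the aggregate, it's in the conversation.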

The training cost breakdown

The DeepSeek technical report provides an unusually detailed cost breakdown:

| Component | Details | Estimated cost |
|-----------|---------|---------------|
| Training compute | 2,048 H800 GPUs for 2,788K GPU-hours (pre-training, context extension, and post-training combined) | $5.576M |
| Hardware | Rented NVIDIA H800 80GB | Included in compute |
| Data curation | Not specified separately | Minimal (existing pipeline) |
| RLHF/alignment | Post-training compute is part of the GPU-hour total | Minimal |
| Total reported | | $5.576M |

Source: DeepSeek V3 technical report, Section 2.

2,788,000 GPU-hours on H800 chips. At the rental price of $2.00/GPU-hour that the report assumes, that's $5.576M.
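The arithmetic is simple enough to check yourself. This sketch multiplies the reported GPU-hours by the $2.00 rate and, as a sanity check on the "~2 months" figure, converts the same GPU-hours into wall-clock days. The wall-clock number assumes all 2,048 GPUs run continuously with no downtime, which is my simplification.

```python
GPU_HOURS = 2_788_000          # total H800 GPU-hours, from the technical report
RATE_USD_PER_GPU_HOUR = 2.00   # assumed rental price
NUM_GPUS = 2_048               # cluster size, from the spec table


def compute_cost_usd(gpu_hours, rate_usd_per_gpu_hour):
    """Rental cost of a training run: GPU-hours times hourly price."""
    return gpu_hours * rate_usd_per_gpu_hour


def wall_clock_days(gpu_hours, num_gpus):
    """Days of training if every GPU runs in parallel the whole time."""
    return gpu_hours / num_gpus / 24


if __name__ == "__main__":
    cost = compute_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
    days = wall_clock_days(GPU_HOURS, NUM_GPUS)
    print(f"compute cost: ${cost / 1e6:.3f}M")
    print(f"wall-clock time on {NUM_GPUS} GPUs: {days:.1f} days")
```

The GPU-hours work out to a bit under 57 days on 2,048 GPUs, which lines up with the roughly two months in the spec table.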

A few important caveats:

1. H800 vs H100. DeepSeek used NVIDIA H800 GPUs, the export-restricted version sold to China. H800s have reduced interconnect bandwidth (400 GB/s vs 900 GB/s for H100). They cost less to rent in China but have lower multi-GPU training efficiency. DeepSeek's training innovations specifically address this bandwidth limitation.

2. The $5.6M excludes R&D costs: the team's salaries, the cost of developing the training framework, and the earlier DeepSeek V2 experiments that informed V3. The $5.6M is pure compute cost for the final training run.

3. China's compute costs are lower. GPU rental in China is cheaper than in the US, partly due to subsidies and partly due to lower labor costs for data center operations.

That said, even if you double or triple the cost to account for US rental prices and hidden costs, $11-17M for a frontier model is dramatically cheaper than the $100M+ estimates for GPT-4.
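Here is that sensitivity check as a quick sketch. The 2x and 3x multipliers are the ones from the paragraph above, and the $100M GPT-4 figure is the lower bound of the industry estimate, not a confirmed number.

```python
BASE_COST_M = 5.576        # reported compute cost, in millions of USD
GPT4_ESTIMATE_M = 100.0    # lower bound of the industry estimate, in millions


def scaled_costs(base_cost_m, multipliers):
    """Map each multiplier to the cost it implies, in millions of USD."""
    return {m: base_cost_m * m for m in multipliers}


def times_cheaper(cost_m, reference_m):
    """How many times smaller cost_m is than reference_m."""
    return reference_m / cost_m


if __name__ == "__main__":
    for multiplier, cost in scaled_costs(BASE_COST_M, [1, 2, 3]).items():
        ratio = times_cheaper(cost, GPT4_ESTIMATE_M)
        print(f"{multiplier}x -> ${cost:.1f}M, {ratio:.1f}x cheaper than $100M")
```

Even the pessimistic 3x case stays about 6x below the $100M mark.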

How did they do it so cheaply?

Three technical innovations, all documented in the paper:

| Innovation | Impact | How it works |
|-----------|--------|-------------|
| FP8 mixed-precision training | 2x compute reduction | Uses 8-bit floating point for most operations, 16-bit for critical ones |
| Multi-token prediction | 1.5x training efficiency | Model predicts 2 tokens ahead simultaneously, more learning per forward pass |
| Efficient MoE routing | 1.3x parameter efficiency | Better expert selection reduces wasted computation |

The FP8 training is the biggest deal. Most models train in FP16 or BF16 (16-bit precision). DeepSeek trained in FP8 (8-bit) with careful management of numerical precision in critical layers. This roughly halves the compute needed per training step.
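To get a feel for what "careful management of numerical precision" is up against, this sketch rounds a number to the mantissa width of an 8-bit format and of BF16 and compares the error. It assumes the E4M3 flavor of FP8 (3 mantissa bits) and ignores exponent range, subnormals, and scaling factors, so it illustrates the precision gap only and is not how DeepSeek's training kernels work.

```python
import math

E4M3_MANTISSA_BITS = 3   # FP8 E4M3: 4 exponent bits, 3 mantissa bits
BF16_MANTISSA_BITS = 7   # BF16: 8 exponent bits, 7 mantissa bits


def round_to_mantissa_bits(x, mantissa_bits):
    """Round x to the nearest value with the given number of explicit mantissa bits.

    Simplification: exponent range limits and subnormals are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)     # implicit leading bit plus explicit bits
    return math.ldexp(round(m * steps) / steps, e)


def relative_error(x, mantissa_bits):
    """Relative rounding error of x at the given mantissa width."""
    return abs(round_to_mantissa_bits(x, mantissa_bits) - x) / abs(x)


if __name__ == "__main__":
    value = 0.3
    for name, bits in [("FP8 E4M3", E4M3_MANTISSA_BITS), ("BF16", BF16_MANTISSA_BITS)]:
        rounded = round_to_mantissa_bits(value, bits)
        print(f"{name}: {value} -> {rounded} (error {relative_error(value, bits):.2%})")
```

With only 3 mantissa bits, a single value can be off by a few percent, which is why the sensitive parts of the network are kept in higher precision.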

Combined with multi-token prediction (the model learns from predicting 2 tokens per step instead of 1), the total training efficiency is roughly 3x better than a standard training pipeline.
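A small sketch of both ideas: how training targets look when each position predicts two tokens instead of one, and how the 2x and 1.5x multipliers combine. The target construction is a generic illustration of multi-token prediction, not DeepSeek's module design, and multiplying the gains assumes they are independent, which is my simplification.

```python
import math


def multi_token_targets(tokens, depth=2):
    """Pair each position with the next `depth` tokens it should predict.

    Positions that do not have `depth` tokens after them are dropped.
    """
    return [(tokens[i], tokens[i + 1:i + 1 + depth])
            for i in range(len(tokens) - depth)]


def combined_speedup(multipliers):
    """Overall gain if each efficiency multiplier applies independently."""
    return math.prod(multipliers)


if __name__ == "__main__":
    sentence = ["The", "cat", "sat", "on", "mat"]
    for current, targets in multi_token_targets(sentence):
        print(f"{current!r} -> predict {targets}")
    # 2x from FP8 and 1.5x from multi-token prediction, per the table above.
    print(f"combined: {combined_speedup([2.0, 1.5]):.1f}x")
```

Adding the 1.3x routing figure from the table would push the product to about 3.9x under the same independence assumption.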

What this means

| If DeepSeek's costs are real... | Implication |
|--------------------------------|------------|
| Frontier training costs ~$5-15M, not $100M+ | Many more organizations can afford frontier training |
| H800 (export-restricted chip) is sufficient | US chip export controls are less effective than assumed |
| MoE + FP8 + multi-token prediction works | Training efficiency innovations matter more than raw compute |
| Chinese AI is cost-competitive | The "compute moat" argument for US AI dominance weakens |

The industry narrative in 2024 was: "only organizations with $100M+ budgets can train frontier models." DeepSeek V3 suggests the real number might be 10-20x lower if you're smart about training efficiency.

This has implications for everyone. For AI startups: the barrier to entry for frontier-class models is much lower than assumed. For NVIDIA: raw chip count may not be the bottleneck people think, since efficiency improvements can compensate for having fewer or weaker chips. For geopolitics: export controls on H100s pushed Chinese labs toward H800s, which forced them to innovate on training efficiency, which made them more cost-effective.

My honest assessment

I'm not 100% convinced the $5.6M number tells the full story. It excludes R&D, failed experiments, and the cost of the data pipeline. The "true" cost including everything is probably $15-25M.

But even at $25M, that's roughly 1.2-2x cheaper than the $30-50M estimate for Llama 3.1 405B and at least 4x cheaper than the $100M+ estimates for GPT-4. The efficiency innovations are real and verifiable (FP8 training, multi-token prediction). The benchmarks are real and reproducible (the model is open weight).

This is the most important training cost data point since GPT-3's estimated $4.6M in 2020. Except this model is frontier-competitive, not just interesting.

The spreadsheet has a new column: training cost efficiency (benchmark score per dollar of training compute). DeepSeek V3 just set the record.
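One way to define that column, as a rough sketch: benchmark points per million dollars of training compute. The choice of MMLU as the score and the function name are mine, and the Llama cost is the $30-50M estimate from the spec table, so the output is only as good as those inputs.

```python
def score_per_million_usd(score_pct, cost_million_usd):
    """Benchmark points bought per million dollars of training compute."""
    return score_pct / cost_million_usd


if __name__ == "__main__":
    # MMLU scores and cost figures come from the tables above.
    deepseek = score_per_million_usd(87.1, 5.576)
    llama_low_cost = score_per_million_usd(87.3, 30.0)
    llama_high_cost = score_per_million_usd(87.3, 50.0)
    print(f"DeepSeek V3: {deepseek:.1f} MMLU points per $1M")
    print(f"Llama 3.1 405B: {llama_high_cost:.1f} to {llama_low_cost:.1f} MMLU points per $1M")
```

A single benchmark is a crude numerator, so averaging several scores would be the obvious refinement.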

-- dataku
