Data Stories | September 20, 2021 | 6 min read

The training cost curve is doing something weird

I plotted the estimated training costs of every major model from 2018 to 2021. The curve isn't going up linearly. It's doing something much weirder, and the inflection point was GPT-3.

I've been obsessed with a chart for the past month.

It started as a simple exercise: plot the estimated training costs of every major language model from 2018 to 2021. I expected a straight line going up. Bigger models cost more to train. Simple story, right?

The line is not straight. And what it's actually doing has some implications I can't stop thinking about.

The raw data

Estimating training costs is tricky because most companies don't publish exact numbers. I used a combination of methods: official disclosures (rare), compute estimates from the papers themselves (common), and independent analyses from Epoch AI and the Stanford AI Index Report. Where multiple estimates existed, I averaged them.

Here's the table:

| Model | Year | Parameters | Est. training cost (USD) | Source |
|-------|------|-----------|-------------------|--------|
| BERT-Large | 2018 | 340M | ~$7,000 | Google paper, compute analysis |
| GPT-2 | 2019 | 1.5B | ~$40,000 | OpenAI, independent estimates |
| T5-11B | 2019 | 11B | ~$1.3M | Google paper, TPU hours |
| Megatron-LM | 2019 | 8.3B | ~$400,000 | NVIDIA, GPU hours disclosed |
| GPT-3 | 2020 | 175B | ~$4.6M | OpenAI, Epoch AI analysis |
| GShard | 2020 | 600B | ~$5M (est.) | Google AI Blog, TPU estimates |
| Switch Transformer | 2021 | 1.6T | ~$3.5M (est.) | Google, sparse model (lower effective cost) |
| GPT-J-6B | 2021 | 6B | ~$50,000 | EleutherAI, TPU pod grant |
| Megatron-Turing NLG | 2021 | 530B | ~$12M (est.) | NVIDIA + Microsoft, A100 clusters |
| Ernie 3.0 | 2021 | 10B | ~$1.5M (est.) | Baidu, limited disclosure |

Quick note: GShard and Switch Transformer are mixture-of-experts models, so their "parameter count" is misleading. Only a subset of the parameters is active for any given input, in training as well as inference. Their effective compute cost is lower than a dense model of the same size would be.

The weird curve

When you plot these on a timeline, two things jump out.

First, the early climb from BERT to GPT-2 to T5 looks modest in absolute dollars. BERT cost about $7K, GPT-2 about $40K, and T5-11B about $1.3M. That's steep in relative terms, but it took over a year and a roughly 30x increase in parameters.

Then GPT-3 hit.

$4.6 million. For a single training run. And that opened the floodgates. Megatron-Turing NLG is estimated at $12 million. These numbers are in a completely different category.

| Year | Max training cost | Year-over-year increase |
|------|------------------|------------------------|
| 2018 | ~$7K (BERT) | Baseline |
| 2019 | ~$1.3M (T5) | 186x |
| 2020 | ~$5M (GShard) | 3.8x |
| 2021 | ~$12M (Megatron-Turing) | 2.4x |

The year-over-year multiplier is actually decreasing. 186x from 2018 to 2019, then 3.8x, then 2.4x. The raw dollars keep going up, but the rate of increase is slowing down.
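For anyone who wants to re-derive those multipliers, here is a minimal sketch in Python. The dollar figures are the rounded estimates from the table, and the names `max_cost_by_year` and `yoy_multipliers` are just illustrative.

```python
# Highest estimated training cost per year, in USD.
# These are the rounded estimates from the table, not official figures.
max_cost_by_year = {
    2018: 7_000,       # BERT-Large
    2019: 1_300_000,   # T5-11B
    2020: 5_000_000,   # GShard
    2021: 12_000_000,  # Megatron-Turing NLG
}


def yoy_multipliers(costs):
    """Return {year: cost / previous year's cost} for consecutive years."""
    years = sorted(costs)
    return {
        year: costs[year] / costs[prev]
        for prev, year in zip(years, years[1:])
    }


multipliers = yoy_multipliers(max_cost_by_year)
for year, factor in multipliers.items():
    print(f"{year}: {factor:.1f}x over the previous year")
```

Because the inputs are rounded, the 2019 ratio comes out a hair under the 186x shown in the table before rounding.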

What's causing the slowdown

Three factors are bending the curve:

1. Hardware efficiency gains

The A100 GPU (released 2020) provides roughly 2x the training throughput of the V100 (2017) for large language models. TPU v4 (2021) provides similar generational gains. So a model that would have cost $10M to train on V100s in 2019 might cost $5M on A100s in 2021. NVIDIA's published MLPerf results confirm these gains.

| Hardware | Year | Relative training speed (V100 = 1.0x) |
|----------|------|-------------------------------------|
| V100 | 2017 | 1.0x |
| A100 | 2020 | 2.0-2.5x |
| TPU v3 | 2018 | 1.3x (roughly, task-dependent) |
| TPU v4 | 2021 | 2.7x (est.) |
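The $10M-to-$5M example is just a division by the speedup. A rough sketch, assuming cost scales inversely with throughput and that an hour on the new chip costs the same as an hour on the old one (the `price_ratio` parameter is my own addition so that assumption can be relaxed):

```python
def cost_on_new_hardware(baseline_cost_usd, relative_speed, price_ratio=1.0):
    """Estimate training cost after moving to faster hardware.

    relative_speed: throughput of the new chip relative to the baseline
                    (for example 2.0 for A100 vs. V100 in the table).
    price_ratio:    hourly price of the new chip divided by the baseline's.
                    The default of 1.0 is an assumption, not a measured price.
    """
    if relative_speed <= 0:
        raise ValueError("relative_speed must be positive")
    return baseline_cost_usd * price_ratio / relative_speed


v100_cost = 10_000_000  # hypothetical $10M run on V100s
for speed in (2.0, 2.5):  # low and high end of the A100 range
    a100_cost = cost_on_new_hardware(v100_cost, speed)
    print(f"{speed}x speedup: ${a100_cost:,.0f}")
```

In practice newer chips usually rent for more per hour, so the real saving is smaller than the raw speedup suggests.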

2. Training technique improvements

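Mixed-precision training (doing most of the arithmetic in FP16 rather than FP32), gradient checkpointing, and better data parallelism strategies have reduced the cost of training a model of a given size. (Gradient checkpointing actually spends some extra compute to save memory, but that lets a larger model fit on the same hardware.) The GPT-3 paper itself notes using mixed-precision training. DeepSpeed from Microsoft has pushed this even further.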

An analysis by Epoch AI estimates that algorithmic improvements have reduced the compute required to reach a given performance level by roughly 2x every 16 months. That is a little faster than the roughly two-year doubling usually quoted for Moore's Law, so it adds up quickly.
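To get a feel for how that compounds, here is a small sketch. It assumes the 16-month figure holds steadily over the whole period, which is a simplification. The function name is illustrative.

```python
def compute_reduction(months, doubling_months=16.0):
    """Factor by which the compute needed for a fixed result shrinks.

    Assumes a steady halving of required compute every `doubling_months`;
    16 is the estimate quoted above, and any other rate can be passed in.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


# Three years roughly spans the 2018-2021 window covered by the table.
for months in (16, 32, 36):
    print(f"after {months} months: {compute_reduction(months):.2f}x less compute")
```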

3. Mixture-of-experts models

Switch Transformer has 1.6 trillion parameters but didn't cost 10x more than GPT-3 to train. That's because it's a sparse model: only a fraction of parameters are active for each input. This architectural trick breaks the assumption that "more parameters = proportionally more compute."

If MoE architectures become standard (and the trend suggests they will), the cost curve flattens significantly. You can have a model with a trillion parameters that costs roughly the same to train as a 100B dense model.
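A toy calculation makes the gap between total and active parameters concrete. This is a sketch with made-up sizes, not the real Switch Transformer configuration, and it ignores routing overhead and the communication cost of spreading experts across devices.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts,
                     experts_per_token=1):
    """Return (total, active) parameter counts for a simple MoE model.

    shared_params:     parameters every token passes through
                       (attention, embeddings, and so on).
    params_per_expert: parameters in one expert, summed over all MoE layers.
    experts_per_token: how many experts the router picks per token.
    """
    total = shared_params + params_per_expert * num_experts
    active = shared_params + params_per_expert * experts_per_token
    return total, active


# Hypothetical sizes chosen only so the totals are easy to read.
total, active = moe_param_counts(
    shared_params=36_000_000_000,
    params_per_expert=64_000_000_000,
    num_experts=15,
)
print(f"total:  {total / 1e9:.0f}B parameters")
print(f"active: {active / 1e9:.0f}B parameters per token")
print(f"ratio:  {total / active:.1f}x")
```

Training compute tracks the active count much more closely than the total, which is why the headline parameter number overstates the bill.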

The inflection point story

Here's what I think the data is telling us. Before GPT-3, the cost of training the biggest model roughly tracked a power law: each order-of-magnitude increase in parameters cost roughly 10-50x more. After GPT-3, three forces started pushing back: better hardware, better algorithms, and architectural tricks like MoE.

The result is that the cost curve is bending. Not flattening, not going down. Just bending from that steep early trend toward something slower.
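The "10-50x per order of magnitude" rule of thumb can be checked against the pre-GPT-3 rows of the table. The sketch below assumes cost grows as parameters raised to some power k between any two models, solves for k, and converts it into a per-10x multiplier. The inputs are the rounded estimates from the table, so treat the outputs as rough.

```python
import math

# (model, parameters, estimated cost in USD), from the table.
pre_gpt3 = [
    ("BERT-Large", 340e6, 7_000),
    ("GPT-2", 1.5e9, 40_000),
    ("T5-11B", 11e9, 1_300_000),
]


def cost_multiplier_per_10x(params_a, cost_a, params_b, cost_b):
    """Implied cost multiplier for a 10x increase in parameters.

    Assumes cost is proportional to params ** k between the two points,
    so k = log(cost ratio) / log(parameter ratio).
    """
    k = math.log(cost_b / cost_a) / math.log(params_b / params_a)
    return 10 ** k


for (name_a, p_a, c_a), (name_b, p_b, c_b) in zip(pre_gpt3, pre_gpt3[1:]):
    factor = cost_multiplier_per_10x(p_a, c_a, p_b, c_b)
    print(f"{name_a} -> {name_b}: ~{factor:.0f}x cost per 10x parameters")
```

With only three points this is a sanity check rather than a fit, and swapping in Megatron-LM for T5 would change the picture noticeably.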

If you extrapolate the pre-GPT-3 trend, training a 1T-parameter dense model should cost around $50-100M. If you extrapolate the post-GPT-3 trend (with MoE and hardware gains), it's more like $15-25M.

That difference matters enormously for who can participate in frontier AI research. At $100M, only Google, Microsoft, and maybe five other companies can play. At $15M, the circle of possible participants is much wider.

What I got wrong initially

I should be honest about this. My first version of this chart had the Switch Transformer at $10M, which would have shown the curve still accelerating. I'd used the wrong compute estimate (treating it as a dense model). A reader on Twitter pointed out the MoE correction. After fixing it, the curve tells a different story.

This is why I track corrections in my data. Getting it wrong and fixing it is part of the process.

Looking forward

Two things I'm watching:

First, the flow of arXiv papers on efficient training methods is picking up. In 2020, I counted 23 papers with "efficient training" in the title or abstract. In the first 8 months of 2021, I've already counted 31. The research community is actively working on bending this curve further.

Second, the big labs are getting secretive. OpenAI hasn't disclosed GPT-3's training cost officially. The $4.6M figure is an independent estimate. Google rarely discloses TPU hours in enough detail to calculate costs. As training costs become competitive intelligence, the data will get harder to find.

I'll keep estimating. The curve tells a story, even if the exact numbers are approximate. And right now, the story it's telling is more interesting than "line goes up."

-- dataku