Benchmark Analysis · September 26, 2022 · 7 min read

The Chinchilla scaling laws changed everything. Let me show you why.

DeepMind's Chinchilla paper says most large models are undertrained. I ran the numbers: if Chinchilla's scaling laws are right, GPT-3 should have been trained on about 11.7x as much data as it was. The implications are huge.

I need to talk about a paper that I think will be remembered as one of the most important AI publications of the decade.

DeepMind's Chinchilla paper ("Training Compute-Optimal Large Language Models") dropped in March 2022, and it basically said: everyone is building models wrong.

Not wrong in a small way. Wrong in a fundamental, "you're wasting billions of dollars" way. Let me walk through the math.

The old assumption

Before Chinchilla, the prevailing wisdom came from Kaplan et al.'s scaling laws paper (2020, OpenAI). The key finding was that model performance improves predictably as you increase model size (parameters). The recommendation: if you have more compute, make the model bigger.

This led to the parameter arms race. GPT-3 (175B), Gopher (280B), Megatron-Turing NLG (530B), PaLM (540B). Bigger, bigger, bigger.

The Kaplan paper did mention training data, but the emphasis was on parameters. And that emphasis shaped how every major lab allocated their compute budgets for two years.

The Chinchilla finding

DeepMind trained over 400 models ranging from 70M to 16B parameters, on datasets from 5B to 500B tokens. They measured the loss for every combination and found the optimal relationship between model size and training data.

Their conclusion: as your compute budget grows, you should scale model size AND training data in equal proportion. Specifically, the optimal ratio works out to approximately 20 tokens of training data per parameter.
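One way to see why "equal" scaling falls out of the 20:1 rule is to combine it with the common approximation that training costs about 6 FLOPs per parameter per token. That approximation, and the symbols $C$ (training FLOPs), $N$ (parameters), and $D$ (training tokens), are assumptions added here for illustration:

$$
C \approx 6ND, \qquad D \approx 20N \;\Longrightarrow\; C \approx 120N^2, \qquad N_{\text{opt}} \approx \sqrt{C/120}, \qquad D_{\text{opt}} \approx 20\sqrt{C/120}
$$

Under these assumptions both $N_{\text{opt}}$ and $D_{\text{opt}}$ grow as $\sqrt{C}$, so quadrupling the compute budget doubles the model size and doubles the training data.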

That ratio changes everything. Look at how the major models stack up:

| Model | Parameters | Training tokens | Tokens/param ratio | Chinchilla-optimal tokens | Under/overtrained? |
|-------|-----------|----------------|--------------------|--------------------------|--------------------|
| GPT-3 | 175B | 300B | 1.7 | 3,500B | 11.7x undertrained |
| Gopher | 280B | 300B | 1.1 | 5,600B | 18.7x undertrained |
| PaLM | 540B | 780B | 1.4 | 10,800B | 13.8x undertrained |
| Megatron-Turing NLG | 530B | 339B | 0.6 | 10,600B | 31.3x undertrained |
| Chinchilla | 70B | 1.4T | 20.0 | 1,400B | Optimal |
| BLOOM | 176B | 366B | 2.1 | 3,520B | 9.6x undertrained |

Read the "under/overtrained" column. Gopher, DeepMind's own 280B model released just months before Chinchilla, was trained on 18.7x less data than it should have been. Megatron-Turing NLG is 31.3x undertrained.

And GPT-3, the model that started the scaling era, used 300 billion training tokens when the optimal amount was 3.5 trillion. That's 11.7x less data than it needed.

Wait. I need to double-check my math.

Hold on. Let me recompute the GPT-3 number because it's so striking.

GPT-3: 175 billion parameters x 20 tokens per parameter = 3,500 billion optimal tokens. It was trained on 300 billion tokens. 3,500 / 300 = 11.67.

Yeah. 11.7x undertrained. The math is right and it's wild.
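If you want to rerun that check for every model rather than just GPT-3, here is a minimal sketch. It assumes nothing beyond the 20 tokens-per-parameter rule of thumb and the parameter and token counts quoted in this post; the function and variable names are my own.

```python
# Rule-of-thumb ratio from the post: ~20 training tokens per parameter.
TOKENS_PER_PARAM = 20

# (parameters, training tokens) as quoted in this post.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "PaLM": (540e9, 780e9),
    "Megatron-Turing NLG": (530e9, 339e9),
    "Chinchilla": (70e9, 1.4e12),
    "BLOOM": (176e9, 366e9),
}


def audit(params, tokens):
    """Return (tokens per parameter, optimal tokens, shortfall factor)."""
    optimal = TOKENS_PER_PARAM * params
    return tokens / params, optimal, optimal / tokens


for name, (params, tokens) in MODELS.items():
    ratio, optimal, shortfall = audit(params, tokens)
    print(f"{name:<20} {ratio:5.1f} tok/param  "
          f"optimal {optimal / 1e9:8,.0f}B  shortfall {shortfall:5.1f}x")
```

A shortfall of 1.0x means the model sits right on the 20:1 line; anything above 1.0x is the factor by which its training data falls short.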

The proof: Chinchilla vs. Gopher

The strongest evidence is the head-to-head comparison. Chinchilla (70B parameters, 1.4T tokens) vs. Gopher (280B parameters, 300B tokens). Chinchilla is 4x smaller. It should lose, right?

| Benchmark | Chinchilla (70B) | Gopher (280B) | Winner |
|-----------|-----------------|---------------|--------|
| MMLU (5-shot) | 67.6% | 60.0% | Chinchilla (+7.6) |
| HellaSwag | 80.8% | 79.2% | Chinchilla (+1.6) |
| LAMBADA | 77.4% | 74.5% | Chinchilla (+2.9) |
| WinoGrande | 74.9% | 70.1% | Chinchilla (+4.8) |
| BoolQ | 83.7% | 79.3% | Chinchilla (+4.4) |
| TriviaQA | 72.3% | 65.0% | Chinchilla (+7.3) |
| ARC-Easy | 80.0% | 76.0% | Chinchilla (+4.0) |

Chinchilla wins every single benchmark. A model with 4x fewer parameters, trained on roughly 4.7x more data (1.4T vs. 300B tokens), beats the bigger model across the board.

The MMLU gap is the most striking: 67.6% vs 60.0%. A 7.6 percentage point improvement from a model one-quarter the size. That's not a marginal gain. That's a different class of performance.
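To sanity-check the gaps rather than eyeball them, here is a small sketch that recomputes the percentage-point differences from the scores quoted in this post. The dictionary layout and function name are just my choices for illustration.

```python
# (Chinchilla %, Gopher %) as quoted in this post.
SCORES = {
    "MMLU (5-shot)": (67.6, 60.0),
    "HellaSwag": (80.8, 79.2),
    "LAMBADA": (77.4, 74.5),
    "WinoGrande": (74.9, 70.1),
    "BoolQ": (83.7, 79.3),
    "TriviaQA": (72.3, 65.0),
    "ARC-Easy": (80.0, 76.0),
}


def gaps(scores):
    """Percentage-point gap (Chinchilla minus Gopher) per benchmark."""
    return {name: round(c - g, 1) for name, (c, g) in scores.items()}


deltas = gaps(SCORES)
mean_gap = sum(deltas.values()) / len(deltas)

for name, delta in deltas.items():
    print(f"{name:<14} {delta:+.1f} pts")
print(f"mean gap: {mean_gap:+.1f} pts over {len(deltas)} benchmarks")
```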

The cost implications

This is where it gets really interesting for anyone paying GPU bills.

Under the old scaling laws, if you wanted a better model, you needed more parameters, which meant more GPUs for both training and inference. A 540B model like PaLM has to be sharded across many accelerators just to serve.

Under Chinchilla's scaling laws, a compute-optimal model for the same budget would be smaller (fewer parameters) but trained on more data (longer training time, not more hardware). And smaller models are cheaper to serve.

Let me illustrate:

| Approach | Model size | Training data | Training cost (est.) | Inference cost per query |
|----------|-----------|--------------|---------------------|------------------------|
| Old scaling (Gopher-style) | 280B | 300B tokens | ~$6M | High (multi-GPU) |
| Chinchilla-optimal (same compute) | 70B | 1.4T tokens | ~$6M | 4x lower (fewer GPUs) |
| Performance | | | | Chinchilla wins |

Same training budget. Better performance. Lower inference cost. It's a strictly better allocation of resources.
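You can check the "same budget" claim in FLOPs instead of dollars. The sketch below assumes the common approximation of about 6 training FLOPs per parameter per token, which is not a figure from this post, and uses the 20:1 ratio to split a budget. The function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20  # ratio quoted in the post


def training_flops(params, tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens


def compute_optimal(flops):
    """Split a FLOP budget into (params, tokens) with tokens = 20 * params."""
    params = math.sqrt(flops / (6 * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


gopher = training_flops(280e9, 300e9)
chinchilla = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")

# What a 20:1 split of Gopher's budget would have looked like.
params, tokens = compute_optimal(gopher)
print(f"20:1 split of Gopher's budget: "
      f"{params / 1e9:.0f}B params, {tokens / 1e12:.2f}T tokens")
```

With the round numbers used in this post, the two budgets land within about 20% of each other under this approximation, which is close enough to treat them as the same order of spend.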

For startups and smaller research labs, this is especially significant. You don't need to build a 175B model to compete. You need to build a 40B model and train it on a LOT of data. The hardware requirements for serving a 40B model are within reach of a single high-end server. Serving 175B requires a cluster.
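To put rough numbers on the serving gap, here is a back-of-the-envelope sketch. It assumes 16-bit weights (2 bytes per parameter) and counts only the weights, so treat it as a lower bound rather than a deployment plan. The helper name is hypothetical.

```python
def weight_memory_gb(params, bytes_per_param=2):
    """Memory for the weights alone, in GB (2 bytes/param = 16-bit weights)."""
    return params * bytes_per_param / 1e9


for label, params in [("40B", 40e9), ("175B", 175e9)]:
    print(f"{label:>5}: ~{weight_memory_gb(params):.0f} GB of weights "
          f"at 16-bit precision")
```

These figures cover the weights only. The KV cache, activations, and batching headroom all add to them in practice.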

What this means for the parameter race

I think the parameter race is over. Or at least, it should be.

The labs will keep releasing large models (PaLM is 540B), but the competitive advantage is shifting from "who can build the biggest model" to "who can assemble the best training data." Data quality, data diversity, and data volume become the differentiators.

Epoch AI has been tracking compute trends, and the same shift shows up there. With their latest models, DeepMind and Google are investing more in data curation and less in raw parameter count.

The implications for open source are also interesting. Projects like LAION-5B (an open image-text dataset) show that the open source community can assemble web-scale training data. If the Chinchilla scaling laws hold, a well-funded open source project training a 30-40B model on high-quality curated data could produce results competitive with much larger closed models.

Where I think Chinchilla might be wrong

I have one disagreement with the paper, and I want to be upfront about it.

Chinchilla's analysis optimizes training compute on its own; the cost of serving the model is not part of the objective. But in production, you train once and serve millions of times. A slightly larger model that's cheaper to train (because you used less data) but performs the same might actually be more economical if your inference volume is low.

The 20:1 token-to-parameter ratio is optimal for a single training run. It's not necessarily optimal for total cost of ownership including deployment. The paper acknowledges this but doesn't deeply explore it.
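Here is a rough way to put a number on that trade-off. The sketch assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per token served; both are common approximations, not figures from the paper or this post. It compares cost only and says nothing about quality, so it is a sketch of the accounting, not a recommendation.

```python
def training_flops(params, train_tokens):
    """Assumed ~6 FLOPs per parameter per training token."""
    return 6 * params * train_tokens


def inference_flops(params, tokens_served):
    """Assumed ~2 FLOPs per parameter per token served."""
    return 2 * params * tokens_served


def lifetime_flops(params, train_tokens, tokens_served):
    """Training plus inference compute over the model's deployed life."""
    return training_flops(params, train_tokens) + inference_flops(params, tokens_served)


def break_even_tokens(big, small):
    """Tokens served at which two (params, train_tokens) configs cost the same.

    A negative result means the smaller model is cheaper at any volume.
    """
    (n_big, d_big), (n_small, d_small) = big, small
    return 3 * (n_small * d_small - n_big * d_big) / (n_big - n_small)


gopher_style = (280e9, 300e9)
chinchilla_style = (70e9, 1.4e12)
crossover = break_even_tokens(gopher_style, chinchilla_style)
print(f"break-even at ~{crossover / 1e9:.0f}B tokens served")
```

Below the break-even volume the larger, less-trained configuration uses fewer total FLOPs; above it, the smaller model's cheaper inference wins out and keeps winning.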

For most organizations, though, the insight holds. Train smaller models on more data. The benchmarks prove it works.

The bottom line

DeepMind published a paper that says its own recently released model (Gopher) was built wrong. That takes scientific honesty. It also says GPT-3, PaLM, and basically every other large language model were built wrong.

The fix is counterintuitive: use fewer parameters and more data. But the data in the paper is about as clear as empirical evidence gets.

I'm updating all my model tracking spreadsheets to include the tokens-per-parameter ratio as a standard column. It might be the single most important number for predicting model quality going forward.


-- dataku