Google's PaLM has 540 billion parameters. Let me put that number in context.
Every time a new model drops, the parameter count gets bigger and the context gets lost. I made a chart showing every major model's parameter count since 2018. PaLM is... a lot.
Google just dropped PaLM. 540 billion parameters.
Every time one of these announcements hits, my timeline fills up with people saying "wow, big number." And then the next big number comes along and the previous one is forgotten. The scale gets lost.
So I made a table. Every major language model since 2018, with parameter counts, so we can actually see what 540 billion means in context.
The parameter timeline
| Model | Organization | Date | Parameters | Relative to BERT |
|-------|--------------|------|------------|------------------|
| BERT-Large | Google | Oct 2018 | 340M | 1x |
| GPT-2 | OpenAI | Feb 2019 | 1.5B | 4.4x |
| Megatron-LM | NVIDIA | Sep 2019 | 8.3B | 24x |
| T5-11B | Google | Oct 2019 | 11B | 32x |
| GPT-3 | OpenAI | Jun 2020 | 175B | 515x |
| GShard | Google | Jan 2021 | 600B* | 1,765x |
| Switch Transformer | Google | Jan 2021 | 1.6T* | 4,706x |
| Megatron-Turing NLG | NVIDIA/Microsoft | Oct 2021 | 530B | 1,559x |
| Gopher | DeepMind | Dec 2021 | 280B | 824x |
| Chinchilla | DeepMind | Mar 2022 | 70B | 206x |
| PaLM | Google | Apr 2022 | 540B | 1,588x |
*GShard and Switch Transformer are mixture-of-experts models. Their active parameter count per input is much smaller.
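The "Relative to BERT" column is just each parameter count divided by BERT-Large's 340M. A minimal sketch of that arithmetic, using the counts from the table (the dictionary and function names are made up for illustration):

```python
# Parameter counts copied from the table above.
# Mixture-of-experts models are listed with their total (not active) parameters.
PARAMS = {
    "BERT-Large": 340e6,
    "GPT-2": 1.5e9,
    "Megatron-LM": 8.3e9,
    "T5-11B": 11e9,
    "GPT-3": 175e9,
    "GShard": 600e9,
    "Switch Transformer": 1.6e12,
    "Megatron-Turing NLG": 530e9,
    "Gopher": 280e9,
    "Chinchilla": 70e9,
    "PaLM": 540e9,
}


def relative_to_baseline(params, baseline="BERT-Large"):
    """Return each model's parameter count as a multiple of the baseline's."""
    base = params[baseline]
    return {name: count / base for name, count in params.items()}


ratios = relative_to_baseline(PARAMS)
for name, ratio in ratios.items():
    print(f"{name:<22} {ratio:>8,.1f}x")
```

Swapping the `baseline` argument (say, to `"GPT-3"`) gives the same comparison anchored on a different model.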
BERT to PaLM in three and a half years. A 1,588x increase in parameter count. For comparison, Moore's law doubles transistor counts roughly every two years, which over the same stretch would be a bit over 3x.
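To turn that jump into a yearly rate, take the 3.5-th root of 1,588. A small sketch, assuming smooth exponential growth between the two endpoints (which the table shows is only loosely true); the function names are illustrative:

```python
import math


def annual_growth_factor(total_factor, years):
    """Constant yearly multiplier that compounds to total_factor over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_months(total_factor, years):
    """Months per doubling under the same constant-growth assumption."""
    return years * 12.0 / math.log2(total_factor)


growth = annual_growth_factor(1588, 3.5)
doubling = doubling_time_months(1588, 3.5)
print(f"~{growth:.1f}x per year, doubling every ~{doubling:.1f} months")
```

That works out to roughly 8x per year, or a doubling about every four months.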
But wait. Chinchilla.
Here's where the plot gets interesting. Look at Chinchilla sitting there at 70 billion parameters, released just before PaLM (March versus April 2022).
DeepMind's Chinchilla paper argues that most large models are undertrained. The optimal ratio, according to their research, is roughly 20 tokens of training data per parameter. By that math, a 70B model trained on 1.4 trillion tokens should outperform a 280B model trained on 300 billion tokens, even though the two take roughly the same compute to train.
And it does. Chinchilla (70B) beats Gopher (280B) on most benchmarks despite being 4x smaller.
This makes the PaLM number complicated. Is 540B the right way to build a frontier model? Or would a smaller model trained on more data perform just as well?
| Model | Parameters | Training tokens | Tokens per parameter |
|-------|------------|-----------------|----------------------|
| GPT-3 | 175B | 300B | 1.7 |
| Gopher | 280B | 300B | 1.1 |
| Chinchilla | 70B | 1.4T | 20.0 |
| PaLM | 540B | 780B | 1.4 |
PaLM's tokens-per-parameter ratio is 1.4. Chinchilla's is 20.0. If DeepMind is right about the scaling laws, PaLM is significantly undertrained for its size: at 20 tokens per parameter, a 540B model would call for about 10.8 trillion training tokens, roughly 14 times the 780 billion PaLM saw. I haven't seen Google Research directly address this comparison.
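The ratio column, and the token budget each model would need to hit the Chinchilla ratio, can be reproduced in a few lines. This is a sketch using the figures from the table; the flat 20-tokens-per-parameter target is the rough rule quoted above, not a precise fit from the paper:

```python
# Rough Chinchilla rule of thumb quoted in the post (an approximation).
TARGET_TOKENS_PER_PARAM = 20

# (parameters, training tokens), copied from the table above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
}


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def target_tokens(params, ratio=TARGET_TOKENS_PER_PARAM):
    """Training tokens needed to reach the given tokens-per-parameter ratio."""
    return params * ratio


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    target = target_tokens(params)
    print(f"{name:<11} {ratio:5.1f} tokens/param; 20x target = {target / 1e12:.1f}T tokens")
```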
What PaLM does well
I don't want to be unfair here. PaLM's technical report shows strong results across multiple benchmarks, and it introduced some genuinely interesting capabilities:
| Benchmark | PaLM 540B | Chinchilla 70B | GPT-3 175B |
|-----------|-----------|----------------|------------|
| MMLU (5-shot) | 69.3% | 67.6% | 43.9% |
| BIG-Bench (avg) | 65.8% | ~58% (est.) | 48.2% |
| TriviaQA | 81.4% | 72.3% | 64.3% |
| Code generation (HumanEval) | 26.2% | N/A | 0% (not designed for code) |
PaLM beats Chinchilla on every benchmark here where both have a score. So raw parameter count still matters even if the training efficiency could be better.
The chain-of-thought reasoning results are what caught my attention most. PaLM can solve multi-step math problems that smaller models consistently fail at. Google reports 58.1% accuracy on the GSM8K math benchmark, versus ~43% for Chinchilla.
The training cost question
Google trained PaLM on 6,144 TPU v4 chips, split across two TPU v4 pods. Based on public TPU v4 pricing and the disclosed training time, here's my rough estimate:
| Component | Estimate |
|-----------|----------|
| Hardware | 6,144 TPU v4 chips |
| Training duration | ~2 months |
| Estimated compute cost | $8-12M |
| Total with infrastructure | $10-15M (est.) |
Compare that to earlier models: GPT-3 was estimated at $4.6M, Gopher at $6-8M. PaLM is pushing into $10M+ territory, and Google gets to run on hardware it already owns. An external organization would have to rent equivalent compute, assuming it could reserve 6,144 chips at all.
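One way to sanity-check the PaLM estimate is to back out the per-chip-hour rate it implies: divide the $8-12M compute figure by total chip-hours. The sketch below treats "~2 months" as 60 days and assumes all 6,144 chips ran the whole time. Both are simplifying assumptions, not disclosed figures.

```python
CHIPS = 6144                     # TPU v4 chips, from the table above
TRAINING_DAYS = 60               # assumption: "~2 months" read as 60 days
COST_RANGE_USD = (8e6, 12e6)     # estimated compute cost, from the table above


def chip_hours(chips, days):
    """Total chip-hours if every chip runs around the clock for `days`."""
    return chips * days * 24


def implied_rate(cost_usd, chips, days):
    """Dollars per chip-hour implied by a total cost estimate."""
    return cost_usd / chip_hours(chips, days)


total_chip_hours = chip_hours(CHIPS, TRAINING_DAYS)
low, high = (implied_rate(cost, CHIPS, TRAINING_DAYS) for cost in COST_RANGE_USD)
print(f"{total_chip_hours:,} chip-hours")
print(f"implied rate: ${low:.2f} to ${high:.2f} per chip-hour")
```

Changing `TRAINING_DAYS` or the cost range shows how sensitive the headline dollar figure is to either input.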
The number of organizations that can afford a $10M+ training run is very small. This isn't a criticism. It's just data. According to Epoch AI's compute tracking, the cost of training frontier models has been increasing roughly 4-5x per year since 2018.
What the parameter race actually tells us
I've been tracking these numbers for over a year now, and here's my honest read: the raw parameter count is becoming a less useful metric with every new paper.
Chinchilla at 70B outperforms Gopher at 280B. A well-trained smaller model beats a poorly-trained larger one. The actual question isn't "how many parameters" but "how many parameters times how much training data, with what architecture."
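A common way to fold parameters and data into one number is the rule of thumb that training compute is about 6 FLOPs per parameter per token. That approximation comes from the scaling-law literature, not from this post's sources, and it ignores architecture entirely, so treat the sketch below as a rough comparison only:

```python
# Assumption: ~6 FLOPs per parameter per training token (forward + backward pass).
FLOPS_PER_PARAM_PER_TOKEN = 6

# (parameters, training tokens), same figures as the tokens-per-parameter table.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
}


def training_flops(params, tokens):
    """Approximate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_PER_TOKEN * params * tokens


compute = {name: training_flops(p, t) for name, (p, t) in MODELS.items()}
for name, flops in compute.items():
    print(f"{name:<11} {flops:.2e} FLOPs")
```

By this measure Chinchilla and Gopher land within about 20% of each other, which is the point: similar compute, spent very differently.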
But nobody writes headlines about tokens-per-parameter ratios. "540 BILLION PARAMETERS" gets clicks. So the parameter count will keep being the number people talk about.
I'll keep putting it in context so the number actually means something.
If you found this interesting, you might also like:
- DALL-E's first images vs what people expected: a data comparison
- GPT-3 vs GPT-J: the first real open source challenger, in data
- I counted every AI startup that raised money in Q1 2021. The numbers are strange.
- My 2021 AI data roundup: the 10 numbers that mattered most
- The cost of running an AI startup in 2022: a data breakdown
-- dataku