Data Stories · June 2, 2025 · 4 min read

AI model sizes are SHRINKING. Here's the data.

The biggest model released in 2025 so far has fewer parameters than GPT-4. Efficiency gains from MoE, distillation, and better training data mean the era of "bigger is better" is fading. I charted the trend.

I noticed something strange while updating my model tracking spreadsheet.

The biggest model released in 2025 (so far) has fewer total parameters than GPT-4 had in 2023.

That's not an accident. It's a trend.

Peak parameter count by year

| Year | Largest model released | Total params | Active params |
|------|------------------------|--------------|---------------|
| 2020 | GPT-3 | 175B | 175B |
| 2022 | PaLM | 540B | 540B |
| 2023 | GPT-4 (estimated) | 1.8T (rumored MoE) | ~280B (estimated) |
| 2024 | Llama 3.1 405B | 405B | 405B |
| 2024 | DeepSeek V3 | 671B | 37B |
| 2025 | Llama 4 Maverick | 400B | 17B |
| 2025 | Qwen3 235B | 235B | 22B |

Sources: Model papers, Epoch AI training compute database, Meta AI, DeepSeek, Alibaba Qwen, industry estimates for GPT-4.

Look at the "active params" column. It peaked with PaLM's 540B in 2022. GPT-4 is estimated at ~280B active, and Llama 3.1 405B is the last fully dense entry in the table. The MoE models released since then use 17-37B active parameters per token.

Llama 4 Maverick has 400B total parameters but only activates 17B per token. That's a 96% "idle rate." Most of the model sits dormant for any given input, waiting to be called up by the routing layer.
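The "idle rate" is just one minus the ratio of active to total parameters. A minimal sketch of that arithmetic for the MoE rows in the table above (the function name `idle_rate` is made up for illustration):

```python
# (total, active) parameter counts in billions, copied from the table above.
MOE_MODELS = {
    "DeepSeek V3": (671, 37),
    "Llama 4 Maverick": (400, 17),
    "Qwen3 235B": (235, 22),
}


def idle_rate(total_b: float, active_b: float) -> float:
    """Fraction of parameters that are NOT used for a given token."""
    return 1.0 - active_b / total_b


for name, (total_b, active_b) in MOE_MODELS.items():
    print(f"{name}: {idle_rate(total_b, active_b):.1%} idle per token")
```

For these three rows the idle rate comes out at roughly 94%, 96%, and 91%.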

Why models are shrinking

Three forces are pushing model sizes down:

| Factor | How it reduces size | Impact |
|--------|---------------------|--------|
| Mixture of Experts (MoE) | Only activate a subset of parameters per token | 5-20x fewer active params |
| Better training data | Higher quality data = fewer parameters needed to learn | 2-3x efficiency gain |
| Distillation | Compress large model knowledge into smaller models | 5-10x size reduction |

Sources: DeepSeek V3/R1 papers, Google DeepMind Gemini reports, Chinchilla scaling analysis.

The Chinchilla insight from 2022 was the first warning sign: most models were undertrained on too little data. Since then, the industry has shifted toward more data, better data, and smarter architectures rather than just adding parameters.
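To make "undertrained" concrete: the Chinchilla result is often summarized as a rule of thumb of about 20 training tokens per parameter. The sketch below applies that heuristic to GPT-3. Both the 20:1 ratio and GPT-3's training set of about 300B tokens (the figure reported in the GPT-3 paper) are assumptions brought in for illustration. Neither comes from the tables in this post.

```python
# Assumed rule of thumb, a common simplification of the Chinchilla result.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens_b(params_b: float) -> float:
    """Training tokens (in billions) the heuristic suggests for a model size."""
    return params_b * TOKENS_PER_PARAM


gpt3_params_b = 175   # parameters, in billions
gpt3_tokens_b = 300   # training tokens, in billions (assumed, see lead-in)

actual_ratio = gpt3_tokens_b / gpt3_params_b
shortfall = compute_optimal_tokens_b(gpt3_params_b) / gpt3_tokens_b

print(f"GPT-3 saw about {actual_ratio:.1f} tokens per parameter")
print(f"The heuristic suggests about {shortfall:.1f}x more data")
```

Under these assumptions GPT-3 got fewer than 2 tokens per parameter, an order of magnitude short of the heuristic.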

Active parameters vs benchmark quality

| Model | Active params | MMLU | Year |
|-------|---------------|------|------|
| GPT-3 | 175B | 43.9% | 2020 |
| PaLM | 540B | 69.3% | 2022 |
| GPT-4 (est.) | ~280B | 86.4% | 2023 |
| DeepSeek V3 | 37B | 87.1% | 2024 |
| Llama 4 Maverick | 17B | 85.5% | 2025 |
| Qwen3 235B | 22B | 88.4% | 2025 |

Qwen3 with 22B active parameters scores 88.4% on MMLU. GPT-4 with an estimated 280B active scores 86.4%. Fewer active parameters, higher benchmark score.
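Here is that comparison as a quick calculation, taking the table's figures at face value (the GPT-4 active-parameter number is an estimate, and MMLU scores from different reports are not always measured the same way). The helper `compare` is a made-up name for this sketch.

```python
# (model, active params in billions, MMLU %) copied from the table above.
ROWS = [
    ("GPT-3", 175, 43.9),
    ("PaLM", 540, 69.3),
    ("GPT-4 (est.)", 280, 86.4),
    ("DeepSeek V3", 37, 87.1),
    ("Llama 4 Maverick", 17, 85.5),
    ("Qwen3 235B", 22, 88.4),
]


def compare(older: str, newer: str, rows=ROWS):
    """Return (active-param ratio older/newer, MMLU points newer - older)."""
    by_name = {name: (active, mmlu) for name, active, mmlu in rows}
    old_active, old_mmlu = by_name[older]
    new_active, new_mmlu = by_name[newer]
    return old_active / new_active, new_mmlu - old_mmlu


ratio, delta = compare("GPT-4 (est.)", "Qwen3 235B")
print(f"{ratio:.1f}x fewer active params, {delta:+.1f} MMLU points")
```

On these numbers Qwen3 uses about 12.7x fewer active parameters than the GPT-4 estimate while scoring 2 points higher.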

The "bigger = better" era is over. The "smarter = better" era has replaced it.

What this means for inference costs

Smaller active parameter counts directly translate to cheaper, faster inference:

| Active params | Approximate tokens/sec (A100) | Relative cost |
|---------------|-------------------------------|---------------|
| 175B (dense) | ~15 | 12x |
| 70B (dense) | ~35 | 5x |
| 37B (MoE from 671B) | ~55 | 3x |
| 17B (MoE from 400B) | ~110 | 1.5x |
| 7B (dense, small) | ~180 | 1x |
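One way to read the "Relative cost" column: if the hardware costs the same per second regardless of model, cost per token is inversely proportional to throughput. The sketch below normalizes to the 7B row under that assumption. It is an interpretation that roughly reproduces the column, not a stated methodology.

```python
# Approximate tokens/sec from the table above.
THROUGHPUT = {
    "175B (dense)": 15,
    "70B (dense)": 35,
    "37B (MoE from 671B)": 55,
    "17B (MoE from 400B)": 110,
    "7B (dense, small)": 180,
}


def relative_cost(tokens_per_sec: float, baseline: float = 180) -> float:
    """Cost per token relative to the baseline, assuming a fixed hourly price."""
    return baseline / tokens_per_sec


for label, tps in THROUGHPUT.items():
    print(f"{label}: {relative_cost(tps):.2f}x")
```

The results (12.00x, 5.14x, 3.27x, 1.64x, 1.00x) land close to the rounded values in the table.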

The shift to MoE with small active parameter counts means frontier-quality inference at mid-tier prices. Llama 4 Maverick gives you frontier-adjacent quality at the inference cost of a 17B model.

My prediction

| Prediction | Timeline |
|------------|----------|
| No model with >500B active params will be SOTA | Already true |
| Default active params for frontier models: 15-40B | 2025-2026 |
| 1B active parameter models matching GPT-3.5 quality | Late 2025 |
| "Parameter count" stops being a useful comparison metric | Happening now |

The number everyone should care about isn't "how big is the model?" It's "how many active parameters does it use per token?" That drives speed and per-token compute cost. (Total parameters still set how much memory the weights need.)
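A rough sketch of how the two numbers split the bill, under common approximations that are not from this post: a forward pass costs about 2 FLOPs per active parameter per token, and weights stored at 16-bit precision take 2 bytes per parameter. Real deployments vary with quantization, KV cache size, and batching.

```python
def forward_flops_per_token(active_params_b: float) -> float:
    """Approximate forward-pass FLOPs per token (about 2 per active param)."""
    return 2 * active_params_b * 1e9


def weight_memory_gb(total_params_b: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1B params at 1 byte = 1 GB)."""
    return total_params_b * bytes_per_param


# Llama 4 Maverick: 400B total, 17B active (from the table earlier).
print(f"Compute: {forward_flops_per_token(17) / 1e9:.0f} GFLOPs per token")
print(f"Weights: {weight_memory_gb(400):.0f} GB at 16-bit")

# A hypothetical dense 17B model, for comparison.
print(f"Dense 17B weights: {weight_memory_gb(17):.0f} GB at 16-bit")
```

Under these assumptions, per-token compute matches a dense 17B model, but the weights still need about 800 GB of memory instead of 34 GB.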

My model tracking spreadsheet has a new default sort: active parameters, ascending. The most interesting models are the smallest ones that punch above their weight.



-- dataku
