2025 in AI data: the year quality beat scale

2025 was the year the AI industry's fundamental assumption changed.

For five years, the story was "bigger is better." More parameters. More data. More compute. Scaling laws predicted that pouring in more resources would always produce better models.

2025 proved that thesis partially wrong. Quality of training, not quantity of compute, became the differentiator.

Let me walk you through the 30 data points that tell this year's story.

The big themes

Theme 1: Model sizes stopped growing

| Year | Largest active params (frontier model) |
|------|--------------------------------------|
| 2022 | 540B (PaLM) |
| 2023 | ~280B (GPT-4, estimated) |
| 2024 | 405B (Llama 3.1 405B) |
| 2025 | ~100B active (Llama 4 405B MoE) |

Sources: Model papers, Epoch AI.

The largest active parameter count at the frontier decreased in 2025. Mixture-of-experts (MoE) architectures let a model store broad knowledge (400B total parameters) while activating only a fraction of them (17-100B) for each token it processes.
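To make the active-versus-total distinction concrete, here is a minimal sketch that computes what share of the parameters is in use at a time. The function name `active_fraction` is my own, and the 17B and 100B endpoints are just the range quoted above, not exact published specs.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of a model's parameters that are active at once (0.0 to 1.0)."""
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


TOTAL_B = 400  # total parameters in billions, as quoted in the text

# Endpoints of the active-parameter range quoted in the text (assumed, not exact specs).
for active_b in (17, 100):
    share = active_fraction(active_b, TOTAL_B)
    print(f"{active_b}B active of {TOTAL_B}B total is {share:.1%} of the model")
```

Under those assumptions, somewhere between roughly 4% and 25% of the model does the work at any given moment.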

Theme 2: Training costs dropped dramatically

| Model | Estimated training cost | Year |
|-------|------------------------|------|
| GPT-4 | $100M+ | 2023 |
| Llama 3.1 405B | $30-50M | 2024 |
| DeepSeek V3 | $5.6M | 2024 |
| DeepSeek R1 | ~$3M (estimated) | 2025 |

Sources: Meta AI, DeepSeek technical reports, industry estimates.

The estimated cost of training a frontier-competitive model fell from $100M+ to under $5M, a reduction of at least 20x in two years. This opened the door for smaller labs and non-US organizations.
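The reduction factor is simple division, but it is worth seeing against the table's actual entries. This is a quick sketch that treats GPT-4's "$100M+" as exactly $100M (my assumption, which makes every ratio a lower bound) and uses only the rows that give a single point estimate.

```python
# Estimated training costs in millions of USD, taken from the table above.
# Assumption: GPT-4's "$100M+" is treated as exactly 100, so ratios are lower bounds.
BASELINE_MUSD = 100.0
later_costs_musd = {
    "DeepSeek V3 (2024)": 5.6,
    "DeepSeek R1 (2025, estimated)": 3.0,
}


def fold_reduction(baseline: float, new_cost: float) -> float:
    """How many times cheaper new_cost is than baseline."""
    if new_cost <= 0:
        raise ValueError("new_cost must be positive")
    return baseline / new_cost


for model, cost in later_costs_musd.items():
    ratio = fold_reduction(BASELINE_MUSD, cost)
    print(f"{model}: about {ratio:.0f}x cheaper than GPT-4")
```

That works out to roughly 18x for DeepSeek V3 and 33x for DeepSeek R1, which brackets the 20x figure.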

Theme 3: Open source reached parity

| Metric | Open source best (mid-2025) | Closed source best |
|--------|---------------------------|-------------------|
| Chatbot Arena Elo | 1258 (DeepSeek V3) | 1288 (Claude Opus 4) |
| MATH benchmark | 97.3% (DeepSeek R1) | 96.8% (Claude Opus 4) |
| SWE-bench Verified | 49.2% (DeepSeek R1) | 58.7% (Claude Opus 4) |
| Cost per M tokens | $0.27 (DeepSeek V3) | $3.00 (Claude 4 Sonnet) |

Sources: LMSYS Chatbot Arena, benchmark papers, pricing pages.

On MATH, DeepSeek R1 (open) edges out Claude Opus 4 (closed), 97.3% to 96.8%. On Chatbot Arena, the gap is 30 Elo points: close, but closed source still leads overall. On cost, open source is roughly 11x cheaper ($0.27 vs. $3.00 per M tokens).
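How big is a 30-point Elo gap in practice? A rough way to read it is to convert the gap into an expected head-to-head win rate. The sketch below uses the classic Elo formula (base 10, 400-point scale) and assumes the Arena scores can be read that way. The function name is mine, and the result is an approximation, not an official Arena statistic.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings from the table above.
CLOSED_BEST = 1288  # Claude Opus 4
OPEN_BEST = 1258    # DeepSeek V3

p_closed = elo_expected_score(CLOSED_BEST, OPEN_BEST)
print(f"Expected win rate of the higher-rated model: {p_closed:.1%}")

# Price ratio from the same table (USD per M tokens).
print(f"Price ratio: {3.00 / 0.27:.1f}x")
```

Under that assumption, the higher-rated model would be expected to win about 54% of head-to-head votes, only slightly better than a coin flip.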

Theme 4: Reasoning models changed everything

| Metric | Before reasoning models (early 2024) | After (late 2025) |
|--------|--------------------------------------|-------------------|
| MATH accuracy | 76% (GPT-4) | 97.3% (DeepSeek R1) |
| AIME accuracy | ~35% (GPT-4) | 79.8% (DeepSeek R1) |
| Cost for hard math | $0.50/problem | $0.02/problem |

Sources: OpenAI, DeepSeek, Anthropic, Stanford HAI.

Reasoning models (o1/o3, R1, Claude thinking) proved that inference-time compute is a new scaling axis. Instead of making the model bigger, you let it think longer. The quality gains on math and coding were dramatic.

Theme 5: The inference cost collapse continued

| Tier (best available price, output per M tokens) | Jan 2025 | Dec 2025 |
|--------------------------------------------------|----------|----------|
| Frontier quality | $15.00 | $10.00 |
| Near-frontier quality | $1.10 | $0.60 |
| Good-enough quality | $0.40 | $0.30 |

Sources: Provider pricing pages, Artificial Analysis.

The cheapest "good enough" model went from $0.40/M to $0.30/M, and the near-frontier tier went from $1.10/M to $0.60/M. Prices kept compressing, but the rate of decline is slowing.
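For anyone who wants the declines as percentages, here is a small sketch over the three tiers in the table. The dictionary layout and the `pct_decline` helper are my own; the prices are the January and December figures above.

```python
# (Jan 2025, Dec 2025) best available output price in USD per M tokens, from the table.
tier_prices = {
    "Frontier quality": (15.00, 10.00),
    "Near-frontier quality": (1.10, 0.60),
    "Good-enough quality": (0.40, 0.30),
}


def pct_decline(start: float, end: float) -> float:
    """Percentage drop from start to end (positive means the price fell)."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (start - end) / start * 100.0


for tier, (jan, dec) in tier_prices.items():
    print(f"{tier}: ${jan:.2f} -> ${dec:.2f} ({pct_decline(jan, dec):.1f}% lower)")
```

On these numbers the year-over-year drops are about a third for the frontier tier, a little under half for near-frontier, and a quarter for the good-enough tier.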

The year in numbers

| Statistic | Value |
|-----------|-------|
| Models released on Hugging Face | ~4,200 (notable: ~300) |
| Total Chatbot Arena votes (2025) | ~2M |
| Lowest price per M output tokens (good-enough tier) | $0.30 |
| Highest SWE-bench Verified score | 58.7% (Claude Opus 4) |
| Countries with frontier-competitive models | 4 (US, China, France, UAE) |
| AI API providers monitored in my data | 25 |
| GPU resale price change (H100) | -40% |
| Estimated global AI inference electricity | ~20 TWh |

Sources: Hugging Face, LMSYS, Epoch AI, Stanford HAI, my tracking data.

My biggest hits and misses for 2025

| Prediction | Result |
|-----------|--------|
| "Open source will match GPT-4 by mid-2025" | Hit (happened by Q1) |
| "API prices will fall 50%" | Hit (fell 50-75%) |
| "Reasoning models will plateau after o1" | Miss (massive improvement) |
| "NVIDIA will face real competition" | Partial (competition grew, NVIDIA still leads) |
| "Peak model releases in 2025" | Hit (Q4 2024 was peak) |

Looking ahead to 2026

I'm most curious about:

| Question | Why it matters |
|---------|---------------|
| Does GPT-5 reset the frontier? | OpenAI has been quiet about their next general model |
| How far can reasoning models go? | Are there diminishing returns on thinking time? |
| Does AI hardware competition accelerate? | AMD and Intel closed the gap in 2025 |
| What happens when inference is essentially free? | Pricing floor approaching for commodity models |

2025 was the year I stopped asking "which model is biggest?" and started asking "which model is smartest per dollar?" The data made me change the question.

Quality beat scale. Efficiency beat brute force. That's the story of 2025.

-- dataku