2024 AI data roundup: the year of commoditization

This is my fourth annual AI data roundup. Each year, the spreadsheet gets bigger and the story gets harder to summarize. 2024 was the hardest yet because so many things happened at once.

Here's my attempt: 25 numbers that tell the story of 2024.

The pricing collapse

Chart 1: API price deflation

| Date | Best frontier model (output $/M tokens) | Best budget model (output $/M tokens) |
|------|---------------------------------------|--------------------------------------|
| Jan 2024 | $20.00 (GPT-4 Turbo) | $0.60 (Mistral Small) |
| Mar 2024 | $75.00 (Claude 3 Opus) | $1.25 (Claude 3 Haiku) |
| May 2024 | $15.00 (GPT-4o) | $0.30 (Gemini 1.5 Flash) |
| Jul 2024 | $15.00 (GPT-4o) | $0.60 (GPT-4o mini) |
| Sep 2024 | $60.00 (o1-preview) | $0.08 (Llama 3.1 8B on Groq) |
| Dec 2024 | $15.00 (Claude 3.5 Sonnet) | $0.30 (Gemini 2.0 Flash) |

Sources: Provider pricing pages throughout 2024. OpenAI, Anthropic, Google, Mistral AI, Groq.

The cheapest budget option went from $0.60 in January to $0.08 per million output tokens by September, a price still on offer in December (see Chart 2). That's an 87% decline in 12 months.
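As a quick check on that figure, here is a minimal Python sketch of the percent-decline arithmetic, using the budget-tier prices from Chart 1. The function name `pct_decline` and the dictionary layout are my own illustration, not part of any tracking tool.

```python
# Budget-tier output prices ($ per million tokens), copied from Chart 1.
budget_prices = {
    "Jan 2024": 0.60,
    "Mar 2024": 1.25,
    "May 2024": 0.30,
    "Jul 2024": 0.60,
    "Sep 2024": 0.08,
    "Dec 2024": 0.30,
}


def pct_decline(start: float, end: float) -> float:
    """Return the percentage drop from start to end."""
    return (start - end) / start * 100


jan = budget_prices["Jan 2024"]
low = min(budget_prices.values())            # cheapest price seen during the year
dec_row = budget_prices["Dec 2024"]          # the model listed in the December row

print(f"Jan to yearly low: {pct_decline(jan, low):.1f}% decline")
print(f"Jan to Dec row:    {pct_decline(jan, dec_row):.1f}% decline")
```

The December row of Chart 1 lists a different "best" budget model at $0.30, so the result depends on which row you treat as the endpoint.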

Chart 2: The 100x chart

| Metric | Jan 2023 | Dec 2024 | Change |
|--------|----------|----------|--------|
| Price for GPT-4-class quality ($/M output) | $60.00 | $3.00 (hosted Llama 3.1 405B) | 20x cheaper |
| Price for GPT-3.5-class quality ($/M output) | $2.00 | $0.08 (Groq Llama 3.1 8B) | 25x cheaper |
| Best MMLU score (any model) | 86.4% (GPT-4) | 90.8% (o1-preview) | +4.4 pts |
| Best open source MMLU | 39.3% (BLOOM) | 87.3% (Llama 3.1 405B) | +48.0 pts |
| Largest context window (commercial) | 4K (GPT-4) | 1M (Gemini 1.5 Pro) | 250x larger |

Sources: Model papers, provider pricing, Hugging Face leaderboard, my tracking data.

The open source MMLU improvement (+48 points in 2 years) is the single most dramatic number in my dataset. From 39.3% (not far above the 25% random baseline for 4-choice MMLU) to 87.3% (matching GPT-4 Turbo). In 24 months.
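The Change column in Chart 2 mixes two kinds of arithmetic: ratios for prices and context length, and percentage-point differences for MMLU. A small sketch, assuming the table values as given and treating 4K and 1M as 4,000 and 1,000,000 tokens, reproduces each entry. The helper names are illustrative.

```python
# Values copied from Chart 2 as (Jan 2023, Dec 2024) pairs.
price_gpt4_class = (60.00, 3.00)      # $ per million output tokens
price_gpt35_class = (2.00, 0.08)      # $ per million output tokens
context_window = (4_000, 1_000_000)   # tokens, assuming K = 1,000 and M = 1,000,000
best_mmlu = (86.4, 90.8)              # percent
best_open_mmlu = (39.3, 87.3)         # percent


def fold_cheaper(before: float, after: float) -> float:
    """How many times cheaper the later price is."""
    return before / after


def fold_larger(before: float, after: float) -> float:
    """How many times larger the later value is."""
    return after / before


def point_gain(before: float, after: float) -> float:
    """Difference in percentage points."""
    return after - before


print(f"GPT-4-class price:   {fold_cheaper(*price_gpt4_class):.0f}x cheaper")
print(f"GPT-3.5-class price: {fold_cheaper(*price_gpt35_class):.0f}x cheaper")
print(f"Context window:      {fold_larger(*context_window):.0f}x larger")
print(f"Best MMLU:           {point_gain(*best_mmlu):+.1f} pts")
print(f"Best open MMLU:      {point_gain(*best_open_mmlu):+.1f} pts")
```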

The open source convergence

Chart 3: Elo gap over time

| Date | Best open source Elo | Best closed source Elo | Gap |
|------|---------------------|----------------------|-----|
| Jan 2024 | ~1168 (Mixtral 8x7B) | ~1260 (GPT-4 Turbo) | 92 |
| Apr 2024 | ~1234 (Llama 3 70B) | ~1285 (GPT-4o) | 51 |
| Jul 2024 | ~1258 (Llama 3.1 405B) | ~1305 (Claude 3.5 Sonnet) | 47 |
| Oct 2024 | ~1268 (Qwen 2.5 72B) | ~1320 (Claude 3.5 Sonnet new) | 52 |
| Dec 2024 | ~1272 (various) | ~1325 (Claude 3.5 Sonnet) | 53 |

Sources: LMSYS Chatbot Arena leaderboard snapshots, my monthly tracking.

The gap narrowed from 92 to about 50 in the first half of 2024, then stabilized around 50 for the rest of the year. Open source converged quickly and then hit a wall. That last 50 Elo points might be the hardest to close.
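To make the gap concrete, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. The sketch below applies it to the gaps in Chart 3. It assumes the Arena ratings behave like conventional Elo ratings (base 10, scale 400), which is an approximation on my part, and `expected_win_rate` is an illustrative name.

```python
# Rating gaps (best closed minus best open), copied from Chart 3.
elo_gaps = {
    "Jan 2024": 92,
    "Apr 2024": 51,
    "Jul 2024": 47,
    "Oct 2024": 52,
    "Dec 2024": 53,
}


def expected_win_rate(gap: float) -> float:
    """Standard Elo expected score for the higher-rated side."""
    return 1 / (1 + 10 ** (-gap / 400))


for date, gap in elo_gaps.items():
    rate = expected_win_rate(gap)
    print(f"{date}: gap {gap:>2} -> closed model expected to win {rate:.1%}")
```

Under that assumption, a 92-point gap corresponds to the closed model winning roughly 63% of head-to-head votes, and a 50-point gap to roughly 57%.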

Chart 4: Open source model quality by parameter count

| Parameter tier | Best MMLU (Jan 2024) | Best MMLU (Dec 2024) | Improvement |
|---------------|---------------------|---------------------|------------|
| Under 5B | 52% | 68.8% (Phi-3 Mini) | +16.8 pts |
| 5-10B | 60.1% (Mistral 7B) | 73.0% (Llama 3.1 8B) | +12.9 pts |
| 60-80B | 70.6% (Mixtral 8x7B) | 85.3% (Qwen 2.5 72B) | +14.7 pts |
| 400B+ | N/A | 87.3% (Llama 3.1 405B) | New tier |

Sources: Model papers, Hugging Face, my tracking data.

Every size tier that existed in January improved by roughly 13-17 MMLU points, and the 400B+ tier is entirely new. The improvement was broad-based, not concentrated in one model size.

The model releases

Chart 5: Quarterly release velocity

| Quarter | Models released | Organizations | Open source % |
|---------|----------------|---------------|---------------|
| Q1 2024 | 78 | 28 | 79% |
| Q2 2024 | 82 | 30 | 81% |
| Q3 2024 | 74 | 26 | 83% |
| Q4 2024 (proj) | ~67 | ~23 | 85% |
| Full year | ~301 | ~35 | 82% |

Sources: My tracking spreadsheet, Hugging Face, arXiv, company announcements.

Roughly 301 notable models from about 35 organizations, with Q4 still a projection. Open source models were 82% of releases. The pace peaked in Q2 2024 and has declined slightly since.
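The full-year row can be rebuilt from the quarterly rows. This sketch assumes the full-year open source share is a release-weighted average of the quarterly shares, which is my reading of Chart 5 rather than something the spreadsheet states. The Q4 inputs are the projected values.

```python
# (models released, open source share) per quarter, copied from Chart 5.
# Q4 figures are the projected values from the table.
quarters = {
    "Q1 2024": (78, 0.79),
    "Q2 2024": (82, 0.81),
    "Q3 2024": (74, 0.83),
    "Q4 2024": (67, 0.85),
}

total_models = sum(count for count, _ in quarters.values())

# Weight each quarter's share by how many models it contributed.
open_models = sum(count * share for count, share in quarters.values())
open_share = open_models / total_models

# Quarter with the most releases.
peak_quarter = max(quarters, key=lambda q: quarters[q][0])

print(f"Total models: {total_models}")
print(f"Release-weighted open source share: {open_share:.1%}")
print(f"Peak quarter: {peak_quarter}")
```

The organization count is not summed the same way, since the same organizations release models in multiple quarters.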

Chart 6: The major model timeline

| Date | Model | What it proved |
|------|-------|---------------|
| Jan 8 | Mixtral 8x7B (broad availability) | MoE can match GPT-3.5 at 10x lower cost |
| Feb 19 | Groq LPU demo | Custom chips can do 800+ tok/sec |
| Feb 26 | Mistral Large | Europe has a frontier model |
| Mar 4 | Claude 3 (Haiku/Sonnet/Opus) | 3-tier pricing works, Opus matches GPT-4 |
| Apr 18 | Llama 3 8B/70B | Small new model > large old model |
| Apr 23 | Phi-3 Mini (3.8B) | 3.8B can beat 7B with better data |
| May 13 | GPT-4o | Better + cheaper is possible |
| Jun 20 | Claude 3.5 Sonnet | Mid-tier model beats flagship |
| Jul 18 | GPT-4o mini | Budget model at $0.15/M input |
| Jul 23 | Llama 3.1 405B | Open source reaches GPT-4 class |
| Sep 12 | o1-preview | Reasoning models: entirely new approach |
| Sep 19 | Qwen 2.5 72B | Chinese models reach top tier |
| Oct 22 | Claude 3.5 Sonnet (new) + computer use | AI can use a computer (58% of the time) |
| Dec 11 | Gemini 2.0 Flash | Speed + quality + low cost |
| Dec 26 | DeepSeek V3 | $5.6M training cost for frontier quality |

The reasoning model shift

Chart 7: Reasoning vs standard models

| Model type | MATH score range | AIME score range | Cost per problem |
|-----------|-----------------|------------------|-----------------|
| Standard models (GPT-4o, Claude 3.5 Sonnet) | 60-78% | 13-20% | $0.009-0.022 |
| Reasoning models (o1-preview) | 83% | 74% | $0.78 |
| Reasoning models (o1-mini) | 70% | 57% | $0.08 |

Sources: OpenAI o1 system card, model papers, my cost calculations.

o1-preview on AIME (math competition): 74.4%. GPT-4o: 13.4%. That 61-point gap is the single largest capability jump from any model release in 2024. Reasoning models didn't just improve on the existing curve. They created a new curve.

The coding assistant explosion

Chart 8: AI coding tool adoption

| Tool | Market share (est. Dec 2024) | Change from Jan 2024 |
|------|----------------------------|---------------------|
| GitHub Copilot | 35% | -8 pts |
| Cursor | 24% | +18 pts |
| Continue | 8% | +5 pts |
| Aider | 6% | +4 pts |
| Sourcegraph Cody | 5% | +1 pt |
| Others | 22% | Fragmented |

Source: My estimates based on developer surveys, download data, and industry reports.

Cursor went from niche to 24% market share in one year. Copilot is still the leader but lost 8 points. The market is fragmenting.

The inference provider market

Chart 9: Provider market map

| Provider category | Count (Dec 2024) | Examples |
|------------------|-----------------|---------|
| Major cloud (with own models) | 3 | OpenAI, Anthropic, Google |
| Major cloud (hosting open source) | 3 | AWS, Azure, GCP |
| Specialized inference | 6 | Groq, Fireworks, Together, Cerebras, Baseten, Modal |
| European providers | 2 | Mistral AI, Scaleway |
| Chinese providers | 4 | Alibaba, DeepSeek, Baichuan, MiniMax |
| Self-hosting tools | 5+ | vLLM, llama.cpp, Ollama, LM Studio, TGI |
| Total providers | ~23+ | |

Source: My tracking, December 2024.

23+ ways to run an LLM. Compare to January 2023: essentially 1 (OpenAI API). The market grew 23x in two years.

The DeepSeek surprise (late December)

Chart 10: Training cost estimates

| Model | Estimated training cost | Source | Parameters | Quality tier |
|-------|----------------------|--------|-----------|-------------|
| GPT-4 | $100M+ | Industry estimates | Unknown | Frontier (2023) |
| Claude 3 Opus | $50-100M est. | Industry estimates | Unknown | Frontier (2024) |
| Llama 3.1 405B | $30-50M est. | Compute analysis | 405B | Frontier |
| DeepSeek V3 | $5.6M (reported) | DeepSeek technical report | 671B MoE | Frontier |

Sources: DeepSeek V3 technical report, industry analyst estimates, SemiAnalysis.

If DeepSeek V3's $5.6M training cost is accurate (and the paper's methodology is detailed enough to be credible), it means frontier model training can be roughly 5-18x cheaper than the industry estimates in Chart 10. This is the most consequential data point of late 2024.
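The size of that multiple follows directly from Chart 10. Here is a hedged sketch that divides each estimate by DeepSeek V3's reported figure. The ranges are the table's estimates rather than audited numbers, and `ratio_vs_deepseek` is an illustrative helper name.

```python
from typing import Dict, Optional, Tuple

DEEPSEEK_V3_COST_M = 5.6  # reported training cost in $ millions (Chart 10)

# (low, high) estimated training cost in $ millions, from Chart 10.
# GPT-4 is listed as "$100M+", so only a lower bound is available.
estimates_m: Dict[str, Tuple[float, Optional[float]]] = {
    "GPT-4": (100.0, None),
    "Claude 3 Opus": (50.0, 100.0),
    "Llama 3.1 405B": (30.0, 50.0),
}


def ratio_vs_deepseek(cost_m: float) -> float:
    """How many times DeepSeek V3's reported cost a given estimate is."""
    return cost_m / DEEPSEEK_V3_COST_M


for model, (low, high) in estimates_m.items():
    if high is None:
        print(f"{model}: at least {ratio_vs_deepseek(low):.1f}x DeepSeek V3")
    else:
        print(
            f"{model}: {ratio_vs_deepseek(low):.1f}x to "
            f"{ratio_vs_deepseek(high):.1f}x DeepSeek V3"
        )
```

Chart 10's ranges are wide, so the multiple depends heavily on which comparison model you pick.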

My 10 key takeaways

| # | Takeaway | Supporting data |
|---|---------|----------------|
| 1 | AI became a commodity | Budget API went from $2.00 to $0.08/M tokens |
| 2 | Open source reached parity | Llama 3.1 405B matches GPT-4 on 9/10 benchmarks |
| 3 | Small models got very good | Phi-3 3.8B beats Llama 2 13B on MMLU |
| 4 | Reasoning models are a new category | o1 AIME: 74% vs GPT-4o: 13% |
| 5 | Coding assistants went mainstream | 50%+ of developers now use one |
| 6 | The China-US AI gap narrowed | Qwen 2.5 and DeepSeek V3 are frontier-competitive |
| 7 | Context windows stopped mattering | 1M tokens available, 200K is standard |
| 8 | Computer use appeared (barely) | 58% success rate, but it exists |
| 9 | Training costs may be much lower than assumed | DeepSeek V3: $5.6M |
| 10 | Model release pace may have peaked | Q4 2024 releases slightly down from Q4 2023 |

Looking ahead

2021 was the year of GPT-3. 2022 was image generation and ChatGPT. 2023 was GPT-4 and the open source explosion. 2024 was commoditization.

What's 2025? If I had to bet: it's the year AI agents either work or don't. Computer use, tool use, multi-step reasoning, autonomous task completion. The models are good enough. The question is whether the systems around them (memory, planning, error recovery) can make agents reliable enough for real work.

I'll be tracking the data. Four years of spreadsheets and counting. My ikigai remains: finding the story in the numbers.

-- dataku