Data Stories · February 20, 2023 · 6 min read

I counted every AI model released this quarter. Here's what I found.

Q4 2022 had 31 notable model releases. Q1 2023 is on pace for 58. The acceleration is real, and it's not just one company driving it. I categorized every single one.

I keep a spreadsheet of every notable AI model release. Have since mid-2021. It started as a personal tracking habit and turned into something I check obsessively.

Here's the trend that's been keeping me up at night.

The release cadence, quarter by quarter

| Quarter | Notable model releases | Avg per month |
|---------|------------------------|---------------|
| Q1 2022 | 14 | 4.7 |
| Q2 2022 | 17 | 5.7 |
| Q3 2022 | 22 | 7.3 |
| Q4 2022 | 31 | 10.3 |
| Q1 2023 (projected) | 58 | 19.3 |

Sources: My tracking spreadsheet, cross-referenced with Hugging Face model hub, arXiv papers, Papers With Code, and Epoch AI.

From 14 to a projected 58 in a year. That's roughly a 4x increase in the rate of model releases.

And Q1 2023 isn't even over yet. I'm projecting 58 based on the current pace of 19-20 per month through mid-February. By the time you read this, the number might be higher.
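For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the monthly averages, the quarter-over-quarter growth, and the projection range from the counts in the table. The variable and function names are my own for illustration, not part of the tracking spreadsheet, and the projection assumes the 19-20 per month pace simply holds for all three months.

```python
# Counts copied from the quarterly table; the Q1 2023 figure is a projection.
releases = {
    "Q1 2022": 14,
    "Q2 2022": 17,
    "Q3 2022": 22,
    "Q4 2022": 31,
    "Q1 2023 (projected)": 58,
}


def monthly_average(count, months=3):
    """Average releases per month over a quarter, to one decimal place."""
    return round(count / months, 1)


def growth_pct(previous, current):
    """Percentage change from one quarter to the next, to one decimal place."""
    return round((current - previous) / previous * 100, 1)


quarters = list(releases)
monthly = {q: monthly_average(n) for q, n in releases.items()}

# Pair each quarter with the one before it to get quarter-over-quarter growth.
growth = {
    curr: growth_pct(releases[prev], releases[curr])
    for prev, curr in zip(quarters, quarters[1:])
}

# Ratio of the projected Q1 2023 count to the Q1 2022 count.
year_over_year = round(releases["Q1 2023 (projected)"] / releases["Q1 2022"], 2)

# Run-rate projection (assumption: a steady 19 to 20 releases per month).
projection_range = (19 * 3, 20 * 3)

for q in quarters:
    print(q, releases[q], monthly[q], growth.get(q, "n/a"))
print("Year over year:", year_over_year)
print("Projected range:", projection_range)
```

The growth column is the interesting part: each quarter grew faster than the one before it (about 21%, 29%, 41%, then 87%), which is why the overall ratio lands at about 4.1x. The steady-pace assumption gives a range of 57 to 60, and 58 sits inside it.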

Who's releasing what

I categorized all Q1 2023 releases by organization type:

| Organization type | # of releases | % of total | Notable examples |
|-------------------|---------------|------------|------------------|
| Big tech (Google, Meta, Microsoft) | 12 | 21% | PaLM 2, LLaMA |
| AI labs (OpenAI, Anthropic, DeepMind) | 8 | 14% | GPT-4, Claude |
| Open source community | 22 | 38% | Alpaca, Vicuna, GPT4All |
| Startups (Mistral, Cohere, etc.) | 11 | 19% | Various fine-tunes |
| Academic | 5 | 9% | Stanford Alpaca |

The open source community is producing 38% of all notable model releases. That's new. In Q1 2022, open source was about 15% of the total. In one year, the community went from a small fraction to the single largest category.
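The percentages in the organization-type breakdown come straight from the counts. Here is a small Python sketch of that calculation; the dictionary keys are shortened labels I picked for illustration, and the rounding to whole percentages is my assumption about how the table was built.

```python
# Release counts by organization type, copied from the breakdown table.
by_org_type = {
    "Big tech": 12,
    "AI labs": 8,
    "Open source community": 22,
    "Startups": 11,
    "Academic": 5,
}


def shares(counts):
    """Return each category's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {name: round(n / total * 100) for name, n in counts.items()}


org_shares = shares(by_org_type)
largest = max(by_org_type, key=by_org_type.get)

for name, pct in org_shares.items():
    print(f"{name}: {by_org_type[name]} releases, {pct}%")
print("Total releases:", sum(by_org_type.values()))
print("Largest category:", largest)
print("Sum of rounded shares:", sum(org_shares.values()))
```

One thing this makes visible: the rounded shares add up to 101%, not 100%. That is a rounding artifact, not a counting error, since the underlying counts sum to 58.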

The LLaMA effect

I need to talk about Meta's LLaMA separately because it changed the release dynamics completely.

LLaMA dropped on February 24. The weights leaked within days. And then:

| Date | Model | Based on | Organization |
|------|-------|----------|--------------|
| Feb 24 | LLaMA (7B/13B/30B/65B) | Original | Meta AI |
| Mar 13 | Stanford Alpaca | LLaMA 7B | Stanford |
| Mar 17 | GPT4All | LLaMA 7B | Nomic AI |
| Mar 28 | Vicuna-13B | LLaMA 13B | LMSYS |
| Mar 30 | Koala-13B | LLaMA 13B | UC Berkeley |
| Apr (early) | Open Assistant | LLaMA variants | LAION |

Sources: Hugging Face model pages, GitHub repos, project announcements.

Five derivative models in about six weeks. Each one fine-tuned LLaMA for a different purpose, and each one was free.

This is what happens when a strong base model becomes available to the community. The experimentation rate explodes. Before LLaMA, the open source community was building on GPT-J and BLOOM, which were decent but clearly behind the frontier. LLaMA closed enough of the gap that fine-tuning could produce genuinely useful models.

Text vs. image vs. everything else

I also broke down Q1 2023 releases by modality:

| Modality | # of releases | % of total |
|----------|---------------|------------|
| Text (LLMs) | 38 | 66% |
| Image generation | 9 | 16% |
| Code generation | 5 | 9% |
| Multimodal | 4 | 7% |
| Audio/speech | 2 | 3% |

Text LLMs dominate. That's partly because fine-tuning an existing LLM is cheaper and faster than training an image model from scratch. The barrier to releasing a new text model dropped to "download LLaMA, fine-tune on a dataset, push to Hugging Face." Some of these releases took days, not months.

The quality distribution is getting weird

Not all models are equal. I tried to roughly bucket the 58 releases by quality tier:

| Quality tier | Count | % | Examples |
|--------------|-------|---|----------|
| Frontier (best available) | 3 | 5% | GPT-4, Claude v1, PaLM 2 |
| Near-frontier | 7 | 12% | LLaMA 65B, Vicuna-13B |
| Good enough for production | 15 | 26% | GPT-3.5-turbo, Alpaca |
| Experimental/hobby | 33 | 57% | Various small fine-tunes |

57% of releases are experimental. That sounds bad until you realize that this category didn't exist a year ago. There was no "hobby tier" of AI model development. Now there are individual researchers and small teams releasing models on a weekly basis.

What I expected vs. what happened

Honestly? I expected the acceleration to come from big labs with more compute. I thought GPT-4 would spark a response from Google and Anthropic, and we'd see 3-4 new frontier models per quarter.

Instead, the acceleration came from the bottom. Open source. Small teams. University labs. The total number of frontier models is still about the same (2-3 per quarter). The explosion is entirely in the mid-tier and experimental categories.

The pattern looks like software development in the 2000s. A few big platforms (Windows, Linux, macOS) plus thousands of applications and libraries built on top of them. LLaMA is becoming the Linux of language models: not because it's the best, but because it's the base that everyone can build on.

One number that concerns me

Here's the number I keep staring at: 33 experimental models in one quarter, most with minimal evaluation.

Of those 33, I could find published benchmark results for only 11. The other 22 were released with either no benchmarks, vibes-based evaluation ("it feels good"), or cherry-picked examples.
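To put that gap in proportion, here is a quick Python sketch of the coverage math. The variable names are mine, and the last figure is only a lower bound: it counts just the experimental tier, since benchmark coverage for the other tiers isn't tallied here.

```python
# Figures from the text: 33 experimental releases, 11 with published benchmarks.
total_releases = 58
experimental = 33
with_benchmarks = 11

without_benchmarks = experimental - with_benchmarks

# Share of experimental releases that shipped with published benchmark results.
coverage_pct = round(with_benchmarks / experimental * 100, 1)

# Experimental releases without benchmarks, as a share of ALL 58 releases.
# Lower bound only: the other quality tiers are not counted here.
unevaluated_share_pct = round(without_benchmarks / total_releases * 100, 1)

print("Experimental releases without benchmarks:", without_benchmarks)
print("Benchmark coverage in the experimental tier (%):", coverage_pct)
print("Unevaluated experimental releases as % of all releases:", unevaluated_share_pct)
```

So only about a third of the experimental tier has published numbers, and the unevaluated experimental models alone make up roughly 38% of everything released this quarter.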

We're accelerating model releases faster than we're accelerating model evaluation. That gap is going to become a problem. When there are 100 models per quarter, how does anyone know which ones are actually good?

The Hugging Face Open LLM Leaderboard is trying to solve this, but it's already showing signs of Goodhart's Law (models being optimized specifically for leaderboard benchmarks). I'll write more about that soon.

For now, the raw numbers tell the story: AI model development went from a marathon to a sprint, and the field of runners roughly quadrupled in a year.



-- dataku
