Every model released in 2022 so far, in one table
47 notable models in 9 months. I put them all in a table with release date, parameters, training data size, and whether they're open or closed. The pattern is hard to miss.
I keep a spreadsheet. You knew that already.
Every time a notable AI model drops, I add a row. Release date, organization, parameter count, training data size, architecture, and whether the weights are open or closed. I started this in January and as of September 12, 2022, there are 47 entries.
Forty-seven models in nine months. That's one every 5.7 days.
Let me show you the table, and then let me show you what the patterns look like.
The full table (language models, image models, multimodal)
| # | Model | Organization | Date | Type | Params | Training data | Open? |
|---|-------|-------------|------|------|--------|--------------|-------|
| 1 | LaMDA 2 | Google | Jan 2022 | Language | 137B | Undisclosed | No |
| 2 | DALL-E 2 | OpenAI | Jan 2022 | Image | ~3.5B | Undisclosed | No |
| 3 | InstructGPT | OpenAI | Jan 2022 | Language | 175B | RLHF on GPT-3 | No |
| 4 | Megatron-Turing 530B | NVIDIA/Microsoft | Feb 2022 | Language | 530B | 339B tokens | No |
| 5 | Chinchilla | DeepMind | Mar 2022 | Language | 70B | 1.4T tokens | No |
| 6 | PaLM | Google | Apr 2022 | Language | 540B | 780B tokens | No |
| 7 | DALL-E 2 (public) | OpenAI | Apr 2022 | Image | ~3.5B | Undisclosed | No |
| 8 | Imagen | Google | May 2022 | Image | ~4.6B | Internal data | No |
| 9 | Flamingo | DeepMind | Apr 2022 | Multimodal | 80B | Undisclosed | No |
| 10 | OPT-175B | Meta AI | May 2022 | Language | 175B | 180B tokens | Yes |
| 11 | GPT-NeoX-20B | EleutherAI | Apr 2022 | Language | 20B | The Pile (800GB) | Yes |
| 12 | Gato | DeepMind | May 2022 | Multimodal/Agent | 1.2B | Multi-task | No |
| 13 | BLOOM | BigScience | May 2022 | Language | 176B | 1.6TB (ROOTS) | Yes |
| 14 | Parti | Google | Jun 2022 | Image | 20B | Undisclosed | No |
| 15 | Minerva | Google | Jun 2022 | Language (math) | 540B | Math-focused | No |
| 16 | LLM.int8() | U of Washington | Jul 2022 | Quantization | N/A | N/A | Yes |
| 17 | Midjourney v3 | Midjourney | Jul 2022 | Image | Undisclosed | Undisclosed | No |
| 18 | Stable Diffusion v1 | Stability AI / CompVis | Aug 2022 | Image | ~890M | LAION-5B subset | Yes |
| 19 | Whisper | OpenAI | Sep 2022 | Speech | 1.5B | 680K hrs audio | Yes |
| 20 | text-davinci-002 | OpenAI | Jan 2022 | Language | ~175B | RLHF | No |
| 21 | code-davinci-002 | OpenAI | Mar 2022 | Code | ~175B | Code-focused | No |
| 22 | AlexaTM 20B | Amazon | Aug 2022 | Language | 20B | 1.3T tokens | Partial |
| 23 | Cedille | Coterie | Mar 2022 | Language (French) | 6B | French corpus | Yes |
| 24 | YaLM 100B | Yandex | Jun 2022 | Language | 100B | 1.7TB | Yes |
| 25 | GLM-130B | Tsinghua | Aug 2022 | Language | 130B | 400B tokens | Yes |
| 26 | Cohere Command | Cohere | Feb 2022 | Language | Undisclosed | Undisclosed | No |
| 27 | AI21 Jurassic-2 | AI21 Labs | Mar 2022 | Language | 178B (est.) | Undisclosed | No |
(Table continues, but you get the picture. The full 47 entries are on my spreadsheet. I'm showing the most notable 27 here because even my obsessive nature has limits for MDX formatting.)
The pattern: open vs. closed
| Quarter | Total models | Open source | Closed | Open % |
|---------|-------------|-------------|--------|--------|
| Q1 2022 (Jan-Mar) | 11 | 2 | 9 | 18.2% |
| Q2 2022 (Apr-Jun) | 16 | 5 | 11 | 31.3% |
| Q3 2022 (Jul-Sep) | 20 | 8 | 12 | 40.0% |
The open source share is climbing: from 18% in Q1 to 40% in Q3. The absolute number of closed models stayed roughly steady (9, 11, 12), while open source models quadrupled (2, 5, 8).
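For anyone who wants to check the percentages, here is a minimal Python sketch that recomputes the open share from the quarterly counts in the table. The dictionary layout and the `open_share` helper are illustrative assumptions, not the structure of the actual spreadsheet.

```python
# Quarterly counts copied from the open-vs-closed table: (total, open, closed).
# The dict layout is an assumption made for this sketch.
quarters = {
    "Q1 2022": (11, 2, 9),
    "Q2 2022": (16, 5, 11),
    "Q3 2022": (20, 8, 12),
}


def open_share(total: int, open_count: int) -> float:
    """Return the open share of a quarter's releases as a percentage."""
    return 100 * open_count / total


shares = {
    quarter: open_share(total, n_open)
    for quarter, (total, n_open, _closed) in quarters.items()
}

for quarter, (total, n_open, closed) in quarters.items():
    # Sanity check: open + closed should account for every model logged.
    assert n_open + closed == total
    print(f"{quarter}: {n_open}/{total} open = {shares[quarter]:.2f}%")

# The three quarters should add up to the 47 entries in the spreadsheet.
total_models = sum(total for total, _open, _closed in quarters.values())
print(f"Total models: {total_models}")

# Growth in open releases from Q1 to Q3.
open_growth = quarters["Q3 2022"][1] / quarters["Q1 2022"][1]
print(f"Open releases, Q3 vs Q1: {open_growth:.0f}x")
```

Q2 works out to exactly 31.25%, which the table shows rounded to 31.3%.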
The inflection point was May-June when Meta released OPT and BigScience released BLOOM. Two 175B+ parameter models, fully open, in the same month. That had never happened before.
The pattern: who's releasing what
| Organization | Models released (2022) | Open source? |
|-------------|----------------------|-------------|
| Google/DeepMind | 9 | 0 of 9 |
| OpenAI | 7 | 1 of 7 (Whisper) |
| Meta AI | 4 | 4 of 4 |
| EleutherAI | 3 | 3 of 3 |
| Stability AI | 2 | 2 of 2 |
| BigScience | 1 | 1 of 1 |
Google and DeepMind have released 9 models this year and open-sourced zero. OpenAI opened Whisper (speech) but nothing else. Meta AI has open-sourced everything. EleutherAI and Stability AI are open by design.
The split is clear: the frontier labs (Google, DeepMind, OpenAI) keep their best models closed. The challengers (Meta, EleutherAI, Stability AI, BigScience) use openness as a strategy.
The pattern: parameter counts are plateauing
This one surprised me. I expected parameter counts to keep climbing exponentially. They're not.
| Half-year period | Largest model released | Parameters |
|-----------------|----------------------|-----------|
| H2 2020 | GShard | 600B (MoE) |
| H1 2021 | Switch Transformer | 1.6T (MoE) |
| H2 2021 | Megatron-Turing NLG | 530B (dense) |
| H1 2022 | PaLM | 540B (dense) |
| H2 2022 (so far) | None larger than PaLM | 540B |
The largest dense model is still around 540B. No one has released a 1T+ dense model. The Chinchilla scaling laws (more data, not more parameters) seem to be influencing the field. Multiple labs are now focusing on training efficiency rather than parameter count.
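One rough way to see the "more data, not more parameters" shift is to divide training tokens by parameters for the rows in the main table that report a token count. The sketch below does that in Python. The tokens-per-parameter ratio is a framing added here for illustration, not a column in the spreadsheet, and rows that list training data in GB/TB or as "Undisclosed" are left out.

```python
# (parameters, training tokens), both in billions, copied from the main table.
# Only rows that report a token count are included.
models = {
    "Megatron-Turing 530B": (530, 339),
    "Chinchilla": (70, 1400),
    "PaLM": (540, 780),
    "OPT-175B": (175, 180),
    "GLM-130B": (130, 400),
    "AlexaTM 20B": (20, 1300),
}

# Training tokens seen per parameter: a crude measure of how "data-heavy"
# a training run was relative to model size.
tokens_per_param = {
    name: tokens / params for name, (params, tokens) in models.items()
}

# Print from most to least data-heavy.
ranked = sorted(tokens_per_param.items(), key=lambda item: item[1], reverse=True)
for name, ratio in ranked:
    print(f"{name:<22} {ratio:6.2f} tokens per parameter")
```

On these numbers Chinchilla comes out at 20 tokens per parameter, while PaLM and Megatron-Turing 530B both land below 2, which is the gap the scaling-law argument is about.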
The training data arms race
This is the trend that excites me most. While parameter counts have plateaued, training dataset sizes are exploding:
| Dataset | Release | Size | Notable for |
|---------|---------|------|-------------|
| The Pile | 2020 | 800GB | First large open dataset |
| ROOTS (BLOOM) | 2022 | 1.6TB | Multilingual, curated |
| LAION-5B | 2022 | 5.85B image-text pairs | Largest open image dataset |
| Chinchilla training data | 2022 | 1.4T tokens | Proved more data > more params |
LAION-5B alone is 240TB of image-text data. The open data movement is making it possible for anyone to train competitive models without access to proprietary datasets from Google or OpenAI.
What I didn't expect
Honestly? I didn't expect 47 models in 9 months. When I started the spreadsheet in January, I figured I'd log maybe 20-25 by year end. We hit that in June.
The pace is accelerating. January-March averaged one model every 8.2 days. July-September is running at one every 4.6 days, and that's dividing by the full 92-day quarter even though it's only September 12. By the time I write the 2022 year-end roundup, we'll likely be past 60.
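The days-per-model figures can be reproduced from the quarterly counts with a few lines of Python. This is a sketch under stated assumptions: it uses full calendar quarters for the headline numbers, and the September 12 cutoff from the top of the post for a separate Q3-to-date figure. The variable and function names are made up for the example.

```python
from datetime import date

# Models logged per quarter, from the open-vs-closed table.
counts = {"Q1": 11, "Q2": 16, "Q3": 20}

# Full calendar quarters for 2022 (start inclusive, end exclusive).
quarter_bounds = {
    "Q1": (date(2022, 1, 1), date(2022, 4, 1)),
    "Q2": (date(2022, 4, 1), date(2022, 7, 1)),
    "Q3": (date(2022, 7, 1), date(2022, 10, 1)),
}


def days_per_model(start: date, end: date, n_models: int) -> float:
    """Average number of days between releases over [start, end)."""
    return (end - start).days / n_models


cadence = {
    quarter: days_per_model(*quarter_bounds[quarter], counts[quarter])
    for quarter in counts
}

for quarter, days in cadence.items():
    print(f"{quarter}: one model every {days:.1f} days")

# Q3 measured only up to the September 12 cutoff (end date is exclusive,
# so September 13 is passed in).
q3_to_date = days_per_model(date(2022, 7, 1), date(2022, 9, 13), counts["Q3"])
print(f"Q3 to date: one model every {q3_to_date:.1f} days")
```

Measured only up to the cutoff, Q3 comes out faster than the full-quarter figure, since the same 20 models are spread over 74 days instead of 92.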
I'll keep the spreadsheet. Someone has to count.
If you found this interesting, you might also like:
- 5 charts that explain why GPU prices went insane in 2021
- AI research papers published in 2021: a mid-year count
- My 2021 AI data roundup: the 10 numbers that mattered most
- The training cost curve is doing something weird
- I tracked AI image generation quality over 6 months. The improvement rate is scary.
-- dataku