Industry Trends | June 6, 2022 | 6 min read

Open source AI is having a moment. Here are the download numbers.

BLOOM just launched. GPT-NeoX is out. I pulled download stats from Hugging Face for every open source LLM. The adoption curves are starting to look serious.

Something shifted in the last three months and I can see it in the numbers.

Open source language models went from curiosity projects to models people actually use in production. The download counts on Hugging Face tell the story better than any press release.

The current leaderboard

I pulled download statistics from the Hugging Face model hub for every open source LLM with more than 1 billion parameters, as of June 1, 2022:

| Model | Organization | Parameters | Monthly downloads | Total downloads (est.) |
|-------|-------------|-----------|-------------------|----------------------|
| GPT-2 | OpenAI | 1.5B | 1,200,000+ | 12,000,000+ |
| GPT-J-6B | EleutherAI | 6B | 340,000 | 2,800,000 |
| GPT-NeoX-20B | EleutherAI | 20B | 85,000 | 280,000 |
| BLOOM-176B | BigScience | 176B | 42,000* | 42,000* |
| GPT-Neo 2.7B | EleutherAI | 2.7B | 180,000 | 1,500,000 |
| OPT-175B | Meta AI | 175B | 28,000 | 95,000 |
| OPT-66B | Meta AI | 66B | 35,000 | 120,000 |
| OPT-30B | Meta AI | 30B | 48,000 | 160,000 |

*BLOOM launched May 2022, so this is essentially launch-month data.
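For anyone who wants to pull similar numbers, here is a minimal sketch using the `huggingface_hub` client. It is not necessarily how the table above was built. The repo IDs are assumptions chosen for illustration (the post does not list them), and the Hub's `downloads` field is a rolling recent-download counter, so it will not reproduce the historical or estimated-total columns.

```python
from typing import Dict, Iterable, Optional

# Assumed repo IDs for illustration only; adjust to the models you care about.
MODEL_IDS = [
    "gpt2-xl",
    "EleutherAI/gpt-j-6B",
    "EleutherAI/gpt-neox-20b",
    "EleutherAI/gpt-neo-2.7B",
]


def fetch_monthly_downloads(
    model_ids: Iterable[str], api=None
) -> Dict[str, Optional[int]]:
    """Return {repo_id: recent download count} as reported by the Hub."""
    if api is None:
        # Imported lazily so the function can also be run with a stub client.
        from huggingface_hub import HfApi

        api = HfApi()
    stats: Dict[str, Optional[int]] = {}
    for repo_id in model_ids:
        info = api.model_info(repo_id)
        # `downloads` may be missing on some responses, so fall back to None.
        stats[repo_id] = getattr(info, "downloads", None)
    return stats
```

Calling `fetch_monthly_downloads(MODEL_IDS)` makes one network request per model. Passing your own `api` object is handy for offline testing.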

GPT-2 still dominates raw downloads because it's small enough to run on a laptop and has been around since 2019. But the action is in the 6B-20B range. GPT-J's 340,000 monthly downloads work out to roughly 11,000 downloads every single day, whether from people or from automated pipelines.
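The per-day figure is simple division. Here is a quick sketch, assuming "monthly" means a 30-day window (the post does not say which window the counts cover):

```python
# Monthly download counts taken from the leaderboard table.
MONTHLY_DOWNLOADS = {
    "GPT-J-6B": 340_000,
    "GPT-Neo 2.7B": 180_000,
    "GPT-NeoX-20B": 85_000,
}

DAYS_IN_WINDOW = 30  # assumption: a 30-day reporting window


def daily_rate(monthly: int, days: int = DAYS_IN_WINDOW) -> float:
    """Average downloads per day over the reporting window."""
    return monthly / days


for model, monthly in MONTHLY_DOWNLOADS.items():
    print(f"{model}: ~{daily_rate(monthly):,.0f} downloads/day")
```

Keep in mind that a download is not a user. CI jobs and repeated pulls from the same machine all count.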

The growth curves

The month-over-month trends tell a more interesting story than the absolute numbers:

| Model | Mar 2022 downloads | Jun 2022 downloads | 3-month growth |
|-------|-------------------|-------------------|----------------|
| GPT-J-6B | 180,000 | 340,000 | +88.9% |
| GPT-Neo 2.7B | 95,000 | 180,000 | +89.5% |
| GPT-NeoX-20B | N/A (launched Apr) | 85,000 | N/A |
| OPT-30B | N/A (launched May) | 48,000 | N/A |

GPT-J and GPT-Neo nearly doubled their downloads in three months, each up about 89%. That's not a blip. That's an adoption curve.
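The growth percentages can be checked with a few lines. The sketch below also derives an implied month-over-month rate, which assumes smooth compounding between the two data points. That is a simplification, not something the table itself shows.

```python
def pct_growth(start: float, end: float) -> float:
    """Total percentage change from start to end."""
    return (end - start) / start * 100


def monthly_compound_rate(start: float, end: float, months: int) -> float:
    """Constant monthly growth rate (%) that would take start to end."""
    return ((end / start) ** (1 / months) - 1) * 100


# (March 2022, June 2022) monthly downloads from the table.
SERIES = {
    "GPT-J-6B": (180_000, 340_000),
    "GPT-Neo 2.7B": (95_000, 180_000),
}

for model, (mar, jun) in SERIES.items():
    total = pct_growth(mar, jun)
    monthly = monthly_compound_rate(mar, jun, months=3)
    print(f"{model}: {total:+.1f}% over 3 months, ~{monthly:.1f}% per month")
```

Under that compounding assumption, both models are growing at a bit over 20% per month.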

BLOOM: the biggest open model ever

BigScience's BLOOM deserves its own section because it's a genuinely unusual project. 176 billion parameters, trained by an international collaboration of over 1,000 researchers across 60 countries. It supports 46 natural languages and 13 programming languages.

| BLOOM detail | Spec |
|-------------|------|
| Parameters | 176B |
| Training data | 1.6TB (ROOTS corpus) |
| Languages | 46 natural + 13 programming |
| Training hardware | 384 NVIDIA A100 80GB GPUs |
| Training time | ~3.5 months |
| Estimated training cost | $2-5M (on Jean Zay supercomputer) |
| License | RAIL (Responsible AI License) |

The training was done on France's Jean Zay supercomputer, which means the compute cost was effectively subsidized by the French government. That's important context. The same run at commercial cloud pricing would cost 2-3x the $2-5M estimate.

The 42,000 downloads in the first month are solid for a 176B model. You need serious hardware to run it (minimum 8x A100 GPUs for inference), so the download base is naturally smaller than GPT-J.
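A back-of-the-envelope check on the hardware requirement, as a sketch under stated assumptions: weights stored in 16-bit precision (2 bytes per parameter), decimal gigabytes, and weights only. Activations, the KV cache, and framework overhead add to this, so the real requirement is higher than the figure computed here.

```python
BLOOM_PARAMS = 176_000_000_000  # 176B parameters, from the spec table


def weight_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in decimal GB."""
    return n_params * bytes_per_param / 1e9


def weights_fit(n_params: int, n_gpus: int, gpu_mem_gb: int,
                bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_memory_gb(n_params, bytes_per_param) <= n_gpus * gpu_mem_gb


print(f"BLOOM weights (16-bit): {weight_memory_gb(BLOOM_PARAMS):.0f} GB")
for n_gpus, mem in [(4, 80), (8, 40), (8, 80)]:
    verdict = "fits" if weights_fit(BLOOM_PARAMS, n_gpus, mem) else "does not fit"
    print(f"{n_gpus} x A100 {mem}GB = {n_gpus * mem} GB: {verdict}")
```

The weights alone come to about 352 GB, which is why a node with eight 80GB cards is a natural baseline while four such cards or eight 40GB cards fall short.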

Meta's OPT: the quiet contender

Meta AI released OPT (Open Pre-trained Transformer) in May, with model weights at sizes ranging from 125M to 175B parameters. The download numbers are modest compared to EleutherAI's models, but Meta has two advantages: institutional credibility and a full suite of model sizes.

| OPT variant | Parameters | Monthly downloads |
|-------------|-----------|-------------------|
| OPT-125M | 125M | 62,000 |
| OPT-350M | 350M | 41,000 |
| OPT-1.3B | 1.3B | 55,000 |
| OPT-6.7B | 6.7B | 38,000 |
| OPT-30B | 30B | 48,000 |
| OPT-66B | 66B | 35,000 |
| OPT-175B | 175B | 28,000 |

The distribution is surprisingly flat. The small models (125M, 1.3B) and the medium model (30B) are the most downloaded. There's a clear split between "I want to experiment on my laptop" and "I want the biggest model that fits on a single server."
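To put a number on "flat", here is a small sketch that ranks the OPT variants and computes each one's share of the family's monthly downloads, using only the figures from the table:

```python
# Monthly downloads per OPT variant, from the table.
OPT_DOWNLOADS = {
    "OPT-125M": 62_000,
    "OPT-350M": 41_000,
    "OPT-1.3B": 55_000,
    "OPT-6.7B": 38_000,
    "OPT-30B": 48_000,
    "OPT-66B": 35_000,
    "OPT-175B": 28_000,
}

total = sum(OPT_DOWNLOADS.values())

# Sort variants from most to least downloaded.
ranked = sorted(OPT_DOWNLOADS.items(), key=lambda kv: kv[1], reverse=True)
top3 = [name for name, _ in ranked[:3]]

for name, count in ranked:
    print(f"{name:>9}: {count:>7,}  ({count / total:.1%} of family total)")

# Ratio between the most and least downloaded variants.
spread = ranked[0][1] / ranked[-1][1]
print(f"Top three: {top3}; max/min ratio: {spread:.2f}x")
```

The most downloaded variant gets only a little over twice the downloads of the least, despite a 1,400x difference in parameter count.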

What's driving the adoption

I see three forces in the data.

First, the API-to-self-hosted migration. Several people in the Hugging Face community forums have posted about switching from OpenAI's API to self-hosted GPT-J. The reasons: cost savings at scale, no content filtering, and independence from a single provider. This matches the cost data I published back in July 2021.

Second, fine-tuning. A large share of GPT-J and GPT-Neo downloads appears to come from people fine-tuning the models on custom datasets. The open weights make this possible. You can't fine-tune GPT-3 the same way (OpenAI's fine-tuning API is more limited).

Third, research. Academic researchers need models they can inspect, probe, and modify. Closed models are black boxes. Open models let you look at the weights, the activations, the attention patterns. Every ML research lab doing interpretability or alignment work needs open source models.

The gap is still real

I should be honest about what the numbers don't show. Quality-wise, GPT-3 175B (Davinci) still beats every open source model on most benchmarks. The gap has narrowed at the 6B parameter scale (as I showed in my GPT-J comparison), but at the 175B scale, OpenAI's model has the advantage of better training data curation and more RLHF fine-tuning.

The open source community is building the raw models. The "make them actually useful and safe" layer (RLHF, safety filtering, instruction following) is still mostly a closed-source advantage.

But the trend is clear. The downloads are going up, the model sizes are going up, and the number of organizations releasing open weights is growing. A year ago, EleutherAI was essentially alone. Now it's EleutherAI, Meta, BigScience, and more coming.

The data says open source AI isn't a sideshow anymore. It's becoming the main stage.



-- dataku
