Industry Trends · October 11, 2021 · 6 min read

Hugging Face just hit 10,000 models. Here's what the model zoo looks like.

I scraped the Hugging Face model hub and categorized all 10,000+ models by type, language, and download count. Text generation is only 8% of the total. The real king is NER.

Hugging Face crossed 10,000 models on their model hub last week, and I did what I always do when a big round number shows up: I pulled the data and looked at what's actually in there.

The composition of the model zoo tells you a lot about what the ML community is actually building (as opposed to what Twitter thinks the ML community is building). And the story is not "everyone is building GPT-3 clones."

The breakdown by task type

I categorized all models listed on the Hugging Face model hub by their primary task:

| Task type | Model count | Share | Avg monthly downloads |
|-----------|------------|-------|----------------------|
| Token classification (NER) | 2,340 | 22.1% | 1,200 |
| Text classification | 1,890 | 17.8% | 2,800 |
| Translation | 1,420 | 13.4% | 3,400 |
| Fill-mask (MLM) | 1,150 | 10.9% | 890 |
| Question answering | 980 | 9.3% | 1,600 |
| Text generation | 850 | 8.0% | 14,200 |
| Summarization | 520 | 4.9% | 4,100 |
| Speech/audio | 410 | 3.9% | 2,200 |
| Image classification | 380 | 3.6% | 1,800 |
| Other (sentence similarity, zero-shot, etc.) | 660 | 6.2% | 1,100 |
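The post doesn't say how the scrape was done, so here is one plausible way to get a similar tally, not the method behind the table. It is a minimal sketch that assumes the `huggingface_hub` Python client and its `HfApi().list_models()` call, reading each model's `pipeline_tag`. The helper names `count_by_task` and `fetch_pipeline_tags` are made up for illustration, and attribute names and defaults have shifted between client releases, so check against your installed version.

```python
from collections import Counter
from typing import Iterable, Optional


def count_by_task(pipeline_tags: Iterable[Optional[str]]) -> Counter:
    """Tally models by task tag; models with no tag are grouped as 'untagged'."""
    return Counter(tag or "untagged" for tag in pipeline_tags)


def fetch_pipeline_tags(limit: Optional[int] = None) -> list:
    """Pull each model's task tag from the hub (needs network access)."""
    # Imported here so count_by_task works without the client installed.
    from huggingface_hub import HfApi

    return [model.pipeline_tag for model in HfApi().list_models(limit=limit)]


# Offline demonstration with made-up tags (not real hub data).
sample_tags = ["token-classification", "fill-mask", "token-classification", None]
task_counts = count_by_task(sample_tags)
for task, count in task_counts.most_common():
    print(f"{task}: {count}")
```

To run it against the live hub, pass the result of `fetch_pipeline_tags()` to `count_by_task`. Many models carry no task tag at all, so any real tally needs a rule for the untagged bucket.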

Named Entity Recognition (NER) is the king of the model zoo: 22.1% of all models, about 2.75 times the number of text generation models.

But look at the average monthly downloads column. Text generation models average 14,200 downloads per model. NER models average 1,200. There are many NER models, but each one is niche. There are few text generation models, but they're massively popular.

This is the long tail vs the hits. NER is the long tail: thousands of specialized models for specific languages, domains, and entity types. Text generation is the hits: a handful of models that everyone downloads.
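To put numbers on the long tail vs the hits, here is a small sketch that recomputes the two ratios from the task table. The figures are copied from the table. The counts there look rounded to the nearest ten, so recomputed shares can differ from the published ones by 0.1 points.

```python
# (model count, average monthly downloads per model), copied from the task table.
tasks = {
    "Token classification (NER)": (2340, 1200),
    "Text classification": (1890, 2800),
    "Translation": (1420, 3400),
    "Fill-mask (MLM)": (1150, 890),
    "Question answering": (980, 1600),
    "Text generation": (850, 14200),
    "Summarization": (520, 4100),
    "Speech/audio": (410, 2200),
    "Image classification": (380, 1800),
    "Other": (660, 1100),
}

total_models = sum(count for count, _ in tasks.values())


def share(task: str) -> float:
    """Percentage of all models that belong to the given task."""
    return 100 * tasks[task][0] / total_models


ner_count, ner_avg = tasks["Token classification (NER)"]
gen_count, gen_avg = tasks["Text generation"]

count_ratio = ner_count / gen_count   # how many NER models per text generation model
download_ratio = gen_avg / ner_avg    # per-model download advantage of text generation

print(f"Total models: {total_models}")
print(f"NER share: {share('Token classification (NER)'):.1f}%")
print(f"NER vs text generation, model count: {count_ratio:.2f}x")
print(f"Text generation vs NER, downloads per model: {download_ratio:.1f}x")
```

NER has roughly 2.75 models for every text generation model, while a typical text generation model gets roughly 12 times the downloads of a typical NER model.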

Language distribution

| Language | Model count | Share |
|----------|------------|-------|
| English | 4,870 | 46.0% |
| Multilingual | 1,240 | 11.7% |
| Chinese | 680 | 6.4% |
| French | 520 | 4.9% |
| German | 470 | 4.4% |
| Spanish | 390 | 3.7% |
| Arabic | 280 | 2.6% |
| Russian | 210 | 2.0% |
| Japanese | 190 | 1.8% |
| Portuguese | 170 | 1.6% |
| Other (100+ languages) | 1,580 | 14.9% |

English dominates, which surprises nobody. But the 14.9% "Other" category is interesting. It contains models for over 100 languages, many with only 1-3 models each. Yoruba, Swahili, Tamil, Welsh. These are often the result of individual researchers or small teams building NLP tools for their own language.

Japanese at 1.8% (190 models) caught my eye. As someone who follows Japanese tech, this feels low. Japan's ML community is active but apparently prefers other platforms or keeps models internal. Wakatta (I get it), but I wish there were more.

The most downloaded models

The top 10 most downloaded models tell a different story than the category distribution:

| Rank | Model | Task | Monthly downloads |
|------|-------|------|------------------|
| 1 | bert-base-uncased | Fill-mask | 4.2M |
| 2 | gpt2 | Text generation | 2.8M |
| 3 | distilbert-base-uncased | Fill-mask | 2.1M |
| 4 | roberta-base | Fill-mask | 1.4M |
| 5 | bert-base-cased | Fill-mask | 1.1M |
| 6 | t5-small | Text2text | 890K |
| 7 | xlm-roberta-base | Fill-mask | 780K |
| 8 | facebook/bart-large-cnn | Summarization | 720K |
| 9 | distilgpt2 | Text generation | 680K |
| 10 | bert-large-uncased | Fill-mask | 540K |

BERT and its variants absolutely dominate downloads. Six of the top 10 are BERT-family fill-mask models; the other four are GPT-2, DistilGPT-2, T5, and BART. These are the workhorses. They're what people actually use in production.

GPT-3 isn't on this list because it's not on Hugging Face (it's API-only through OpenAI). GPT-J-6B, the open-source alternative, has around 45K monthly downloads. Popular for an open-source LLM, but a rounding error next to bert-base-uncased's 4.2M.
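How lopsided is the top 10? A quick sketch using only the download figures from the table above, plus the roughly 45K quoted for GPT-J-6B. The table values are rounded, so treat the percentages as approximate.

```python
# (model, task, monthly downloads), copied from the top-10 table.
top10 = [
    ("bert-base-uncased", "Fill-mask", 4_200_000),
    ("gpt2", "Text generation", 2_800_000),
    ("distilbert-base-uncased", "Fill-mask", 2_100_000),
    ("roberta-base", "Fill-mask", 1_400_000),
    ("bert-base-cased", "Fill-mask", 1_100_000),
    ("t5-small", "Text2text", 890_000),
    ("xlm-roberta-base", "Fill-mask", 780_000),
    ("facebook/bart-large-cnn", "Summarization", 720_000),
    ("distilgpt2", "Text generation", 680_000),
    ("bert-large-uncased", "Fill-mask", 540_000),
]

total_downloads = sum(downloads for _, _, downloads in top10)

# Sum downloads per task across the ten models.
by_task = {}
for _, task, downloads in top10:
    by_task[task] = by_task.get(task, 0) + downloads

fill_mask_share = by_task["Fill-mask"] / total_downloads

# GPT-J-6B's ~45K monthly downloads relative to the #1 model.
gptj_downloads = 45_000
gptj_vs_bert = gptj_downloads / top10[0][2]

print(f"Fill-mask share of top-10 downloads: {fill_mask_share:.1%}")
print(f"GPT-J-6B relative to bert-base-uncased: {gptj_vs_bert:.1%}")
```

On these figures, fill-mask models account for about two thirds of top-10 downloads, and GPT-J-6B comes in at about 1% of bert-base-uncased.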

What this tells us about the real ML market

The Hugging Face data paints a picture that's starkly different from the AI discourse on Twitter.

On Twitter, AI is about giant language models generating text. On Hugging Face, AI is about BERT models classifying text and extracting entities. The most common real-world ML use case isn't "generate a blog post." It's "find all the person names in this document" and "is this customer email positive or negative?"

This matches what I found in my Q1 funding analysis: enterprise ML tools dominate funding. The Hugging Face data confirms it from the practitioner side. People download NER and classification models because that's what their products need.

Growth rate

Hugging Face crossed 5,000 models in March 2021 and 10,000 in early October. That's a doubling in seven months. If the pace holds, they'll hit 20,000 by mid-2022.

| Milestone | Date | Months to reach |
|-----------|------|-----------------|
| 1,000 models | ~Aug 2020 | Baseline |
| 5,000 models | ~Mar 2021 | 7 months |
| 10,000 models | ~Oct 2021 | 7 months |
| 20,000 models (projected) | ~May 2022 | 7 months |
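The projection in the table is a straight extrapolation of the last doubling. Here is a minimal sketch of that arithmetic, assuming the seven-month doubling time simply continues. The helper names are made up for illustration, and this is an extrapolation, not a forecast model.

```python
def add_months(year: int, month: int, n: int) -> tuple:
    """Return the (year, month) that falls n months after the given month."""
    index = year * 12 + (month - 1) + n
    new_year, new_month_index = divmod(index, 12)
    return new_year, new_month_index + 1


doubling_months = 7          # 5,000 (Mar 2021) to 10,000 (Oct 2021)
current_models = 10_000
current_date = (2021, 10)

# Compound monthly growth rate implied by doubling every seven months.
monthly_growth = 2 ** (1 / doubling_months) - 1


def projected_count(months_ahead: float) -> float:
    """Model count after months_ahead months if the doubling pace holds."""
    return current_models * 2 ** (months_ahead / doubling_months)


projected_date = add_months(*current_date, doubling_months)

print(f"Implied monthly growth: {monthly_growth:.1%}")
print(f"Projected 20,000 milestone: {projected_date[0]}-{projected_date[1]:02d}")
print(f"Projected count after 12 months: {projected_count(12):,.0f}")
```

Doubling every seven months works out to roughly 10% compound growth per month, which puts the 20,000 mark in May 2022.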

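The seven-month cadence between milestones is remarkably consistent, though the multiples are not: 1,000 to 5,000 was a 5x jump, while 5,000 to 10,000 was a doubling, so the 20,000 projection assumes the more recent doubling pace holds. And it's not just quantity. The quality and diversity of models are improving too. I see more production-ready, well-documented models now than I did six months ago. Papers With Code integration is helping: researchers link their papers directly to model weights.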

This is the kaizen of open source ML. Small, consistent improvements. Not one dramatic breakthrough, but 10,000 people each contributing one model.

That compounds.


-- dataku
