Industry Trends | October 27, 2025 | 5 min read

Small language models in production: who's deploying what

I surveyed 50 companies deploying LLMs in production. 62% use models under 13B parameters. The most popular: Llama 3.2 3B (18%), Phi-4 (14%), and Mistral 7B (12%). Small models aren't just for research anymore.

Big models get the headlines. Small models get deployed.

I surveyed 50 companies that have LLMs in production (not prototypes, not demos, actual production systems serving real users). The results paint a picture that looks nothing like the Chatbot Arena leaderboard.

Model sizes in production

| Size range | Percentage of deployments |
|-----------|--------------------------|
| Under 3B parameters | 14% |
| 3B to 7B | 28% |
| 7B to 13B | 20% |
| 13B to 70B | 18% |
| 70B+ | 8% |
| Cloud API (size unknown) | 12% |

Sources: My survey of 50 companies, October 2025.

62% of production deployments use models under 13B parameters. The most popular size range is 3B to 7B (28%). Only 8% use 70B+ models.
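The 62% figure is just the three smallest buckets added together. Here is a quick sanity check in Python, with the percentages copied from the table above (the dictionary and variable names are mine, purely for illustration):

```python
# Share of production deployments per size bucket, copied from the survey table.
size_shares = {
    "Under 3B": 14,
    "3B to 7B": 28,
    "7B to 13B": 20,
    "13B to 70B": 18,
    "70B+": 8,
    "Cloud API (size unknown)": 12,
}

# Everything below the 13B line.
small_buckets = ("Under 3B", "3B to 7B", "7B to 13B")
under_13b = sum(size_shares[bucket] for bucket in small_buckets)

# The buckets should account for every deployment.
total = sum(size_shares.values())

# The single most common bucket.
most_common = max(size_shares, key=size_shares.get)

print(f"Under 13B: {under_13b}%")
print(f"All buckets: {total}%")
print(f"Most common bucket: {most_common}")
```

Note that the Cloud API bucket is excluded from the under-13B total because those model sizes are unknown, so 62% is a floor rather than an exact count.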

Most popular models in production

| Model | Usage share | Deployment type |
|-------|-----------|----------------|
| Llama 3.2 3B | 18% | Self-hosted |
| Phi-4 (14B) | 14% | Self-hosted |
| Mistral 7B | 12% | Self-hosted |
| Llama 3.1 8B | 10% | Self-hosted |
| Gemma 2 9B | 8% | Self-hosted |
| Claude 4 Sonnet (API) | 8% | Cloud API |
| GPT-4o mini (API) | 6% | Cloud API |
| Qwen 2.5 7B | 6% | Self-hosted |
| Custom fine-tune (various) | 10% | Self-hosted |
| Other | 8% | Various |

Sources: Survey responses.

Llama 3.2 3B is the most deployed model at 18%. A 3 billion parameter model. Not because it's the smartest. Because it's fast, cheap, and good enough for the tasks people need.

Phi-4 at 14B is second (14%). Microsoft Research's focus on small, high-quality models is paying off in adoption.

Why small models win in production

| Reason | Cited by (% of respondents) |
|--------|---------------------------|
| Lower inference cost | 84% |
| Lower latency | 72% |
| Runs on cheaper hardware | 68% |
| Easier to fine-tune | 56% |
| Privacy (no data leaving premises) | 44% |
| Sufficient quality for our use case | 92% |

92% said the small model is "sufficient quality for our use case." That's the key finding. For most production tasks (classification, extraction, summarization, simple Q&A), a 3-7B model is good enough.

Every company using a 70B+ model (8% of deployments) cited the same reason: "our task requires complex reasoning that small models can't do."

Use case by model size

| Use case | Typical model size | Why |
|----------|-------------------|-----|
| Text classification | 1-3B | Fast, cheap, accuracy above 90% |
| Named entity extraction | 3-7B | Needs some understanding of context |
| Simple summarization | 3-7B | Short context, structured output |
| Customer support routing | 3-7B | Pattern matching with some nuance |
| Code completion | 7-13B | Needs to understand syntax and patterns |
| Complex analysis | 70B+ or API | Needs deep reasoning |
| Creative content | API (Claude/GPT) | Quality matters more than cost |

The pattern: as the task gets more complex and quality-sensitive, the model size goes up. Simple, high-volume tasks overwhelmingly use small models.

The economics

| Model | Hardware cost (monthly) | Queries/day capacity | Cost per 1K queries |
|-------|------------------------|---------------------|---------------------|
| Llama 3.2 3B (1x A10G) | $350 | 500K+ | $0.023 |
| Phi-4 14B (1x A100) | $1,800 | 200K | $0.30 |
| Llama 3.1 70B (4x A100) | $7,200 | 50K | $4.80 |
| Claude 4 Sonnet (API) | N/A | Unlimited | $9.10 |

Sources: AWS, Lambda Labs, Anthropic, my calculations.

Running Llama 3.2 3B on a single A10G GPU costs $350/month and handles 500K+ queries per day. Spread over a 30-day month at that volume, that's about $0.023 per thousand queries. Compare that to Claude 4 Sonnet at $9.10 per thousand: roughly a 390x cost difference.

At 500K queries/day, the API alternative would cost $4,550/day ($136,500/month). The self-hosted 3B model costs $350/month. That's a 390x savings.
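For anyone who wants to plug in their own numbers, here is a rough sketch of the arithmetic behind the table. It assumes a 30-day month and hardware running at its full stated capacity, and it ignores engineering and ops time. The function and variable names are mine, not from any library.

```python
DAYS_PER_MONTH = 30  # assumption: a 30-day month


def cost_per_1k(monthly_cost_usd: float, queries_per_day: float) -> float:
    """Hardware cost per 1,000 queries, assuming the box runs at full capacity."""
    queries_per_month = queries_per_day * DAYS_PER_MONTH
    return monthly_cost_usd / queries_per_month * 1000


# (monthly hardware cost in USD, queries/day capacity), copied from the table.
self_hosted = {
    "Llama 3.2 3B (1x A10G)": (350, 500_000),
    "Phi-4 14B (1x A100)": (1_800, 200_000),
    "Llama 3.1 70B (4x A100)": (7_200, 50_000),
}
API_COST_PER_1K = 9.10  # Claude 4 Sonnet figure from the table

costs = {name: cost_per_1k(*row) for name, row in self_hosted.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.3f} per 1K queries")

# What the API would cost per month at the 3B model's daily volume.
api_monthly = 500_000 / 1000 * API_COST_PER_1K * DAYS_PER_MONTH
ratio = api_monthly / 350

# Daily volume at which API spend equals the $350/month GPU bill.
break_even_per_day = 350 / API_COST_PER_1K * 1000 / DAYS_PER_MONTH

print(f"API at 500K queries/day: ${api_monthly:,.0f}/month ({ratio:.0f}x)")
print(f"Break-even volume: about {break_even_per_day:,.0f} queries/day")
```

The break-even line is the useful one for smaller products. On these numbers, the single A10G only beats the API on hardware cost once you pass roughly 1,300 queries a day, and the real threshold is higher once you count the people who keep the GPU running.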

My take

The AI industry narrative focuses on frontier models. The AI deployment reality focuses on small models.

Ollama makes running small models trivially easy. Hugging Face has thousands of fine-tuned variants. The tooling for small model deployment has matured enormously in 2025.

If you're building an AI product and you haven't tested whether a 3B model can handle your task, you're probably overspending.



-- dataku
