Mixtral 8x7B: the MoE model that changes the economics of inference
Mistral dropped Mixtral via a magnet link: no paper, no blog post, just a torrent. Community benchmarks appeared within hours. A mixture-of-experts model at GPT-3.5 quality with roughly 13B active parameters? The inference cost math is wild.
Mistral AI did it again. Dropped a model via magnet link on Twitter. No paper. No blog post. No press tour. Just a torrent link and a winking emoji.
This time it's Mixtral 8x7B, a mixture-of-experts (MoE) model. And the inference economics are going to change things.
What is MoE and why should you care?
A quick explanation for the data-focused reader.
A standard 7B model has 7 billion parameters. Every input token passes through all 7 billion parameters. A mixture-of-experts model has multiple "expert" sub-networks, and only a few are activated for each token.
| Architecture | Total parameters | Active parameters per token | Memory needed | Compute per token (relative to 7B) |
|-------------|-----------------|---------------------------|---------------|-------------------|
| Standard 7B | 7B | 7B (100%) | ~14GB FP16 | 1x |
| Standard 13B | 13B | 13B (100%) | ~26GB FP16 | ~1.9x |
| Mixtral 8x7B (MoE) | 46.7B | ~12.9B (~28%) | ~90GB FP16 | ~1.8x |
Source: Mixtral model card on Hugging Face, community analysis.
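Mixtral has 8 expert networks in each layer. Despite the name, the total is not 8 x 7B = 56B: only the feed-forward blocks are replicated as experts, while the attention layers and embeddings are shared. For every token, a router selects 2 of the 8 experts. So 46.7B total parameters, but only ~12.9B are active for any given token.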
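The "router selects 2 of the 8 experts" step can be sketched in a few lines. This is a toy illustration, not Mistral's code: the `route` and `moe_layer` names, the random logits, and the scaling "experts" are all hypothetical stand-ins, and the gating rule (softmax over the two selected logits) is my assumption about how top-2 routing is usually done.

```python
import math
import random

NUM_EXPERTS = 8  # experts per MoE layer, as described in the post
TOP_K = 2        # experts the router picks for each token


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight), ...] for the top_k highest logits.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, experts, router_logits):
    """Weighted sum of the chosen experts' outputs; unchosen experts never run."""
    out = [0.0] * len(token)
    for idx, weight in route(router_logits):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out


# Hypothetical toy experts: expert i just scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

rng = random.Random(0)
logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selection = route(logits)
output = moe_layer([1.0, 2.0], experts, logits)
print("chosen experts and weights:", selection)
print("layer output:", output)
```

The point of the sketch is the loop in `moe_layer`: six of the eight experts are never called for this token, which is where the compute saving comes from.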
This matters because per-token compute is determined by active parameters, not total parameters. Mixtral does roughly the compute of a 13B model with the knowledge capacity of a 47B model.
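One way to see where 46.7B and ~12.9B come from is to count parameters from the layer shapes. The shape constants below are not from this post: they are assumptions taken from the published Mixtral-8x7B model config, and the accounting (shared attention and embeddings, three projection matrices per expert) is a sketch of a standard decoder layout rather than an official breakdown.

```python
# Shape values assumed from the published Mixtral-8x7B config (not stated in this post).
HIDDEN = 4096
FFN = 14336
LAYERS = 32
N_HEADS = 32
N_KV_HEADS = 8
VOCAB = 32000
N_EXPERTS = 8
EXPERTS_PER_TOKEN = 2

head_dim = HIDDEN // N_HEADS
kv_dim = N_KV_HEADS * head_dim

# Attention: Q and O are HIDDEN x HIDDEN; K and V are HIDDEN x kv_dim (grouped-query attention).
attn_per_layer = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
# One expert: three HIDDEN x FFN projection matrices (gate, up, down).
expert_per_layer = 3 * HIDDEN * FFN
router_per_layer = HIDDEN * N_EXPERTS
norms_per_layer = 2 * HIDDEN
embeddings = 2 * VOCAB * HIDDEN  # input embedding plus untied output head
final_norm = HIDDEN

# Parameters every token passes through regardless of routing.
shared = LAYERS * (attn_per_layer + router_per_layer + norms_per_layer) + embeddings + final_norm


def params_with_experts(n_experts):
    """Total parameter count if each layer holds (or uses) n_experts experts."""
    return shared + LAYERS * n_experts * expert_per_layer


total = params_with_experts(N_EXPERTS)
active = params_with_experts(EXPERTS_PER_TOKEN)

print(f"shared (attention, embeddings, router, norms): {shared / 1e9:.2f}B")
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B ({active / total:.0%} of total)")
```

Under these assumptions almost all of the parameters sit in the expert feed-forward blocks, which is why picking 2 of 8 experts cuts the active count to a bit more than a quarter of the total.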
The benchmark numbers
Within hours of the torrent appearing, people on Hugging Face and Twitter were running benchmarks. Here's what emerged:
| Benchmark | Mixtral 8x7B | Llama 2 70B | GPT-3.5-turbo | Mistral 7B |
|-----------|-------------|-------------|---------------|------------|
| MMLU (5-shot) | 70.6% | 68.9% | 70.0% | 60.1% |
| HellaSwag | 86.5% | 85.3% | 85.5% | 81.3% |
| ARC Challenge | 65.7% | 64.6% | 85.2% | 55.5% |
| WinoGrande | 77.2% | 77.4% | N/A | 75.3% |
| HumanEval | 34.2% | 29.9% | 48.1% | 30.5% |
| GSM8K | 58.4% | 56.8% | 57.1% | 35.4% |
| TruthfulQA | 46.7% | 44.9% | N/A | 42.2% |
Sources: Community evaluations on Hugging Face, cross-referenced with LMSYS early results.
Look at the MMLU row. Mixtral 8x7B: 70.6%. GPT-3.5-turbo: 70.0%. Llama 2 70B: 68.9%.
Mixtral just beat GPT-3.5 on MMLU. An open source model. Available via torrent.
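And on GSM8K (math), Mixtral scores 58.4% vs GPT-3.5's 57.1%. On HellaSwag, Mixtral leads at 86.5%. It isn't a clean sweep: GPT-3.5-turbo still leads clearly on ARC Challenge (85.2% vs 65.7%) and HumanEval (48.1% vs 34.2%). But on MMLU, HellaSwag, and GSM8K the model is at GPT-3.5 quality, with a slight edge.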
The inference economics (this is the big part)
Here's why MoE changes the game. Mixtral matches GPT-3.5 quality, but look at the inference costs:
| Model | Active params | Tokens/sec (A100) | Cost/1M tokens | Quality (MMLU) |
|-------|--------------|-------------------|------------------------------|---------------|
| GPT-3.5-turbo (API) | Unknown | N/A (API) | $2.00 (API price) | 70.0% |
| Llama 2 70B | ~69B | ~25 | ~$0.40 (self-hosted) | 68.9% |
| Mixtral 8x7B | 12.9B | ~55 | ~$0.18 (self-hosted) | 70.6% |
| Mistral 7B | 7.2B | ~150 | ~$0.08 (self-hosted) | 60.1% |
Sources: Community throughput tests, Together AI benchmarks, my cost calculations based on A100 rental at $1.50/hr.
Mixtral generates at ~55 tokens/second on an A100. That's over 2x faster than Llama 2 70B (~25 tokens/second) because Mixtral only activates 12.9B parameters per token, while the dense Llama 2 70B runs all of its ~69B.
At $0.18 per million tokens, Mixtral is:
- 11x cheaper than the GPT-3.5-turbo API ($2.00)
- 2.2x cheaper than Llama 2 70B self-hosted ($0.40)
- At equal or better quality on most benchmarks
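A sketch of the arithmetic behind these figures, assuming GPU rental at the $1.50/hr quoted above is the only cost. The function names are hypothetical. One caveat: 55 tokens/second on a single stream does not by itself give $0.18 per million tokens. One consistent reading (an assumption, since the table does not state batch sizes) is that the tokens/sec column is single-request speed while the cost column assumes many requests batched on the same GPU.

```python
A100_DOLLARS_PER_HOUR = 1.50  # rental price quoted in the sources line


def cost_per_million_tokens(dollars_per_hour, tokens_per_second):
    """Serving cost in $/1M tokens for a GPU producing tokens_per_second in aggregate."""
    tokens_per_hour = tokens_per_second * 3600
    return dollars_per_hour / tokens_per_hour * 1_000_000


def implied_tokens_per_second(dollars_per_hour, dollars_per_million):
    """Aggregate throughput needed to reach a given $/1M tokens figure."""
    tokens_per_hour = dollars_per_hour / dollars_per_million * 1_000_000
    return tokens_per_hour / 3600


single_stream = cost_per_million_tokens(A100_DOLLARS_PER_HOUR, 55)
needed = implied_tokens_per_second(A100_DOLLARS_PER_HOUR, 0.18)

print(f"55 tok/s, one stream: ${single_stream:.2f} per 1M tokens")
print(f"$0.18 per 1M tokens needs ~{needed:.0f} tok/s aggregate")
print(f"vs GPT-3.5-turbo API: {2.00 / 0.18:.1f}x cheaper")
print(f"vs Llama 2 70B self-hosted: {0.40 / 0.18:.1f}x cheaper")
```

Swapping in your own hourly price and measured batched throughput is the quickest way to check whether the ~$0.18 figure holds for your setup.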
The memory problem
There's a catch. Mixtral has 46.7B total parameters. Even though only 12.9B are active per token, all 46.7B need to be in memory because the router picks experts token by token at inference time, so any expert may be needed at any step.
| Model | FP16 memory | 4-bit quantized memory | Fits on RTX 4090 (24GB)? |
|-------|------------|----------------------|--------------------------|
| Mistral 7B | ~14.5GB | ~4.3GB | Yes (FP16 and 4-bit) |
| Llama 2 13B | ~26GB | ~7.5GB | Yes (4-bit only) |
| Mixtral 8x7B | ~90GB | ~26GB | Not fully (4-bit is ~2GB over; needs offload) |
| Llama 2 70B | ~130GB | ~38GB | No |
Mixtral at 4-bit quantization sits right at the edge of a 24GB consumer GPU. At ~26GB it is slightly over, so you need partial CPU offload or a more aggressive quantization, and performance will be limited. For comfortable running, you want 2x RTX 4090 or a single A100.
This is better than Llama 2 70B (which doesn't fit on a consumer GPU at all), but worse than Mistral 7B (which fits easily). The MoE trade-off: more knowledge capacity, but more VRAM needed.
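The memory figures can be roughly sanity-checked with a weights-only estimate: parameters times bits per parameter. This is a lower bound under simplifying assumptions. Real quantized formats store extra scale data, and the runtime needs room for the KV cache and activations, so the numbers will not match the table exactly. The function name is hypothetical.

```python
GPU_VRAM_GB = 24  # RTX 4090


def weight_memory_gb(params_billions, bits_per_param):
    """Weights-only footprint in GB (1 GB = 1e9 bytes).

    Excludes KV cache, activations, and quantization metadata.
    """
    return params_billions * bits_per_param / 8


# Parameter counts in billions, as quoted in this post.
models = {"Mistral 7B": 7.2, "Llama 2 13B": 13.0, "Mixtral 8x7B": 46.7}

for name, params_b in models.items():
    fp16 = weight_memory_gb(params_b, 16)
    q4 = weight_memory_gb(params_b, 4)
    headroom = GPU_VRAM_GB - q4
    print(f"{name}: FP16 {fp16:.1f}GB, 4-bit {q4:.1f}GB, "
          f"headroom on 24GB at 4-bit: {headroom:.1f}GB")
```

For Mixtral the 4-bit weights alone leave well under 1GB of headroom on a 24GB card, before any cache or overhead, which is why it is so tight in practice.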
Early hosting prices
The hosting providers moved fast:
| Provider | Mixtral 8x7B price ($/M tokens) | vs GPT-3.5-turbo ($2.00) |
|----------|-------------------------------|--------------------------|
| Together AI | $0.60 | 3.3x cheaper |
| Perplexity AI pplx-api | $0.28 | 7.1x cheaper |
| Anyscale | $0.50 | 4.0x cheaper |
| Self-hosted (A100) | ~$0.18 | 11.1x cheaper |
Sources: Provider pricing pages, December 2023.
Perplexity AI is offering Mixtral at $0.28 per million tokens. That's 7x cheaper than GPT-3.5-turbo for comparable quality. The self-hosted cost is even lower.
What Mixtral means for the market
I see three immediate implications:
1. GPT-3.5-turbo's value proposition just collapsed.
Before Mixtral: GPT-3.5-turbo was the price/performance king. $2/M tokens with good quality, no infrastructure needed.
After Mixtral: you can get the same quality for $0.28-0.60/M tokens via hosted APIs, or $0.18/M self-hosted. The only remaining advantage of GPT-3.5-turbo is convenience and tooling (fine-tuning, function calling, etc.).
2. MoE is the architecture to watch.
If you can get 70B-class quality at 13B-class inference cost, the economics of every LLM application change. I expect to see more MoE models in 2024. The trade-off (higher memory, lower compute) is favorable for most deployment scenarios.
3. Mistral AI is for real.
Two model releases in three months. Both best-in-class for their size. Both released as open source (Apache 2.0 license). A company that raised $113M at seed and has already shipped two models that compete with or beat much larger competitors. The European AI challenger narrative is no longer hypothetical.
The magnet link release style
I should say something about how Mixtral was released. No paper. No blog post (initially). A torrent magnet link on Twitter.
This is either:
- A deliberate PR strategy (the AI research community loves an iconoclastic release)
- A sign that Mistral moves so fast they don't have time for traditional announcements
- Both
It worked. Mixtral dominated AI Twitter for 72 hours. The community benchmarked it, hosted it, quantized it, and fine-tuned it faster than any formal launch process could have achieved. Speed of adoption benefits from speed of release.
I kind of love it. Just drop the weights and let the data speak for itself. My kind of energy.
-- dataku