Llama 4 Scout and Maverick: Meta's MoE play, in data
Meta went mixture-of-experts with Llama 4. Scout is 17B active parameters from 109B total. Maverick is 17B from 400B. I benchmarked both against Llama 3.1 70B. The efficiency gains are exactly what the MoE math predicts.
Meta finally went MoE.
After three generations of dense transformers, Meta AI released Llama 4 with a mixture-of-experts architecture. Two models: Scout (109B total, 17B active) and Maverick (400B total, 17B active).
Both use only 17 billion active parameters per token. The rest sit idle until the routing layer calls them up. I benchmarked both against the previous generation.
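To make "active parameters" concrete, here is a toy sketch of top-1 routing in plain Python. It is not Meta's router: the gate weights are random, the dimension is tiny, and the names `route_top1` and `active_fraction` are mine. It only illustrates the idea that a gate scores each token against every expert and runs just the winner.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top1(token, gate_weights):
    """Return (expert_index, gate_probability) for one token vector.

    gate_weights holds one weight vector per expert; the router scores the
    token against each and keeps only the best-scoring expert.
    """
    scores = [sum(w * x for w, x in zip(weights, token)) for weights in gate_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def active_fraction(active_b, total_b):
    """Share of the model's parameters that run for a single token."""
    return active_b / total_b

random.seed(0)
NUM_EXPERTS, DIM = 16, 8  # 16 experts as reported for Scout; DIM is a toy size
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token = [random.gauss(0, 1) for _ in range(DIM)]

expert, prob = route_top1(token, gate)
print(f"token routed to expert {expert} with gate probability {prob:.2f}")
print(f"Scout active fraction:    {active_fraction(17, 109):.1%}")
print(f"Maverick active fraction: {active_fraction(17, 400):.1%}")
```

The active fraction is simply 17B divided by the total parameter count, so Maverick touches a much smaller slice of its weights per token than Scout does.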
The architecture comparison
| Spec | Llama 4 Scout | Llama 4 Maverick | Llama 3.1 70B | Llama 3.1 405B |
|------|---------------|------------------|---------------|----------------|
| Total parameters | 109B | 400B | 70B | 405B |
| Active parameters | 17B | 17B | 70B | 405B |
| Architecture | MoE (16 experts, 1 active) | MoE (128 experts, 1 active) | Dense | Dense |
| Context window | 10M tokens | 128K | 128K | 128K |
| Training tokens | Not disclosed | Not disclosed | 15T | 15T |
Sources: Meta AI Llama 4 announcement and technical details, Hugging Face model cards.
Scout's 10 million token context window is the most eye-catching spec. That's about 80x the 128K window of the other models in the table and roughly 10x what Google offers with Gemini. Whether it's actually useful at that length is a different question (I test it below).
Benchmark results
| Benchmark | Scout (17B active) | Maverick (17B active) | Llama 3.1 70B | Llama 3.1 405B |
|-----------|--------------------|-----------------------|---------------|----------------|
| MMLU | 79.6% | 85.5% | 86.0% | 87.3% |
| GPQA Diamond | 52.3% | 61.8% | 46.7% | 51.1% |
| HumanEval | 74.4% | 82.3% | 80.5% | 89.0% |
| IFEval | 82.4% | 85.3% | 87.5% | 86.0% |
| SWE-bench Verified | 28.2% | 37.1% | N/A | N/A |
| LiveCodeBench | 36.8% | 48.5% | 42.1% | 52.3% |
| MATH (500) | 68.4% | 77.9% | 68.0% | 73.8% |
Sources: Meta AI Llama 4 benchmarks, LMSYS Chatbot Arena, Together AI testing, Hugging Face Open LLM Leaderboard.
Here's the interesting pattern:
Scout (17B active) matches Llama 3.1 70B on MATH (68.4% vs 68.0%) and beats it on GPQA Diamond (52.3% vs 46.7%), though it trails by about 6 points on MMLU (79.6% vs 86.0%). It does this with roughly 4x fewer active parameters per token. That's the MoE promise delivered.
Maverick (also 17B active, but drawn from a larger pool of 400B) beats Llama 3.1 70B on GPQA (61.8% vs 46.7%) and MATH (77.9% vs 68.0%). On coding benchmarks, Maverick lands within 4 to 7 points of the much larger 405B dense model (HumanEval 82.3% vs 89.0%, LiveCodeBench 48.5% vs 52.3%).
The efficiency math
| Model | Active params | MMLU | MMLU per active billion |
|-------|---------------|------|-------------------------|
| Scout | 17B | 79.6% | 4.68 |
| Maverick | 17B | 85.5% | 5.03 |
| Llama 3.1 70B | 70B | 86.0% | 1.23 |
| Llama 3.1 405B | 405B | 87.3% | 0.22 |
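The last column is just the MMLU score divided by active parameters in billions. A minimal Python sketch, using only the figures from the table (the function names are mine), reproduces it and the Maverick-to-70B ratio:

```python
# (active parameters in billions, MMLU %) copied from the efficiency table
MODELS = {
    "Scout": (17, 79.6),
    "Maverick": (17, 85.5),
    "Llama 3.1 70B": (70, 86.0),
    "Llama 3.1 405B": (405, 87.3),
}

def mmlu_per_active_billion(name):
    """MMLU points per billion parameters used for each token."""
    active_b, mmlu = MODELS[name]
    return mmlu / active_b

def efficiency_ratio(a, b):
    """How many times more MMLU per active billion model a gets than model b."""
    return mmlu_per_active_billion(a) / mmlu_per_active_billion(b)

for name in MODELS:
    print(f"{name:<16} {mmlu_per_active_billion(name):.2f}")

ratio = efficiency_ratio("Maverick", "Llama 3.1 70B")
print(f"Maverick vs Llama 3.1 70B: {ratio:.1f}x")
```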
"MMLU per active billion parameters" is a made-up metric, but it illustrates the point. Maverick extracts 4x more benchmark performance per active parameter than Llama 3.1 70B. That translates directly to inference cost savings.
Inference cost implications
| Model | Active params | Tokens/sec (A100) | Relative inference cost |
|-------|---------------|-------------------|-------------------------|
| Scout | 17B | ~110 | 1x |
| Maverick | 17B | ~95 | ~1.2x (more memory) |
| Llama 3.1 70B | 70B | ~35 | ~3.5x |
| Llama 3.1 405B | 405B | ~8 | ~15x |
Sources: Together AI, my inference benchmarks on A100.
Scout runs at ~110 tokens per second on a single A100, vs ~35 for Llama 3.1 70B. Roughly 3x faster. Maverick is slightly slower than Scout because loading the larger expert pool requires more memory bandwidth, but it's still 2.7x faster than the 70B dense model.
For equivalent quality, you're looking at 3-4x lower inference costs with the MoE architecture.
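The speedups quoted above follow from the tokens/sec column. Here is a short Python sketch of that arithmetic. It assumes cost per token scales inversely with throughput on the same GPU, which is a simplification: the table's relative-cost figures are rounded estimates and may also reflect memory footprint. The hourly GPU price in the example is a hypothetical placeholder, not a quoted rate.

```python
# Approximate tokens/sec on an A100, copied from the inference table
TOKENS_PER_SEC = {
    "Scout": 110,
    "Maverick": 95,
    "Llama 3.1 70B": 35,
    "Llama 3.1 405B": 8,
}

def speedup(fast, slow):
    """Throughput of `fast` as a multiple of `slow`."""
    return TOKENS_PER_SEC[fast] / TOKENS_PER_SEC[slow]

def relative_cost(model, baseline="Scout"):
    """Per-token cost relative to the baseline, assuming cost ~ 1 / throughput."""
    return TOKENS_PER_SEC[baseline] / TOKENS_PER_SEC[model]

def usd_per_million_tokens(model, gpu_usd_per_hour):
    """Cost of generating one million tokens at a given hourly GPU price."""
    seconds = 1_000_000 / TOKENS_PER_SEC[model]
    return gpu_usd_per_hour * seconds / 3600

HYPOTHETICAL_GPU_PRICE = 2.0  # USD per hour; placeholder, not a real quote

print(f"Scout vs 70B:    {speedup('Scout', 'Llama 3.1 70B'):.1f}x faster")
print(f"Maverick vs 70B: {speedup('Maverick', 'Llama 3.1 70B'):.1f}x faster")
for name in TOKENS_PER_SEC:
    cost = usd_per_million_tokens(name, HYPOTHETICAL_GPU_PRICE)
    print(f"{name:<16} {relative_cost(name):5.2f}x  ${cost:.2f} per 1M tokens")
```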
The 10M context window test
I tested Scout's 10M token context window with progressively larger documents:
| Document size | Retrieval accuracy | Response quality |
|---------------|--------------------|------------------|
| 100K tokens | 92% | Good |
| 500K tokens | 84% | Good |
| 1M tokens | 71% | Acceptable |
| 5M tokens | 53% | Degraded |
| 10M tokens | 38% | Poor |
The 10M context window works in the sense that the model doesn't crash. But at 5M+ tokens, retrieval accuracy drops below useful levels. At 10M, the model misses the needle more often than it finds it (38% accuracy).
Practical useful range: about 1M tokens, which is still excellent for an open model.
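For readers who want to run a similar check, here is a sketch of how a needle-in-a-haystack test is commonly structured. It is not the harness behind the numbers above. The filler text, the needle, and the `ask_model` callable are all hypothetical, and the stand-in model simply truncates its input so the script runs without any API.

```python
import re

FILLER = "The sky was a uniform grey that afternoon."
NEEDLE = "The secret launch code is 7421."
QUESTION = "What is the secret launch code?"
ANSWER = "7421"

def build_haystack(total_sentences, depth):
    """Bury NEEDLE at a fractional depth (0.0 = start, 1.0 = end) in filler text."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(depth * total_sentences), NEEDLE)
    return " ".join(sentences)

def retrieval_accuracy(ask_model, total_sentences, depths):
    """Fraction of needle positions at which the model's reply contains ANSWER."""
    hits = 0
    for depth in depths:
        prompt = build_haystack(total_sentences, depth) + "\n\n" + QUESTION
        reply = ask_model(prompt)
        hits += ANSWER in reply
    return hits / len(depths)

def make_truncating_model(window_chars):
    """Stand-in for a real model: it only 'sees' the last window_chars characters."""
    def model(prompt):
        visible = prompt[-window_chars:]
        match = re.search(r"launch code is (\d+)", visible)
        return match.group(1) if match else "I don't know."
    return model

depths = [0.0, 0.5, 1.0]
full = retrieval_accuracy(make_truncating_model(10**9), 100, depths)
short = retrieval_accuracy(make_truncating_model(200), 100, depths)
print(f"full-context reader:  {full:.0%}")
print(f"200-character reader: {short:.0%}")
```

To test a real model, replace `make_truncating_model(...)` with a function that sends the prompt to your inference endpoint and returns the text reply, then scale `total_sentences` up to the document sizes you care about.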
My take
Meta's MoE pivot makes sense. DeepSeek proved that MoE architectures deliver frontier quality at a fraction of the inference cost. Llama 4 follows that playbook.
| Winner | Why |
|--------|-----|
| Inference cost-sensitive deployments | 3-4x cheaper than equivalent dense models |
| Long-context applications | Scout's 10M window is novel for open models |
| Edge/local deployment | 17B active params can run on consumer hardware |
| Loser | Why |
|-------|-----|
| Raw benchmark chasers | Maverick doesn't beat Llama 3.1 405B on everything |
| Fine-tuning community | MoE models are harder to fine-tune than dense models |
The MoE era of open source is officially here. Llama's shift validates what DeepSeek demonstrated: you don't need every parameter active for every token. Smart routing is the future.
My "active parameters" column in the spreadsheet is getting more use than the "total parameters" column. Funny how quickly the important metric changes.
-- dataku