Llama 4 405B vs Llama 3.1 405B: same size, very different model
Meta kept the size but changed the architecture. Llama 4 405B uses a mixture-of-experts (MoE) design, so only ~100B parameters are active per token. I benchmarked both on 10 tasks. Llama 4 is faster and scores roughly 8-12 points higher on most coding tasks. Training quality over brute force.
Same parameter count. Same name lineage. Completely different model.
Meta AI released Llama 4 with a 405B variant that matches the 3.1 generation's largest model in total size. But it's a mixture-of-experts (MoE) model now. Only about 100B parameters are active per token, vs all 405B in the dense Llama 3.1.
I ran them head to head.
Architecture comparison
| Spec | Llama 4 405B | Llama 3.1 405B |
|------|-------------|----------------|
| Total parameters | 405B | 405B |
| Active parameters | ~100B | 405B |
| Architecture | MoE (64 experts, 4 active) | Dense transformer |
| Training tokens | Not disclosed | 15T |
| Context window | 128K | 128K |
| Quantization-friendly | Yes | Limited |
Sources: Meta AI, Hugging Face model cards.
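To make "64 experts, 4 active" concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is an illustration of the general technique, not Meta's actual router: the function name `top_k_route` and the random logits standing in for a learned router are assumptions.

```python
import math
import random

NUM_EXPERTS = 64  # from the architecture table
TOP_K = 4         # experts active per token, from the architecture table


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    # Indices of the k largest logits, highest first.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    # Softmax over the chosen logits only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


random.seed(0)
# Hypothetical router output for one token; a real router is a learned layer.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = top_k_route(logits)

for expert_id, weight in routing:
    print(f"expert {expert_id}: weight {weight:.3f}")
print(f"experts used for this token: {len(routing)} of {NUM_EXPERTS}")
```

Each token only runs through the 4 selected expert blocks. Note that 4 of 64 experts is about 6% of the expert weights, while the table lists ~100B of 405B active; presumably the rest comes from layers that run for every token (attention, embeddings), though the sources don't break that down.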
Benchmark comparison
| Benchmark | Llama 4 405B | Llama 3.1 405B | Delta |
|-----------|-------------|----------------|-------|
| MMLU | 90.1% | 87.3% | +2.8 |
| HumanEval | 93.4% | 89.0% | +4.4 |
| MATH | 82.6% | 73.8% | +8.8 |
| GPQA Diamond | 58.4% | 51.1% | +7.3 |
| SWE-bench Verified | 46.2% | N/A | New |
| IFEval | 89.1% | 86.0% | +3.1 |
| LiveCodeBench | 56.8% | 44.3% | +12.5 |
| Codeforces | 1,680 | 1,210 | +470 |
Sources: Meta AI Llama 4 benchmarks, Together AI, LMSYS Chatbot Arena.
Llama 4 405B beats 3.1 405B on every benchmark where both models have a score. On the percentage-scored benchmarks, the improvements range from +2.8 points (MMLU) to +12.5 points (LiveCodeBench), and the Codeforces rating jumps by 470.
The coding and math gains are the biggest story: HumanEval +4.4, MATH +8.8, LiveCodeBench +12.5. Meta appears to have invested heavily in code and math training data.
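The deltas in the table are percentage points, not relative percentages, and the two can look very different. A small sketch that computes both from the table's scores (the helper name `gains` is just for illustration):

```python
# (benchmark, Llama 4 405B score, Llama 3.1 405B score), copied from the table
scores = [
    ("MMLU", 90.1, 87.3),
    ("HumanEval", 93.4, 89.0),
    ("MATH", 82.6, 73.8),
    ("GPQA Diamond", 58.4, 51.1),
    ("IFEval", 89.1, 86.0),
    ("LiveCodeBench", 56.8, 44.3),
]


def gains(new, old):
    """Return (absolute gain in points, relative gain in percent)."""
    points = round(new - old, 1)
    relative = round(100 * (new - old) / old, 1)
    return points, relative


for name, new, old in scores:
    points, relative = gains(new, old)
    print(f"{name}: +{points} points, +{relative}% relative")
```

A gain on a benchmark with a low baseline (LiveCodeBench) is much larger in relative terms than the same-sized point gain on a near-saturated one (MMLU, HumanEval).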
Inference speed comparison
| Metric | Llama 4 405B (MoE) | Llama 3.1 405B (Dense) |
|--------|--------------------|-----------------------|
| Tokens/sec (8xA100) | 42 t/s | 8 t/s |
| Tokens/sec (8xH100) | 68 t/s | 18 t/s |
| Memory requirement | ~220 GB (FP16) | ~810 GB (FP16) |
| Min hardware (FP16) | 4xH100 (320GB) | 8xH100 (640GB) |
| Min hardware (Q4) | 2xH100 (160GB) | 4xH100 (320GB) |
Sources: Together AI, my inference testing.
The MoE architecture gives Llama 4 405B a 3.8-5.2x speed advantage on the same hardware: about 5.2x on 8xA100s (42 t/s vs 8 t/s) and about 3.8x on 8xH100s (68 t/s vs 18 t/s). It's not even close.
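For reference, the speedup range falls straight out of the throughput table. The sketch below also converts the rates into wait time for a hypothetical 1,000-token response, assuming the t/s figures are per-request decode rates (the table doesn't say whether they are single-stream or batched).

```python
# Tokens/sec from the table: hardware -> (MoE, dense)
throughput = {
    "8xA100": (42, 8),
    "8xH100": (68, 18),
}

# Speedup of the MoE model over the dense model on the same hardware.
speedups = {hw: moe / dense for hw, (moe, dense) in throughput.items()}


def seconds_for(tokens, tokens_per_sec):
    """Time to generate `tokens` at a steady decode rate."""
    return tokens / tokens_per_sec


RESPONSE_TOKENS = 1000  # hypothetical response length, for illustration only

for hw, (moe, dense) in throughput.items():
    print(
        f"{hw}: {speedups[hw]:.2f}x faster, "
        f"{seconds_for(RESPONSE_TOKENS, moe):.0f}s vs "
        f"{seconds_for(RESPONSE_TOKENS, dense):.0f}s per {RESPONSE_TOKENS} tokens"
    )
```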
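The fast-memory requirement is also much lower. In FP16, the full weights are the same size on disk for both models (405B parameters × 2 bytes ≈ 810GB). The difference is what has to be resident at once: ~220GB for Llama 4 (only active experts + routing need fast access) vs ~810GB for 3.1 (every parameter is used on every token, so all of it must be in memory).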
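The memory figures follow from simple parameters-times-bytes arithmetic. A back-of-the-envelope sketch, counting weights only and ignoring KV cache, activations, and runtime overhead (the 0.5 bytes/param value for Q4 is the usual approximation, not a figure from the sources):

```python
# Approximate bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billions, dtype):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[dtype]


dense_fp16 = weight_memory_gb(405, "fp16")   # every parameter resident
active_fp16 = weight_memory_gb(100, "fp16")  # ~100B active parameters only
full_q4 = weight_memory_gb(405, "q4")        # all 405B weights at 4-bit

print(f"405B @ FP16: {dense_fp16:.0f} GB")
print(f"~100B active @ FP16: {active_fp16:.0f} GB")
print(f"405B @ Q4: {full_q4:.1f} GB")
```

The ~100B active parameters come to about 200GB at FP16, which lands near the ~220GB figure quoted above once some headroom is added.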
This means you can run Llama 4 405B on half the GPUs that Llama 3.1 405B required. Lower hardware bar, lower cost, faster speed.
Cost per million tokens
| Setup | Llama 4 405B | Llama 3.1 405B | Savings |
|-------|-------------|----------------|---------|
| Together AI (API) | $0.80/M | $2.40/M | 67% |
| Self-hosted (8xH100 rental) | ~$0.35/M | ~$1.20/M | 71% |
| Self-hosted (4xH100 rental) | ~$0.28/M | N/A (can't fit) | N/A |
Sources: Together AI, my cost calculations based on rental pricing.
67-71% cost reduction for a model that's better on every benchmark. MoE delivers exactly what the theory predicts: same total knowledge, fewer active computations, lower cost.
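The savings column can be reproduced from the per-million-token prices. The sketch below also prices out a hypothetical workload of 500M tokens per month; that volume is an illustrative assumption, not a number from my testing.

```python
# $ per million tokens from the table: setup -> (Llama 4 405B, Llama 3.1 405B)
prices = {
    "Together AI (API)": (0.80, 2.40),
    "Self-hosted (8xH100 rental)": (0.35, 1.20),
}


def savings_pct(new_price, old_price):
    """Percentage saved by moving from old_price to new_price."""
    return 100 * (old_price - new_price) / old_price


MONTHLY_TOKENS_M = 500  # hypothetical volume, in millions of tokens

for setup, (new_price, old_price) in prices.items():
    pct = savings_pct(new_price, old_price)
    new_bill = MONTHLY_TOKENS_M * new_price
    old_bill = MONTHLY_TOKENS_M * old_price
    print(f"{setup}: {pct:.0f}% cheaper (${new_bill:,.0f} vs ${old_bill:,.0f} per month)")
```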
My 10-task evaluation
| Task | Llama 4 405B | Llama 3.1 405B |
|------|-------------|----------------|
| Python function generation | 88% | 78% |
| Bug fix from description | 82% | 72% |
| SQL query from natural language | 90% | 84% |
| Document summarization | 86% | 82% |
| Math word problems | 80% | 68% |
| Code review | 84% | 76% |
| Data analysis | 82% | 78% |
| Creative writing | 78% | 76% |
| Instruction following | 88% | 84% |
| Multi-step reasoning | 76% | 64% |
| Average | 83.4% | 76.2% |
+7.2 points on average. The biggest gains are on coding (+10 on both function generation and bug fixing) and reasoning (+12 on both math word problems and multi-step reasoning). Creative writing improved the least (+2), which makes sense: MoE and better training data help most with structured, logical tasks.
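The averages and per-task gaps can be recomputed from the table in a few lines, which makes it easy to re-rank tasks if you swap in your own results:

```python
# (task, Llama 4 405B %, Llama 3.1 405B %), copied from the table
results = [
    ("Python function generation", 88, 78),
    ("Bug fix from description", 82, 72),
    ("SQL query from natural language", 90, 84),
    ("Document summarization", 86, 82),
    ("Math word problems", 80, 68),
    ("Code review", 84, 76),
    ("Data analysis", 82, 78),
    ("Creative writing", 78, 76),
    ("Instruction following", 88, 84),
    ("Multi-step reasoning", 76, 64),
]

avg_new = sum(new for _, new, _ in results) / len(results)
avg_old = sum(old for _, _, old in results) / len(results)

# Per-task gain in points, largest first.
deltas = sorted(((new - old, task) for task, new, old in results), reverse=True)

print(f"Average: {avg_new:.1f}% vs {avg_old:.1f}% (+{avg_new - avg_old:.1f} points)")
for gain, task in deltas:
    print(f"+{gain:>2} {task}")
```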
The takeaway
Llama 4 405B is better than Llama 3.1 405B at everything I measured, while using roughly a quarter of the active parameters per token and requiring half the hardware.
This is what "training quality over brute force" looks like in the data. Meta didn't make a bigger model. They made a smarter one.
For anyone still running Llama 3.1 405B: upgrade. There's no reason not to. Better, faster, cheaper. The rare trifecta.
-- dataku