Llama 3.1 405B: the first truly GPT-4 class open model. My benchmark data.
Meta released a 405 billion parameter model under an open license. I ran it on 10 standard benchmarks and 5 of my own. It matches GPT-4 within margin of error on 7 of 15. This is a milestone.
I've been tracking the gap between open source and closed source models for two years. Today, the gap effectively closed.
Meta AI released Llama 3.1 405B. Four hundred and five billion parameters. Open weights. A license that allows commercial use. And benchmark scores that match GPT-4.
This is the moment I predicted in December 2023. It arrived three months earlier than I expected.
Standard benchmark comparison
| Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet | GPT-4 Turbo | Llama 3.1 70B |
|-----------|---------------|--------|-------------------|-------------|---------------|
| MMLU (5-shot) | 87.3% | 88.7% | 88.7% | 86.4% | 83.6% |
| HumanEval | 89.0% | 90.2% | 92.0% | 87.1% | 80.5% |
| GSM8K | 96.8% | 95.8% | 96.4% | 92.0% | 95.1% |
| MATH | 73.8% | 76.6% | 71.1% | 52.9% | 68.0% |
| GPQA | 51.1% | 53.6% | 59.4% | 49.1% | 46.7% |
| ARC-Challenge | 96.9% | 96.7% | 96.7% | 96.4% | 94.8% |
| MGSM | 91.6% | 90.5% | 91.6% | 85.5% | 86.9% |
| BIG-Bench-Hard | 85.9% | 88.0% | 87.7% | 83.1% | 81.3% |
| IFEval | 86.0% | 85.4% | 86.9% | 83.5% | 83.4% |
| Multilingual MGSM (avg) | 88.6% | 89.1% | 86.9% | 82.3% | 83.7% |
Sources: Meta AI Llama 3.1 paper, OpenAI documentation, Anthropic Claude 3.5 Sonnet report, Hugging Face evaluations.
I count 10 standard benchmarks. Llama 3.1 405B beats GPT-4 Turbo on all 10. It beats GPT-4o on 4 of 10 (GSM8K, ARC-Challenge, MGSM, IFEval). It's within 2 points of GPT-4o on 3 more (MMLU, HumanEval, Multilingual MGSM).
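If you want to re-run the tally yourself, here is a minimal sketch. The scores are copied from the table above; the `tally` helper and the 2-point `close_margin` default are my own illustrative choices, not part of any benchmark tooling.

```python
# Scores copied from the comparison table, in percent:
# (Llama 3.1 405B, GPT-4o, GPT-4 Turbo)
SCORES = {
    "MMLU (5-shot)": (87.3, 88.7, 86.4),
    "HumanEval": (89.0, 90.2, 87.1),
    "GSM8K": (96.8, 95.8, 92.0),
    "MATH": (73.8, 76.6, 52.9),
    "GPQA": (51.1, 53.6, 49.1),
    "ARC-Challenge": (96.9, 96.7, 96.4),
    "MGSM": (91.6, 90.5, 85.5),
    "BIG-Bench-Hard": (85.9, 88.0, 83.1),
    "IFEval": (86.0, 85.4, 83.5),
    "Multilingual MGSM (avg)": (88.6, 89.1, 82.3),
}


def tally(scores, close_margin=2.0):
    """Return the benchmarks where Llama wins, and where it trails GPT-4o narrowly."""
    beats_turbo = [name for name, (llama, _, turbo) in scores.items() if llama > turbo]
    beats_4o = [name for name, (llama, gpt4o, _) in scores.items() if llama > gpt4o]
    # "Close" means Llama does not win but trails by at most close_margin points.
    close_to_4o = [
        name
        for name, (llama, gpt4o, _) in scores.items()
        if llama <= gpt4o and gpt4o - llama <= close_margin
    ]
    return beats_turbo, beats_4o, close_to_4o


beats_turbo, beats_4o, close_to_4o = tally(SCORES)
print(f"Beats GPT-4 Turbo on {len(beats_turbo)} of {len(SCORES)}")
print(f"Beats GPT-4o on {len(beats_4o)}: {beats_4o}")
print(f"Within 2 points of GPT-4o on {len(close_to_4o)} more: {close_to_4o}")
```

Changing `close_margin` is an easy way to see how sensitive the "close enough" count is to where you draw the line.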
Against GPT-4 Turbo specifically: Llama 3.1 405B is clearly better. The open source model that matches GPT-4 has arrived, and it surpasses the November 2023 GPT-4 Turbo release.
Against GPT-4o and Claude 3.5 Sonnet: it's competitive but not quite at parity. The gaps on MMLU (87.3 vs 88.7 for both) and HumanEval (89.0 vs 90.2 for GPT-4o and 92.0 for Claude 3.5 Sonnet) are small but real.
My custom evaluation
| Category | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|----------|---------------|--------|--------------------|
| Factual Q&A (50 prompts) | 4.04 | 4.18 | 4.22 |
| Code generation (50) | 4.12 | 4.28 | 4.48 |
| Creative writing (50) | 3.82 | 4.02 | 4.28 |
| Summarization (50) | 4.08 | 4.14 | 4.32 |
| Reasoning (50) | 4.18 | 4.22 | 4.34 |
| Overall | 4.05 | 4.17 | 4.33 |
Source: My evaluation, 250 prompts per model, blind rating, July 2024.
On my evaluation, Llama 3.1 405B (4.05) trails GPT-4o (4.17) and Claude 3.5 Sonnet (4.33). The gap to GPT-4o is 0.12 points. To Claude 3.5 Sonnet, it's 0.28 points.
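The overall row is just the unweighted mean of the five category scores, since every category has 50 prompts. A minimal sketch of that arithmetic, with the category scores copied from the table (the variable names are illustrative):

```python
from statistics import mean

# Category scores in table order:
# Factual Q&A, Code generation, Creative writing, Summarization, Reasoning
CATEGORY_SCORES = {
    "Llama 3.1 405B": [4.04, 4.12, 3.82, 4.08, 4.18],
    "GPT-4o": [4.18, 4.28, 4.02, 4.14, 4.22],
    "Claude 3.5 Sonnet": [4.22, 4.48, 4.28, 4.32, 4.34],
}

# Equal weighting is valid here because each category has the same prompt count.
overall = {model: round(mean(scores), 2) for model, scores in CATEGORY_SCORES.items()}

# Gap between each closed model and Llama 3.1 405B on the overall score.
gap_to_gpt4o = round(overall["GPT-4o"] - overall["Llama 3.1 405B"], 2)
gap_to_claude = round(overall["Claude 3.5 Sonnet"] - overall["Llama 3.1 405B"], 2)

print(overall)
print(f"Gap to GPT-4o: {gap_to_gpt4o}, gap to Claude 3.5 Sonnet: {gap_to_claude}")
```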
My evaluation tends to penalize models that are less polished in their outputs. Llama 3.1 405B's raw capability (as shown in benchmarks) is very close to GPT-4o, but the instruction following and response formatting aren't quite as refined. This is typical for open models vs models that have had extensive RLHF and deployment fine-tuning.
The cost picture
| Model | Input $/M tokens | Output $/M tokens | My score | Score per dollar (output) |
|-------|-----------------|-------------------|----------|--------------------------|
| Llama 3.1 405B (Together AI) | $3.50 | $3.50 | 4.05 | 1.157 |
| Llama 3.1 405B (Fireworks AI) | $3.00 | $3.00 | 4.05 | 1.350 |
| GPT-4o | $5.00 | $15.00 | 4.17 | 0.278 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 4.33 | 0.289 |
| Llama 3.1 70B (Together AI) | $0.88 | $0.88 | 3.68 | 4.182 |
Sources: Provider pricing pages, my evaluation scores, July 2024.
Llama 3.1 405B hosted on Fireworks AI at $3.00/M output tokens gives you a score-per-dollar of 1.350. GPT-4o gives 0.278. That's 4.9x more value per dollar for the open model.
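Score per dollar here is simply my evaluation score divided by the output price per million tokens. A minimal sketch, with prices and scores copied from the table (the `score_per_dollar` helper is an illustrative name):

```python
# (output price in $ per million tokens, my evaluation score)
PRICING = {
    "Llama 3.1 405B (Together AI)": (3.50, 4.05),
    "Llama 3.1 405B (Fireworks AI)": (3.00, 4.05),
    "GPT-4o": (15.00, 4.17),
    "Claude 3.5 Sonnet": (15.00, 4.33),
    "Llama 3.1 70B (Together AI)": (0.88, 3.68),
}


def score_per_dollar(score, output_price_per_m):
    """Evaluation score bought per dollar of output tokens (per million)."""
    return score / output_price_per_m


value = {
    name: round(score_per_dollar(score, price), 3)
    for name, (price, score) in PRICING.items()
}

# How many times more score-per-dollar the Fireworks-hosted 405B gives than GPT-4o.
ratio = value["Llama 3.1 405B (Fireworks AI)"] / value["GPT-4o"]

for name, v in value.items():
    print(f"{name}: {v:.3f}")
print(f"Fireworks 405B vs GPT-4o: {ratio:.1f}x")
```

This metric only looks at output pricing, so workloads dominated by long inputs would shift the comparison.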
And if you self-host (which requires serious hardware for a 405B model), the cost per million tokens drops further.
Self-hosting the 405B: what it actually takes
| Configuration | Hardware | Monthly cost (cloud rental) | Est. tokens/sec | $/M output tokens |
|--------------|---------|---------------------------|-----------------|-------------------|
| 8x A100 80GB (FP16) | Lambda Labs | ~$10,000/month | ~15 | ~$0.85 |
| 8x H100 80GB (FP16) | CoreWeave | ~$20,000/month | ~30 | ~$0.65 |
| 8x A100 80GB (INT8 quant) | Lambda Labs | ~$10,000/month | ~28 | ~$0.46 |
| 4x A100 80GB (4-bit quant) | RunPod | ~$5,200/month | ~12 | ~$0.55 |
Sources: Cloud provider pricing, community throughput reports, vLLM benchmarks, my estimates.
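Self-hosting Llama 3.1 405B takes a full 8-GPU node for everything except the most aggressive quantization. One caveat on the FP16 rows: 405B parameters at 2 bytes each is about 810 GB of weights, more than the 640 GB on an 8x 80GB node, so treat those rows as estimates. INT8 (about 405 GB) fits on 8x A100s, and with 4-bit quantization you can squeeze it onto 4x A100s at reduced quality. The monthly cloud rental is $5,200-$20,000 depending on configuration.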
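A quick way to sanity-check any of these configurations is to compare the size of the weights against total GPU memory. This is a rough sketch that counts weights only; it ignores KV cache, activations, and framework overhead, so real headroom is smaller. The helper names are illustrative.

```python
PARAMS = 405e9  # parameter count of Llama 3.1 405B

# Bytes per parameter at each precision (weights only).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "4-bit": 0.5}


def weight_gb(precision):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def fits(precision, num_gpus, gpu_gb=80):
    """True if the weights alone fit in the combined GPU memory."""
    return weight_gb(precision) <= num_gpus * gpu_gb


for precision, num_gpus in [("FP16", 8), ("INT8", 8), ("4-bit", 4)]:
    total = num_gpus * 80
    print(
        f"{precision} on {num_gpus}x 80GB: "
        f"{weight_gb(precision):.1f} GB of weights vs {total} GB, "
        f"fits={fits(precision, num_gpus)}"
    )
```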
At $0.46-0.85/M output tokens self-hosted vs $3.00-3.50 via hosted APIs, self-hosting saves 4-7x. But you need enough volume to justify the fixed costs. The break-even is roughly 3-5 billion tokens per month.
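The break-even point can be sketched as the monthly volume at which hosted API spend equals the fixed rental bill. This is a deliberately simple model: it assumes the rented node can actually serve that volume, and it ignores engineering time and idle capacity. Function names are illustrative.

```python
def break_even_tokens_m(monthly_cost, hosted_price_per_m):
    """Monthly volume (millions of tokens) where rental cost equals hosted API spend."""
    return monthly_cost / hosted_price_per_m


def effective_price_per_m(monthly_cost, monthly_tokens_m):
    """Self-hosted $/M tokens at a given monthly volume (millions of tokens)."""
    return monthly_cost / monthly_tokens_m


# Rental costs and hosted prices taken from the tables above.
for monthly_cost in (5_200, 10_000, 20_000):
    for hosted_price in (3.00, 3.50):
        volume_b = break_even_tokens_m(monthly_cost, hosted_price) / 1_000
        print(
            f"${monthly_cost:,}/month vs ${hosted_price:.2f}/M hosted: "
            f"break-even at {volume_b:.2f}B tokens/month"
        )
```

With the ~$10,000/month 8x A100 rental and the $3.00/M hosted price, this comes out to about 3.3 billion tokens per month. Below that volume, `effective_price_per_m` shows the self-hosted price per million tokens climbing above the hosted one.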
Why this is a milestone
| Date | Best open source model | MMLU | Best closed source | MMLU | Gap |
|------|----------------------|------|--------------------|------|-----|
| Jan 2023 | BLOOM-176B | 39.3% | GPT-4 | 86.4% | 47.1 pts |
| Jul 2023 | Llama 2 70B | 68.9% | GPT-4 | 86.4% | 17.5 pts |
| Dec 2023 | Mixtral 8x7B | 70.6% | GPT-4 Turbo | 86.4% | 15.8 pts |
| Apr 2024 | Llama 3 70B | 79.5% | GPT-4o | 88.7% | 9.2 pts |
| Jul 2024 | Llama 3.1 405B | 87.3% | GPT-4o | 88.7% | 1.4 pts |
Source: Model papers, Hugging Face leaderboard, my tracking data.
From a 47-point gap to a 1.4-point gap in 18 months. On MMLU, the best open model is now within a point and a half of the best closed one.
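The gap column is the closed model's MMLU score minus the open model's. A minimal sketch that recomputes it from the table rows (the `TRACKING` structure is just an illustrative way to hold the data):

```python
# (date, best open model, open MMLU %, best closed model, closed MMLU %)
TRACKING = [
    ("Jan 2023", "BLOOM-176B", 39.3, "GPT-4", 86.4),
    ("Jul 2023", "Llama 2 70B", 68.9, "GPT-4", 86.4),
    ("Dec 2023", "Mixtral 8x7B", 70.6, "GPT-4 Turbo", 86.4),
    ("Apr 2024", "Llama 3 70B", 79.5, "GPT-4o", 88.7),
    ("Jul 2024", "Llama 3.1 405B", 87.3, "GPT-4o", 88.7),
]

# Gap in percentage points, rounded to one decimal to match the table.
gaps = [
    (date, round(closed_mmlu - open_mmlu, 1))
    for date, _, open_mmlu, _, closed_mmlu in TRACKING
]

for date, gap in gaps:
    print(f"{date}: {gap} pts")
```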
This changes the AI industry. Not because everyone will self-host a 405B model (most won't). But because the existence of an open GPT-4-class model means:
- Hosted providers can offer GPT-4-class quality at much lower margins
- Companies with data privacy requirements have a frontier-quality option they can run on-premises
- Researchers can study, fine-tune, and modify a frontier-class model
- The pricing power of closed source APIs is permanently reduced
I've been saying "open source is catching up" for two years. Today I can say: open source has caught up. On benchmarks, the gap is closed. On real-world quality, a small gap remains. On pricing, open source already won.
The spreadsheet converged. I need a new chart to track.
-- dataku