Llama 3.1 405B: the first truly GPT-4 class open model. My benchmark data.

I've been tracking the gap between open source and closed source models for two years. Today, the gap effectively closed.

Meta AI released Llama 3.1 405B. Four hundred and five billion parameters. Open weights. A license that allows commercial use. And benchmark scores that match GPT-4.

This is the moment I predicted in December 2023. It arrived three months earlier than I expected.

Standard benchmark comparison

| Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet | GPT-4 Turbo | Llama 3.1 70B |
|-----------|---------------|--------|-------------------|-------------|---------------|
| MMLU (5-shot) | 87.3% | 88.7% | 88.7% | 86.4% | 83.6% |
| HumanEval | 89.0% | 90.2% | 92.0% | 87.1% | 80.5% |
| GSM8K | 96.8% | 95.8% | 96.4% | 92.0% | 95.1% |
| MATH | 73.8% | 76.6% | 71.1% | 52.9% | 68.0% |
| GPQA | 51.1% | 53.6% | 59.4% | 49.1% | 46.7% |
| ARC-Challenge | 96.9% | 96.7% | 96.7% | 96.4% | 94.8% |
| MGSM | 91.6% | 90.5% | 91.6% | 85.5% | 86.9% |
| BIG-Bench-Hard | 85.9% | 88.0% | 87.7% | 83.1% | 81.3% |
| IFEval | 86.0% | 85.4% | 86.9% | 83.5% | 83.4% |
| Multilingual MGSM (avg) | 88.6% | 89.1% | 86.9% | 82.3% | 83.7% |

Sources: Meta AI Llama 3.1 paper, OpenAI documentation, Anthropic Claude 3.5 Sonnet report, Hugging Face evaluations.

I count 10 standard benchmarks. Llama 3.1 405B beats GPT-4 Turbo on all 10. It beats GPT-4o on 4 of 10 (GSM8K, ARC-Challenge, MGSM, IFEval). It's within 2 points of GPT-4o on 3 more (MMLU, HumanEval, Multilingual MGSM).
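
The head-to-head counts can be recomputed straight from the table. This is a minimal sketch: the scores are copied from the table above, while the `COLUMN` keys, the `compare` helper, and the 2-point `margin` default are names chosen for illustration.

```python
# Scores (percent) copied from the benchmark table above:
# benchmark -> (Llama 3.1 405B, GPT-4o, GPT-4 Turbo)
SCORES = {
    "MMLU (5-shot)": (87.3, 88.7, 86.4),
    "HumanEval": (89.0, 90.2, 87.1),
    "GSM8K": (96.8, 95.8, 92.0),
    "MATH": (73.8, 76.6, 52.9),
    "GPQA": (51.1, 53.6, 49.1),
    "ARC-Challenge": (96.9, 96.7, 96.4),
    "MGSM": (91.6, 90.5, 85.5),
    "BIG-Bench-Hard": (85.9, 88.0, 83.1),
    "IFEval": (86.0, 85.4, 83.5),
    "Multilingual MGSM (avg)": (88.6, 89.1, 82.3),
}

# Hypothetical keys for this sketch, mapping a baseline to its tuple index.
COLUMN = {"gpt_4o": 1, "gpt_4_turbo": 2}


def compare(baseline, margin=2.0):
    """Return (wins, close): benchmarks where Llama 3.1 405B is ahead of
    `baseline`, and benchmarks where it is behind by at most `margin` points."""
    idx = COLUMN[baseline]
    wins, close = [], []
    for name, row in SCORES.items():
        deficit = row[idx] - row[0]  # positive means Llama is behind
        if deficit < 0:
            wins.append(name)
        elif deficit <= margin:
            close.append(name)
    return wins, close


for baseline in COLUMN:
    wins, close = compare(baseline)
    print(f"vs {baseline}: ahead on {len(wins)} of {len(SCORES)}; "
          f"within 2 points on {len(close)} more")
```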

Against GPT-4 Turbo specifically: Llama 3.1 405B is clearly better. The open source model that matches GPT-4 has arrived, and it surpasses GPT-4 Turbo, the November 2023 version.

Against GPT-4o and Claude 3.5 Sonnet: it's competitive but not quite at parity. The gaps on MMLU (87.3 vs 88.7 for both) and HumanEval (89.0 vs 90.2 for GPT-4o and 92.0 for Claude 3.5 Sonnet) are small but real.

My custom evaluation

| Category | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|----------|---------------|--------|--------------------|
| Factual Q&A (50 prompts) | 4.04 | 4.18 | 4.22 |
| Code generation (50) | 4.12 | 4.28 | 4.48 |
| Creative writing (50) | 3.82 | 4.02 | 4.28 |
| Summarization (50) | 4.08 | 4.14 | 4.32 |
| Reasoning (50) | 4.18 | 4.22 | 4.34 |
| Overall | 4.05 | 4.17 | 4.33 |

Source: My evaluation, 250 prompts per model, blind rating, July 2024.

On my evaluation, Llama 3.1 405B (4.05) trails GPT-4o (4.17) and Claude 3.5 Sonnet (4.33). The gap to GPT-4o is 0.12 points. To Claude 3.5 Sonnet, it's 0.28 points.
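
The Overall row is consistent with an unweighted mean of the five category scores, which is reasonable since every category has 50 prompts. The sketch below checks that under this assumption; the dictionary keys and the `overall` helper are illustrative names, not part of the evaluation itself.

```python
from statistics import mean

# Category scores copied from the table above, in row order:
# Factual Q&A, Code generation, Creative writing, Summarization, Reasoning.
CATEGORY_SCORES = {
    "Llama 3.1 405B": [4.04, 4.12, 3.82, 4.08, 4.18],
    "GPT-4o": [4.18, 4.28, 4.02, 4.14, 4.22],
    "Claude 3.5 Sonnet": [4.22, 4.48, 4.28, 4.32, 4.34],
}


def overall(model):
    """Unweighted mean of the category scores, rounded to 2 decimals.
    Assumes equal weighting (each category has 50 prompts)."""
    return round(mean(CATEGORY_SCORES[model]), 2)


llama = overall("Llama 3.1 405B")
for model in CATEGORY_SCORES:
    score = overall(model)
    print(f"{model}: {score:.2f} (gap to Llama 3.1 405B: {score - llama:+.2f})")
```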

My evaluation tends to penalize models that are less polished in their outputs. Llama 3.1 405B's raw capability (as shown in benchmarks) is very close to GPT-4o, but the instruction following and response formatting aren't quite as refined. This is typical for open models vs models that have had extensive RLHF and deployment fine-tuning.

The cost picture

| Model | Input $/M tokens | Output $/M tokens | My score | Score per dollar (output) |
|-------|-----------------|-------------------|----------|--------------------------|
| Llama 3.1 405B (Together AI) | $3.50 | $3.50 | 4.05 | 1.157 |
| Llama 3.1 405B (Fireworks AI) | $3.00 | $3.00 | 4.05 | 1.350 |
| GPT-4o | $5.00 | $15.00 | 4.17 | 0.278 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 4.33 | 0.289 |
| Llama 3.1 70B (Together AI) | $0.88 | $0.88 | 3.68 | 4.182 |

Sources: Provider pricing pages, my evaluation scores, July 2024.

Llama 3.1 405B hosted on Fireworks AI at $3.00/M output tokens gives you a score-per-dollar of 1.350. GPT-4o gives 0.278. That's 4.9x more value per dollar for the open model.
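
For anyone who wants to plug in their own prices, the score-per-dollar column is simply my score divided by the output price per million tokens. A minimal sketch, with prices and scores copied from the table and `score_per_dollar` as an illustrative helper name:

```python
# model -> (output $ per million tokens, my evaluation score), from the table above
PRICING = {
    "Llama 3.1 405B (Together AI)": (3.50, 4.05),
    "Llama 3.1 405B (Fireworks AI)": (3.00, 4.05),
    "GPT-4o": (15.00, 4.17),
    "Claude 3.5 Sonnet": (15.00, 4.33),
    "Llama 3.1 70B (Together AI)": (0.88, 3.68),
}


def score_per_dollar(model):
    """Evaluation score divided by the output price per million tokens."""
    output_price, score = PRICING[model]
    return score / output_price


for model in PRICING:
    print(f"{model}: {score_per_dollar(model):.3f}")

# Value ratio of the cheapest hosted 405B option to GPT-4o.
ratio = score_per_dollar("Llama 3.1 405B (Fireworks AI)") / score_per_dollar("GPT-4o")
print(f"Fireworks 405B vs GPT-4o: {ratio:.1f}x")
```

This uses output prices only, as the table does. A workload dominated by input tokens would see a smaller ratio, since GPT-4o input is $5.00/M against $3.00/M for the Fireworks-hosted 405B.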

And if you self-host (which requires serious hardware for a 405B model), the cost per million tokens drops further.

Self-hosting the 405B: what it actually takes

| Configuration | Hardware | Monthly cost (cloud rental) | Est. tokens/sec | $/M output tokens |
|--------------|---------|---------------------------|-----------------|-------------------|
| 8x A100 80GB (FP16) | Lambda Labs | ~$10,000/month | ~15 | ~$0.85 |
| 8x H100 80GB (FP16) | CoreWeave | ~$20,000/month | ~30 | ~$0.65 |
| 8x A100 80GB (INT8 quant) | Lambda Labs | ~$10,000/month | ~28 | ~$0.46 |
| 4x A100 80GB (4-bit quant) | RunPod | ~$5,200/month | ~12 | ~$0.55 |

Sources: Cloud provider pricing, community throughput reports, vLLM benchmarks, my estimates.

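Self-hosting Llama 3.1 405B takes a full 8-GPU node at minimum unless you quantize aggressively. One caveat on the FP16 rows: the FP16 weights alone are roughly 810 GB (405B parameters at 2 bytes each), more than the 640 GB on 8x 80GB GPUs, so those configurations need CPU offloading or a second node in practice. INT8 (about 405 GB) fits on 8 GPUs, and with 4-bit quantization (about 203 GB) you can squeeze it onto 4x A100s at reduced quality. The monthly cloud rental is $5,200-$20,000 depending on configuration.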
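
As a sanity check on which configurations can hold the model, here is a rough weights-only memory estimate. It is a sketch under stated assumptions: it ignores KV cache, activations, and framework overhead (so real requirements are higher), and it treats an "80GB" GPU as 80 decimal gigabytes. The helper names are illustrative.

```python
PARAMS = 405e9  # parameter count of Llama 3.1 405B

# Bytes per parameter for each weight precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(precision):
    """Approximate size of the model weights in decimal GB."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def weights_fit(precision, num_gpus, gpu_gb=80):
    """True if the weights alone fit in the combined GPU memory.
    Necessary but not sufficient: KV cache and activations need room too."""
    return weight_gb(precision) <= num_gpus * gpu_gb


for precision, gpus in [("fp16", 8), ("int8", 8), ("int4", 4)]:
    print(f"{precision} on {gpus}x 80GB: {weight_gb(precision):.1f} GB of weights, "
          f"{gpus * 80} GB available, fits={weights_fit(precision, gpus)}")
```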

At $0.46-0.85/M output tokens self-hosted vs $3.00-3.50 via hosted APIs, self-hosting saves 4-7x. But you need enough volume to justify the fixed costs. The break-even is roughly 3-5 billion tokens per month.
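
The break-even volume is the monthly rental divided by the hosted price you would otherwise pay. The sketch below is a simplification: it treats the rental as the only self-hosting cost (no engineering time, no idle capacity) and uses a 30-day month. Both function names are illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, an approximation


def break_even_million_tokens(monthly_rental_usd, hosted_price_per_m):
    """Monthly volume (in millions of tokens) at which a fixed rental
    costs the same as paying a hosted API per token."""
    return monthly_rental_usd / hosted_price_per_m


def required_tokens_per_sec(million_tokens_per_month):
    """Sustained aggregate throughput needed to serve that volume 24/7."""
    return million_tokens_per_month * 1e6 / SECONDS_PER_MONTH


for rental in (5_200, 10_000, 20_000):
    volume = break_even_million_tokens(rental, hosted_price_per_m=3.00)
    print(f"${rental:,}/month: break-even at {volume / 1000:.1f}B tokens/month, "
          f"about {required_tokens_per_sec(volume):,.0f} tokens/sec sustained")
```

For the ~$10,000/month configurations this lands around 3 billion tokens per month, which means sustaining well over a thousand tokens per second across all requests. That assumes the Est. tokens/sec figures in the table are per request stream and that the node serves many streams in parallel; this is an assumption, not something the sources state.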

Why this is a milestone

| Date | Best open source model | Open MMLU | Best closed source model | Closed MMLU | Gap |
|------|----------------------|------|--------------------|------|-----|
| Jan 2023 | BLOOM-176B | 39.3% | GPT-4 | 86.4% | 47.1 pts |
| Jul 2023 | Llama 2 70B | 68.9% | GPT-4 | 86.4% | 17.5 pts |
| Dec 2023 | Mixtral 8x7B | 70.6% | GPT-4 Turbo | 86.4% | 15.8 pts |
| Apr 2024 | Llama 3 70B | 79.5% | GPT-4o | 88.7% | 9.2 pts |
| Jul 2024 | Llama 3.1 405B | 87.3% | GPT-4o | 88.7% | 1.4 pts |

Source: Model papers, Hugging Face leaderboard, my tracking data.

From a 47-point gap to a 1.4-point gap in 18 months. On MMLU, the best open source model is now within a point and a half of the best closed model.

This changes the AI industry. Not because everyone will self-host a 405B model (most won't). But because the existence of an open GPT-4-class model means:

- Hosted providers can offer GPT-4-class quality at much lower margins
- Companies with data privacy requirements have a frontier-quality option they can run on-premises
- Researchers can study, fine-tune, and modify a frontier-class model
- The pricing power of closed source APIs is permanently reduced

I've been saying "open source is catching up" for two years. Today I can say: open source has caught up. On benchmarks, the gap is closed. On real-world quality, a small gap remains. On pricing, open source already won.

The spreadsheet converged. I need a new chart to track.

-- dataku
