Llama 2 is here and it's actually good. My benchmark data.

Today's ikigai: watching open source cross a threshold.

Meta AI released Llama 2 yesterday and I've been up since 5am running benchmarks. Three model sizes (7B, 13B, 70B), a commercial license, and a 77-page paper that actually explains the training process. This is different from the original LLaMA leak. This is official, legal, and designed for commercial use.

I need to talk about the 70B model specifically because the numbers are startling.

Llama 2 70B vs GPT-3.5-turbo: my 8-task benchmark

I set up Llama 2 70B-chat on Together AI (they had it available within hours) and ran it against GPT-3.5-turbo on the same 8 task categories. 50 prompts per category, blind evaluation.

| Task | Llama 2 70B avg | GPT-3.5 avg | Winner | Margin (Llama 2 minus GPT-3.5) |
|------|----------------|-------------|--------|--------|
| Creative writing | 4.0 | 3.8 | Llama 2 | +0.2 |
| Summarization | 4.1 | 3.9 | Llama 2 | +0.2 |
| Code (Python) | 3.4 | 4.0 | GPT-3.5 | -0.6 |
| Factual Q&A | 3.6 | 3.8 | GPT-3.5 | -0.2 |
| Instruction following | 3.9 | 3.7 | Llama 2 | +0.2 |
| Reasoning | 3.3 | 3.6 | GPT-3.5 | -0.3 |
| Conversation | 4.0 | 3.9 | Llama 2 | +0.1 |
| Translation | 3.8 | 3.7 | Llama 2 | +0.1 |

Score: Llama 2 wins on 5 tasks, GPT-3.5 wins on 3. Blind evaluation by me and two raters.
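For anyone who wants to replicate the blind setup, here is a minimal sketch of one way to do it. It is not the exact tooling used for this post: the function names, the dictionary layout, and the fixed seed are all illustrative assumptions. The idea is simply to randomize which slot each model's response lands in, hand raters only the shuffled items, and keep the answer key hidden until scoring is finished.

```python
import random


def build_blind_batch(prompts, llama_outputs, gpt_outputs, seed=0):
    """Shuffle each response pair so raters cannot tell which model wrote what.

    Returns (rater_items, answer_key). Raters only see rater_items; the
    answer_key stays hidden until all scores are in. (Hypothetical helper.)
    """
    rng = random.Random(seed)  # fixed seed so the batch is reproducible
    rater_items, answer_key = [], []
    for idx, prompt in enumerate(prompts):
        pair = [("llama-2-70b-chat", llama_outputs[idx]),
                ("gpt-3.5-turbo", gpt_outputs[idx])]
        rng.shuffle(pair)  # random slot order for every prompt
        rater_items.append({"id": idx, "prompt": prompt,
                            "response_1": pair[0][1],
                            "response_2": pair[1][1]})
        answer_key.append({"id": idx,
                           "response_1": pair[0][0],
                           "response_2": pair[1][0]})
    return rater_items, answer_key


def unblind(scores, answer_key):
    """Map per-slot scores back to model names once rating is done."""
    by_model = {"llama-2-70b-chat": [], "gpt-3.5-turbo": []}
    for score, key in zip(scores, answer_key):
        by_model[key["response_1"]].append(score["response_1"])
        by_model[key["response_2"]].append(score["response_2"])
    return by_model
```

Averaging each list returned by `unblind` per category gives per-task averages of the kind shown in the table.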

I did NOT expect this. An open source model matching GPT-3.5-turbo on more than half my test categories? Six months ago this would have sounded delusional. But here we are.

Let me be precise about what's happening. Llama 2 70B isn't better than GPT-3.5 overall: averaged across all eight categories, the scores are 3.76 vs 3.80, so GPT-3.5 still edges it. But it's competitive, and on most tasks the gap is within the margin of rater disagreement.
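The overall averages and the 5-to-3 win count can be re-derived from the per-task scores. A quick sketch, with the numbers copied from the table above (the variable names are just for illustration):

```python
# (Llama 2 70B avg, GPT-3.5 avg) per task, copied from the benchmark table
scores = {
    "Creative writing": (4.0, 3.8),
    "Summarization": (4.1, 3.9),
    "Code (Python)": (3.4, 4.0),
    "Factual Q&A": (3.6, 3.8),
    "Instruction following": (3.9, 3.7),
    "Reasoning": (3.3, 3.6),
    "Conversation": (4.0, 3.9),
    "Translation": (3.8, 3.7),
}

# Unweighted mean across the eight task categories
llama_avg = sum(llama for llama, _ in scores.values()) / len(scores)
gpt_avg = sum(gpt for _, gpt in scores.values()) / len(scores)

# Count the categories each model wins outright
llama_wins = sum(llama > gpt for llama, gpt in scores.values())
gpt_wins = sum(gpt > llama for llama, gpt in scores.values())

print(f"Llama 2 70B: {llama_avg:.2f}, GPT-3.5: {gpt_avg:.2f}")
print(f"Wins: Llama 2 {llama_wins}, GPT-3.5 {gpt_wins}")
```

Note that the big loss on code (-0.6) is what pulls the Llama 2 average below GPT-3.5 despite winning more categories.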

The Meta paper numbers

The Llama 2 paper is unusually detailed. 77 pages. Meta published benchmarks against both open source and closed source models:

| Benchmark | Llama 2 7B | Llama 2 13B | Llama 2 70B | GPT-3.5 | GPT-4 |
|-----------|-----------|-------------|-------------|---------|-------|
| MMLU (5-shot) | 45.3% | 54.8% | 68.9% | 70.0% | 86.4% |
| HellaSwag | 77.2% | 80.7% | 85.3% | 85.5% | 95.3% |
| ARC Challenge | 53.1% | 59.4% | 64.6% | 85.2% | 96.3% |
| HumanEval | 12.8% | 18.3% | 29.9% | 48.1% | 67.0% |
| TruthfulQA | 33.3% | 41.9% | 44.9% | N/A | N/A |

Source: Llama 2 paper, Tables 3 and 14.

On MMLU, Llama 2 70B hits 68.9% vs GPT-3.5's 70.0%. That's a 1.1 percentage point gap. On HellaSwag, the gap is 0.2 points. These models are essentially tied on common-sense and knowledge benchmarks.

Where the gap persists: coding (HumanEval 29.9% vs 48.1%) and structured reasoning (ARC Challenge 64.6% vs 85.2%). GPT-3.5 is significantly better at code generation and science reasoning.
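The gaps quoted above are plain subtraction in percentage points. A small sketch that recomputes them from the table values (the dictionary and variable names are illustrative):

```python
# (Llama 2 70B, GPT-3.5) scores in percent, copied from the table above
benchmarks = {
    "MMLU (5-shot)": (68.9, 70.0),
    "HellaSwag": (85.3, 85.5),
    "ARC Challenge": (64.6, 85.2),
    "HumanEval": (29.9, 48.1),
}

# Gap in percentage points: how far GPT-3.5 is ahead of Llama 2 70B
gaps = {name: round(gpt - llama, 1) for name, (llama, gpt) in benchmarks.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: GPT-3.5 ahead by {gap} points")
```

Sorting by gap makes the split obvious: two benchmarks with a gap around one point or less, and two with a gap of roughly twenty.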

And GPT-4 remains in a completely different league. 86.4% MMLU vs 68.9%. That's not closing anytime soon.

The Llama 2 size comparisons

What really tells the story is comparing across Llama 2 sizes:

| Model | Parameters | MMLU | HellaSwag | HumanEval | Training tokens |
|-------|-----------|------|-----------|-----------|----------------|
| Llama 2 7B | 6.7B | 45.3% | 77.2% | 12.8% | 2.0T |
| Llama 2 13B | 13.0B | 54.8% | 80.7% | 18.3% | 2.0T |
| Llama 2 70B | 69.0B | 68.9% | 85.3% | 29.9% | 2.0T |
| Llama 1 65B | 65.2B | 63.4% | 84.2% | 23.7% | 1.4T |

Source: Llama 2 paper and original LLaMA paper.

Llama 2 70B vs Llama 1 65B: nearly the same parameter count, but trained on 2.0T tokens instead of 1.4T (43% more data). MMLU jumped from 63.4% to 68.9% (+5.5 points), HumanEval from 23.7% to 29.9% (+6.2 points).
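The arithmetic behind those three numbers, as a short sketch using the table values (names are illustrative):

```python
# Values copied from the size-comparison table above
llama1_65b = {"MMLU": 63.4, "HumanEval": 23.7, "tokens_T": 1.4}
llama2_70b = {"MMLU": 68.9, "HumanEval": 29.9, "tokens_T": 2.0}

# Relative increase in training tokens, in percent
token_increase_pct = (llama2_70b["tokens_T"] / llama1_65b["tokens_T"] - 1) * 100

# Absolute benchmark gains, in percentage points
mmlu_gain = round(llama2_70b["MMLU"] - llama1_65b["MMLU"], 1)
humaneval_gain = round(llama2_70b["HumanEval"] - llama1_65b["HumanEval"], 1)

print(f"Training tokens: +{token_increase_pct:.0f}%")
print(f"MMLU: +{mmlu_gain} points, HumanEval: +{humaneval_gain} points")
```

The token increase is a relative change while the benchmark gains are absolute percentage points, so the two should not be compared directly.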

More data, similar model size, better results. The core Chinchilla lesson, that training token count matters as much as parameter count, keeps proving right. These models still had headroom at 1.4T tokens. More tokens is the cheapest path to better performance.

The commercial license changes everything

Here's what matters most for the market. Llama 1 leaked, which meant using it was legally questionable. Llama 2 has an actual commercial license:

| Feature | Llama 1 | Llama 2 |
|---------|---------|---------|
| License | Research only | Commercial (with restrictions) |
| Monthly active user limit | N/A | 700M (above this, need Meta permission) |
| Redistribution | Not allowed | Allowed |
| Fine-tuning for commercial use | Gray area | Explicitly permitted |
| Hosting on inference platforms | Legally risky | Explicitly allowed |

Source: Meta AI Llama 2 license.

The 700M MAU restriction matters only for companies the size of Google or Amazon. For everyone else, this is effectively an open source model with a commercial license. Together AI, Anyscale, and Hugging Face all had hosted inference available within hours of the launch.

What this means for pricing

If Llama 2 70B matches GPT-3.5-turbo quality on most tasks, and you can run it on your own hardware, the API pricing for GPT-3.5-turbo ($0.002/1K tokens) becomes the ceiling, not the floor.

Early hosting prices for Llama 2 70B:

| Provider | Price ($/1K tokens) | Comparison to GPT-3.5 |
|----------|--------------------|-----------------------|
| Together AI | $0.0009 | 55% cheaper |
| Anyscale | $0.0010 | 50% cheaper |
| Self-hosted (A100) | ~$0.0004 | 80% cheaper |
| OpenAI GPT-3.5-turbo | $0.0020 | Baseline |

Sources: Provider pricing pages, July 2023. Self-hosted estimate based on A100 rental at $1.50/hr and ~2,500 tokens/second throughput.
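To sanity-check the pricing math, here is a rough sketch of both calculations: the discount relative to GPT-3.5-turbo, and a self-hosted cost per 1K tokens derived from an hourly GPU rate and throughput. The `gpu_count` and `utilization` parameters are hypothetical knobs added for illustration; they are not part of the estimate stated above.

```python
GPT35_PRICE = 0.0020  # $/1K tokens, baseline from the table above

# Hosted prices in $/1K tokens, copied from the table above
hosted_prices = {
    "Together AI": 0.0009,
    "Anyscale": 0.0010,
    "Self-hosted (A100)": 0.0004,
}


def percent_cheaper(price, baseline=GPT35_PRICE):
    """Discount versus the baseline, as a whole-number percentage."""
    return round((1 - price / baseline) * 100)


def self_hosted_cost_per_1k(hourly_rate, tokens_per_second,
                            gpu_count=1, utilization=1.0):
    """Dollar cost per 1K generated tokens for rented GPUs.

    gpu_count and utilization are assumptions for illustration: a real
    deployment rarely keeps a GPU busy 100% of the time.
    """
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_rate * gpu_count / (tokens_per_hour / 1000)


for provider, price in hosted_prices.items():
    print(f"{provider}: {percent_cheaper(price)}% cheaper than GPT-3.5-turbo")

# Stated inputs taken at face value: one A100 at $1.50/hr, 2,500 tokens/s
print(f"${self_hosted_cost_per_1k(1.50, 2500):.5f} per 1K tokens")
```

With the stated inputs taken at face value for a single fully utilized GPU, the formula gives about $0.00017 per 1K tokens, which is lower than the ~$0.0004 in the table. A 70B model in 16-bit precision does not fit on one 80GB A100, so a figure closer to the table would follow from setting `gpu_count` to 2 or more, or `utilization` below 1.0; which of these the original estimate reflects is not stated.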

At half the price for comparable quality, the value proposition flips. Why would you pay $0.002 for GPT-3.5-turbo when you can get Llama 2 70B for $0.0009?

The answer, for now: reliability, uptime, fine-tuning tools, and the GPT-3.5 advantage in code. But those moats are eroding fast.

Open source just got very, very real.

-- dataku
