LLaMA leaked. Here's what Meta's model weights actually look like.
Meta's LLaMA was supposed to be research-only. It leaked within a week. Now everyone can benchmark it. I ran LLaMA-13B against GPT-3.5 on 5 tasks. The results are closer than Meta probably wanted.
So that happened fast.
Meta AI released the LLaMA paper on February 24 with a form you could fill out to request access for research purposes. The weights showed up on a torrent site within a week. By March 3, they were on Hugging Face and anyone with enough disk space could download them.
I won't get into the ethics of the leak. I will get into the benchmarks.
LLaMA-13B vs. GPT-3.5-turbo: my 5-task comparison
I ran both models through 100 prompts in total, 20 in each of 5 task categories. For LLaMA-13B, I used Together AI's hosted inference; for GPT-3.5-turbo, the standard OpenAI API. Same prompts, same evaluation criteria.
| Task | Prompts | LLaMA-13B win rate | GPT-3.5 win rate | Tie rate |
|------|---------|--------------------|------------------|----------|
| Factual Q&A | 20 | 30% | 55% | 15% |
| Creative writing | 20 | 35% | 40% | 25% |
| Code generation (Python) | 20 | 20% | 65% | 15% |
| Summarization | 20 | 40% | 35% | 25% |
| Reasoning (logic puzzles) | 20 | 15% | 70% | 15% |
| Overall | 100 | 28% | 53% | 19% |
Evaluation: blind rating by me and two friends. We didn't know which response came from which model.
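Here is a minimal sketch of how a blind pairwise rating like this could be set up. The helper names (`blind_pair`, `majority_verdict`) and the strict-majority rule for three raters are assumptions for illustration; the post doesn't say how disagreements between raters were resolved.

```python
import random
from collections import Counter

def blind_pair(llama_text, gpt_text, rng):
    """Return the two responses in random order, plus the hidden key."""
    pair = [("llama", llama_text), ("gpt", gpt_text)]
    rng.shuffle(pair)
    shown = {"A": pair[0][1], "B": pair[1][1]}  # what the raters see
    key = {"A": pair[0][0], "B": pair[1][0]}    # kept hidden until scoring
    return shown, key

def majority_verdict(votes, key):
    """votes: one of 'A', 'B', or 'tie' per rater.

    Assumed rule: a model wins only with a strict majority of votes;
    anything else counts as a tie.
    """
    label, count = Counter(votes).most_common(1)[0]
    if label == "tie" or count <= len(votes) / 2:
        return "tie"
    return key[label]

rng = random.Random(0)  # fixed seed so the shuffle is reproducible
shown, key = blind_pair("response from LLaMA", "response from GPT-3.5", rng)
print(majority_verdict(["A", "A", "tie"], key))  # whichever model was shown as "A"
```

Shuffling per prompt matters: if one model always appears as "A", raters can pick up on its style and the rating is no longer blind.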
GPT-3.5 wins overall. That's expected. It's widely assumed to be a much larger model, it has RLHF fine-tuning, and LLaMA-13B is a raw base model with no instruction tuning.
But look at summarization. LLaMA-13B won 40% of head-to-head matchups against GPT-3.5-turbo on summarization. A free, 13B-parameter model. Against a commercial API product.
And creative writing was close: 35% vs 40%, with 25% ties.
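To put the percentages in perspective, here is a small sketch that converts the per-task win rates from the table back into prompt counts and recomputes the overall row. The data structure and function names are illustrative, not part of the original evaluation.

```python
# Per-task results from the table: (prompts, LLaMA win %, GPT-3.5 win %, tie %)
results = {
    "Factual Q&A": (20, 30, 55, 15),
    "Creative writing": (20, 35, 40, 25),
    "Code generation (Python)": (20, 20, 65, 15),
    "Summarization": (20, 40, 35, 25),
    "Reasoning (logic puzzles)": (20, 15, 70, 15),
}

def counts(task):
    """Convert a task's percentages into whole-prompt counts."""
    n, llama, gpt, tie = results[task]
    return n * llama // 100, n * gpt // 100, n * tie // 100

def overall():
    """Prompt-weighted overall percentages across all tasks."""
    total = sum(row[0] for row in results.values())
    rates = [sum(row[0] * row[i] for row in results.values()) / total
             for i in (1, 2, 3)]
    return total, rates

print(counts("Summarization"))  # (8, 7, 5)
print(overall())                # (100, [28.0, 53.0, 19.0])
```

With 20 prompts per task, the 40% vs 35% summarization result is 8 wins against 7, a one-prompt margin. That's worth keeping in mind when reading the closer categories.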
The benchmark numbers from Meta's paper
Meta published extensive benchmarks in the LLaMA paper. Here's how LLaMA compares across sizes:
| Benchmark | LLaMA 7B | LLaMA 13B | LLaMA 33B | LLaMA 65B | GPT-3 175B |
|-----------|----------|-----------|-----------|-----------|------------|
| HellaSwag | 76.1% | 79.2% | 82.8% | 84.2% | 78.9% |
| MMLU (5-shot) | 35.1% | 46.9% | 57.8% | 63.4% | 43.9% |
| ARC Challenge | 47.6% | 52.7% | 57.8% | 60.2% | 51.4% |
| WinoGrande | 70.1% | 73.0% | 76.0% | 77.4% | 70.2% |
| TruthfulQA | 33.3% | 41.7% | 44.4% | 48.7% | 37.3% |
| HumanEval | 10.5% | 15.8% | 21.7% | 23.7% | N/A |
Source: LLaMA paper, Table 3 and Table 9.
LLaMA-13B beats GPT-3 175B on HellaSwag and MMLU. A 13B model outperforming a 175B model. That's roughly a 13x parameter difference.
In fact, going by this table, every LLaMA size from 13B up beats GPT-3 on every benchmark where GPT-3 has a score. Every single one.
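A quick way to check claims like these against the table is to count, for each LLaMA size, the benchmarks where it scores above GPT-3. This sketch simply transcribes the table values; the dictionary layout and the `beats_gpt3` helper are illustrative.

```python
# Scores in percent, transcribed from the table. HumanEval is left out
# because the table lists no GPT-3 score for it.
scores = {
    "HellaSwag":     {"7B": 76.1, "13B": 79.2, "33B": 82.8, "65B": 84.2, "GPT-3": 78.9},
    "MMLU (5-shot)": {"7B": 35.1, "13B": 46.9, "33B": 57.8, "65B": 63.4, "GPT-3": 43.9},
    "ARC Challenge": {"7B": 47.6, "13B": 52.7, "33B": 57.8, "65B": 60.2, "GPT-3": 51.4},
    "WinoGrande":    {"7B": 70.1, "13B": 73.0, "33B": 76.0, "65B": 77.4, "GPT-3": 70.2},
    "TruthfulQA":    {"7B": 33.3, "13B": 41.7, "33B": 44.4, "65B": 48.7, "GPT-3": 37.3},
}

def beats_gpt3(size):
    """Benchmarks where the given LLaMA size scores above GPT-3 175B."""
    return [name for name, row in scores.items() if row[size] > row["GPT-3"]]

for size in ("7B", "13B", "33B", "65B"):
    print(f"LLaMA {size}: ahead on {len(beats_gpt3(size))} of {len(scores)}")
# LLaMA 7B: ahead on 0 of 5
# LLaMA 13B: ahead on 5 of 5
# LLaMA 33B: ahead on 5 of 5
# LLaMA 65B: ahead on 5 of 5
```

Some of the 13B margins are thin (0.3 points on HellaSwag), so "ahead" here means ahead in the table, not necessarily by a meaningful amount.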
Now, GPT-3 is not GPT-3.5-turbo. The RLHF-tuned model is significantly better. But the raw base model comparison shows how far training efficiency has come. Meta trained LLaMA on more tokens (1.0T for the 7B and 13B models, 1.4T for 33B and 65B) with a smaller model, in the direction the Chinchilla scaling laws suggested they should.
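One rough way to see the "more tokens, smaller model" point is tokens per parameter. The sketch below uses the parameter and token counts from the training table later in this post. Two numbers are assumptions that do not come from this post: GPT-3's roughly 300B training tokens, and the commonly quoted Chinchilla rule of thumb of about 20 tokens per parameter.

```python
# (parameters, training tokens) from the training table later in this post
llama = {
    "LLaMA 7B": (6.7e9, 1.0e12),
    "LLaMA 13B": (13.0e9, 1.0e12),
    "LLaMA 33B": (32.5e9, 1.4e12),
    "LLaMA 65B": (65.2e9, 1.4e12),
}

# Assumption (not from this post): GPT-3 175B saw about 300B tokens.
gpt3 = (175e9, 300e9)

# Assumption: rule-of-thumb reading of the Chinchilla result.
CHINCHILLA_TOKENS_PER_PARAM = 20

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

for name, (params, tokens) in llama.items():
    print(f"{name}: {tokens_per_param(params, tokens):.1f} tokens/param")
print(f"GPT-3 175B: {tokens_per_param(*gpt3):.1f} tokens/param")
print(f"Chinchilla rule of thumb: ~{CHINCHILLA_TOKENS_PER_PARAM} tokens/param")
```

Under those assumptions, LLaMA-13B lands around 77 tokens per parameter against under 2 for GPT-3, which is the gap the paragraph above is pointing at.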
Cost to replicate
This is what really got my attention. From Meta's paper:
| Model | Parameters | Training tokens | GPU hours (A100-80GB) | Estimated cloud cost |
|-------|------------|-----------------|-----------------------|----------------------|
| LLaMA 7B | 6.7B | 1.0T | 82,432 | ~$130K |
| LLaMA 13B | 13.0B | 1.0T | 135,168 | ~$215K |
| LLaMA 33B | 32.5B | 1.4T | 530,432 | ~$850K |
| LLaMA 65B | 65.2B | 1.4T | 1,022,362 | ~$1.6M |
Cloud cost estimates based on A100-80GB at $1.60/hour (AWS on-demand). Meta used their own hardware, so their actual cost is lower.
$215K to train a model that beats GPT-3 on most benchmarks. That's well within reach of a well-funded startup or university lab. A year ago, training a competitive model meant spending $5-10M minimum. At the 13B scale, the barrier to entry just dropped by roughly 20-50x.
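The cost column is just GPU hours times an hourly rate, so it's easy to re-run with a different price. A minimal sketch, using the $1.60/hour rate and the GPU hours quoted above; the `cloud_cost` helper is illustrative, and the $5-10M comparison figures are the post's own estimate.

```python
RATE_USD_PER_GPU_HOUR = 1.60  # the A100-80GB rate assumed in this post

gpu_hours = {
    "LLaMA 7B": 82_432,
    "LLaMA 13B": 135_168,
    "LLaMA 33B": 530_432,
    "LLaMA 65B": 1_022_362,
}

def cloud_cost(hours, rate=RATE_USD_PER_GPU_HOUR):
    """Estimated cloud cost in USD for a given number of GPU hours."""
    return hours * rate

for name, hours in gpu_hours.items():
    print(f"{name}: ${cloud_cost(hours):,.0f}")

# How far the barrier dropped, relative to the 13B estimate
cost_13b = cloud_cost(gpu_hours["LLaMA 13B"])
for previous in (5_000_000, 10_000_000):
    print(f"${previous:,} vs 13B: {previous / cost_13b:.0f}x")
# $5,000,000 vs 13B: 23x
# $10,000,000 vs 13B: 46x
```

Pass a different `rate` to `cloud_cost` to see how sensitive the estimate is to pricing; the ratio scales linearly with whatever hourly rate you plug in.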
Why the leak matters
The research access form was a speed bump, not a wall. Meta probably knew the weights would leak. (Conspiracy theory? Maybe. But they could have kept this internal if they really wanted to.)
What the leak means in practice:
- Anyone can now fine-tune a strong base model. Stanford Alpaca was built on LLaMA 7B with 52K instruction examples. Cost: under $500 in OpenAI API fees to generate the training data with text-davinci-003, plus under $100 of compute for the fine-tuning itself. Under $600 in total.
- The "moat" argument gets weaker. If a free 13B model matches GPT-3 on standard benchmarks, the value proposition of commercial APIs rests on RLHF quality, reliability, and convenience. Not on raw model capability.
- The open source community now has a baseline to iterate on. GPT-J was okay. BLOOM was okay. LLaMA is actually good. That distinction matters for the derivative models that will follow.
What I'm watching next
LLaMA-13B with instruction tuning. When someone properly fine-tunes it with high-quality instruction data (multiple teams are working on this right now), the gap with GPT-3.5-turbo should narrow significantly.
My estimate: within 3 months, a LLaMA-13B derivative will match GPT-3.5 quality on at least 3 of my 5 test categories. The reasoning gap will take longer to close. But for everyday text tasks? The free option is getting close.
The model weights are out there now. The genie isn't going back in the bottle.
-- dataku