LLaMA leaked. Here's what Meta's model weights actually look like.

So that happened fast.

Meta AI released the LLaMA paper on February 24 with a form you could fill out to request access for research purposes. The weights showed up on a torrent site within a week. By March 3, they were on Hugging Face and anyone with enough disk space could download them.

I won't get into the ethics of the leak. I will get into the benchmarks.

LLaMA-13B vs. GPT-3.5-turbo: my 5-task comparison

I ran both models through the same 100 prompts: 20 in each of 5 task categories. For LLaMA-13B, I used Together AI's hosted inference. For GPT-3.5-turbo, the standard OpenAI API. Same prompts, same evaluation criteria.

| Task | Prompts | LLaMA-13B win rate | GPT-3.5 win rate | Tie rate |
|------|---------|-------------------|------------------|----------|
| Factual Q&A | 20 | 30% | 55% | 15% |
| Creative writing | 20 | 35% | 40% | 25% |
| Code generation (Python) | 20 | 20% | 65% | 15% |
| Summarization | 20 | 40% | 35% | 25% |
| Reasoning (logic puzzles) | 20 | 15% | 70% | 15% |
| Overall | 100 | 28% | 53% | 19% |

Evaluation: blind rating by me and two friends. We didn't know which response came from which model.
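
For anyone who wants to run a similar blind comparison, here is a minimal sketch of how it could be set up. The function names (`blind_pair`, `resolve`) and the majority-vote rule are assumptions for illustration; the post only says three people rated blind, not how the three ratings were combined.

```python
import random
from collections import Counter

def blind_pair(llama_text, gpt_text, rng):
    """Return the two responses in random A/B order, plus a hidden key for unblinding."""
    pair = [("llama", llama_text), ("gpt", gpt_text)]
    rng.shuffle(pair)
    shown = {"A": pair[0][1], "B": pair[1][1]}  # what raters see
    key = {"A": pair[0][0], "B": pair[1][0]}    # kept away from raters
    return shown, key

def resolve(votes, key):
    """Combine rater votes ("A", "B" or "tie") by strict majority.

    Assumption: anything short of a strict majority for one side counts as a tie.
    """
    label, count = Counter(votes).most_common(1)[0]
    if label == "tie" or count <= len(votes) / 2:
        return "tie"
    return key[label]

rng = random.Random(0)
shown, key = blind_pair("response from LLaMA", "response from GPT-3.5", rng)
print(resolve(["A", "A", "B"], key))  # name of whichever model was shown as "A"
```

Shuffling per prompt matters: if one model is always "A", raters start to recognize its style by position.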

GPT-3.5 wins overall. That's expected. It's a much larger model with RLHF fine-tuning, and LLaMA-13B is a raw base model with no instruction tuning.

But look at summarization. LLaMA-13B won 40% of head-to-head matchups against GPT-3.5-turbo on summarization. A free, 13B-parameter model. Against a commercial API product.

And creative writing was close: 35% vs 40%, with 25% ties.
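
As a sanity check, the Overall row can be recomputed from the per-task rows as a prompt-weighted average. The numbers below are copied from the table above; the helper names are just for illustration.

```python
# (task, prompts, LLaMA-13B win %, GPT-3.5 win %, tie %), copied from the table
rows = [
    ("Factual Q&A", 20, 30, 55, 15),
    ("Creative writing", 20, 35, 40, 25),
    ("Code generation (Python)", 20, 20, 65, 15),
    ("Summarization", 20, 40, 35, 25),
    ("Reasoning (logic puzzles)", 20, 15, 70, 15),
]

def overall(rows):
    """Prompt-weighted average of the three percentage columns."""
    total = sum(r[1] for r in rows)
    return tuple(sum(r[1] * r[i] for r in rows) / total for i in (2, 3, 4))

def win_counts(row):
    """Convert a row's percentages back into raw prompt counts."""
    _, n, llama, gpt, tie = row
    return n * llama // 100, n * gpt // 100, n * tie // 100

print(overall(rows))        # (28.0, 53.0, 19.0)
print(win_counts(rows[3]))  # summarization: (8, 7, 5)
```

In raw counts, the summarization result is 8 wins to 7 with 5 ties out of 20 prompts.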

The benchmark numbers from Meta's paper

Meta published extensive benchmarks in the LLaMA paper. Here's how LLaMA compares across sizes:

| Benchmark | LLaMA 7B | LLaMA 13B | LLaMA 33B | LLaMA 65B | GPT-3 175B |
|-----------|----------|-----------|-----------|-----------|------------|
| HellaSwag | 76.1% | 79.2% | 82.8% | 84.2% | 78.9% |
| MMLU (5-shot) | 35.1% | 46.9% | 57.8% | 63.4% | 43.9% |
| ARC Challenge | 47.6% | 52.7% | 57.8% | 60.2% | 51.4% |
| WinoGrande | 70.1% | 73.0% | 76.0% | 77.4% | 70.2% |
| TruthfulQA | 33.3% | 41.7% | 44.4% | 48.7% | 37.3% |
| HumanEval | 10.5% | 15.8% | 21.7% | 23.7% | N/A |

Source: LLaMA paper, Table 3 and Table 9.

LLaMA-13B beats GPT-3 175B on HellaSwag and MMLU, and on the other three benchmarks here that have a GPT-3 score. A 13B model outperforming a 175B model. That's roughly a 13x parameter difference.

And LLaMA-65B beats GPT-3 on every one of those benchmarks by a wider margin. Every single one.
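
Claims like these are easy to check mechanically. The sketch below copies the scores from the table above into a dictionary and lists, for a given LLaMA size, the benchmarks where it scores higher than GPT-3. HumanEval is skipped because the table has no GPT-3 number for it. The `beats_gpt3` helper is hypothetical, not something from the paper.

```python
# Scores in percent, copied from the table above. None means "N/A".
scores = {
    "HellaSwag":     {"7B": 76.1, "13B": 79.2, "33B": 82.8, "65B": 84.2, "GPT-3": 78.9},
    "MMLU (5-shot)": {"7B": 35.1, "13B": 46.9, "33B": 57.8, "65B": 63.4, "GPT-3": 43.9},
    "ARC Challenge": {"7B": 47.6, "13B": 52.7, "33B": 57.8, "65B": 60.2, "GPT-3": 51.4},
    "WinoGrande":    {"7B": 70.1, "13B": 73.0, "33B": 76.0, "65B": 77.4, "GPT-3": 70.2},
    "TruthfulQA":    {"7B": 33.3, "13B": 41.7, "33B": 44.4, "65B": 48.7, "GPT-3": 37.3},
    "HumanEval":     {"7B": 10.5, "13B": 15.8, "33B": 21.7, "65B": 23.7, "GPT-3": None},
}

def beats_gpt3(size):
    """Benchmarks where the given LLaMA size scores strictly higher than GPT-3."""
    return [
        name
        for name, row in scores.items()
        if row["GPT-3"] is not None and row[size] > row["GPT-3"]
    ]

for size in ("7B", "13B", "33B", "65B"):
    print(size, beats_gpt3(size))
```

Going by the table's numbers, 7B comes out behind GPT-3 on all five comparable benchmarks, and 13B and up come out ahead on all five.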

Now, GPT-3 is not GPT-3.5-turbo. The RLHF-tuned model is significantly better. But the raw base model comparison shows how far training efficiency has come. Meta trained smaller models on more tokens (1.0T for 7B and 13B, 1.4T for 33B and 65B), the direction the Chinchilla scaling laws pointed in.
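
One way to put a number on "more tokens, smaller model" is tokens per parameter. The sketch below uses the parameter and token counts from Meta's paper (the same ones listed in the cost table later in this post). The 20 tokens per parameter figure is an assumption: it is the commonly quoted rule of thumb for Chinchilla-style compute-optimal training, not a number from this post.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # assumption: rough rule of thumb

# name: (parameters, training tokens), from Meta's paper
models = {
    "LLaMA 7B": (6.7e9, 1.0e12),
    "LLaMA 13B": (13.0e9, 1.0e12),
    "LLaMA 33B": (32.5e9, 1.4e12),
    "LLaMA 65B": (65.2e9, 1.4e12),
}

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

for name, (params, tokens) in models.items():
    ratio = tokens_per_param(params, tokens)
    multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param ({multiple:.1f}x the rule of thumb)")
```

Under that assumption, the 65B model sits close to the rule of thumb, while 7B and 13B are trained several times past it. That extra training is where the small models' scores come from.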

Cost to replicate

This is what really got my attention. From Meta's paper:

| Model | Parameters | Training tokens | GPU hours (A100-80GB) | Estimated cloud cost |
|-------|-----------|----------------|----------------------|---------------------|
| LLaMA 7B | 6.7B | 1.0T | 82,432 | ~$130K |
| LLaMA 13B | 13.0B | 1.0T | 135,168 | ~$215K |
| LLaMA 33B | 32.5B | 1.4T | 530,432 | ~$850K |
| LLaMA 65B | 65.2B | 1.4T | 1,022,362 | ~$1.6M |

Cloud cost estimates based on A100-80GB at $1.60/hour (AWS on-demand). Meta used their own hardware, so their actual cost is lower.

$215K to train a model that beats GPT-3 on most benchmarks. That's well within reach of a well-funded startup or university lab. A year ago, training a competitive model meant spending $5-10M minimum. Against the 13B estimate, the barrier to entry just dropped by roughly 20-50x.
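
The cost column and the barrier-to-entry ratio are plain arithmetic, so here they are as a small script. The GPU hours and the $1.60/hour rate come from this post; the $5M and $10M figures are the "year ago" range quoted above. Function names are illustrative.

```python
A100_HOURLY_USD = 1.60  # cloud rate assumed in this post

# A100-80GB GPU hours per model, from the table above
GPU_HOURS = {
    "LLaMA 7B": 82_432,
    "LLaMA 13B": 135_168,
    "LLaMA 33B": 530_432,
    "LLaMA 65B": 1_022_362,
}

def cloud_cost(gpu_hours, hourly_rate=A100_HOURLY_USD):
    """Estimated cloud cost in USD for a given number of GPU hours."""
    return gpu_hours * hourly_rate

def barrier_drop(old_cost, new_cost):
    """How many times cheaper new_cost is than old_cost."""
    return old_cost / new_cost

for name, hours in GPU_HOURS.items():
    print(f"{name}: ${cloud_cost(hours):,.0f}")

cost_13b = cloud_cost(GPU_HOURS["LLaMA 13B"])
print(f"vs $5M:  {barrier_drop(5e6, cost_13b):.0f}x")
print(f"vs $10M: {barrier_drop(10e6, cost_13b):.0f}x")
```

At the 13B estimate, $5-10M works out to roughly 23x to 46x. Swap in a different hourly rate to see how sensitive the estimates are to GPU pricing.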

Why the leak matters

The research access form was a speed bump, not a wall. Meta probably knew the weights would leak. (Conspiracy theory? Maybe. But they could have kept this internal if they really wanted to.)

What the leak means in practice:

- Anyone can now fine-tune a strong base model. Stanford Alpaca was built on LLaMA 7B with 52K instruction examples. Cost: under $500 in OpenAI API fees to generate the training data with GPT-3.5, plus under $100 of compute for the fine-tuning itself. Under $600 all in.
- The "moat" argument gets weaker. If a free 13B model matches GPT-3 on standard benchmarks, the value proposition of commercial APIs rests on RLHF quality, reliability, and convenience. Not on raw model capability.
- The open source community now has a baseline to iterate on. GPT-J was okay. BLOOM was okay. LLaMA is actually good. That distinction matters for the derivative models that will follow.

What I'm watching next

LLaMA-13B with instruction tuning. Multiple teams are working on this right now. Once someone properly fine-tunes it with high-quality instruction data, the gap with GPT-3.5-turbo should narrow significantly.

My estimate: within 3 months, a LLaMA-13B derivative will match GPT-3.5 quality on at least 3 of my 5 test categories. The reasoning gap will take longer to close. But for everyday text tasks? The free option is getting close.

The model weights are out there now. The genie isn't going back in the bottle.

-- dataku