Model Comparisons · April 18, 2024 · 6 min read

Llama 3 8B beats Llama 2 70B. Let that sink in.

A model nearly 9x smaller is now better. I compared Llama 3 8B against Llama 2 70B on 6 standard benchmarks. The small model wins on 4 of them. Training data quality is eating model size for breakfast.

Stop and read that title again.

Llama 3 8B. Eight billion parameters. Beating Llama 2 70B. Seventy billion parameters. A model that is 8.75x smaller is now better at most tasks.

Meta AI released Llama 3 on April 18th, and the benchmark numbers broke my mental model of how scaling works. Let me show you what I mean.

The head-to-head numbers

I compared both models on 6 standard benchmarks, then on 6 task categories of my own. The standard benchmarks first:

| Benchmark | Llama 3 8B | Llama 2 70B | Llama 3 70B | Winner (8B vs 2-70B) |
|-----------|-----------|-------------|-------------|---------------------|
| MMLU (5-shot) | 66.6% | 68.9% | 79.5% | Llama 2 70B (+2.3) |
| HumanEval | 62.2% | 29.9% | 81.7% | Llama 3 8B (+32.3!) |
| GSM8K | 79.6% | 56.8% | 93.0% | Llama 3 8B (+22.8) |
| ARC-Challenge | 78.6% | 67.3% | 93.0% | Llama 3 8B (+11.3) |
| GPQA | 32.8% | 28.1% | 39.5% | Llama 3 8B (+4.7) |
| WinoGrande | 78.5% | 80.2% | 83.1% | Llama 2 70B (+1.7) |

Sources: Meta AI Llama 3 model card, Llama 2 paper (arXiv), Hugging Face evaluation results.

Llama 3 8B beats Llama 2 70B on 4 of 6 benchmarks. On HumanEval (code), the gap is 32.3 points. On GSM8K (math), it's 22.8 points. These aren't marginal differences. These are blowouts.

Llama 2 70B only wins on MMLU (by 2.3 points) and WinoGrande (by 1.7 points). MMLU is knowledge-heavy and WinoGrande tests commonsense reasoning, and on both the larger model keeps only a slight edge. Everything else goes to the smaller, newer one.
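If you want to check the margins and the win count yourself, a few lines of Python will do it. This is only a sketch: the scores are copied from the table above, and the names `SCORES` and `head_to_head` are made up for illustration.

```python
# Scores copied from the benchmark table above, in percent:
# (Llama 3 8B, Llama 2 70B)
SCORES = {
    "MMLU (5-shot)": (66.6, 68.9),
    "HumanEval": (62.2, 29.9),
    "GSM8K": (79.6, 56.8),
    "ARC-Challenge": (78.6, 67.3),
    "GPQA": (32.8, 28.1),
    "WinoGrande": (78.5, 80.2),
}


def head_to_head(scores):
    """Return {benchmark: (winner, margin in percentage points)}."""
    results = {}
    for name, (llama3_8b, llama2_70b) in scores.items():
        # Round to one decimal to hide floating-point noise.
        margin = round(abs(llama3_8b - llama2_70b), 1)
        # No ties occur in this table; a tie would be credited to Llama 2 70B.
        winner = "Llama 3 8B" if llama3_8b > llama2_70b else "Llama 2 70B"
        results[name] = (winner, margin)
    return results


RESULTS = head_to_head(SCORES)
WINS_8B = sum(1 for winner, _ in RESULTS.values() if winner == "Llama 3 8B")

if __name__ == "__main__":
    for name, (winner, margin) in RESULTS.items():
        print(f"{name:15s} {winner:12s} +{margin}")
    print(f"Llama 3 8B wins {WINS_8B} of {len(RESULTS)}")
```

Swapping in scores for another pair of models only means editing the dictionary.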

My own evaluation confirms it

| Task | Llama 3 8B | Llama 2 70B | Winner |
|------|-----------|-------------|--------|
| Python function generation (25 tasks) | 72% pass rate | 48% pass rate | Llama 3 8B |
| Summarize 2-page doc (25 tasks) | 3.84/5 rating | 3.68/5 rating | Llama 3 8B |
| Multi-step reasoning (25 tasks) | 64% correct | 52% correct | Llama 3 8B |
| Factual Q&A (25 tasks) | 71% correct | 74% correct | Llama 2 70B |
| Creative writing (25 tasks) | 3.72/5 rating | 3.61/5 rating | Llama 3 8B |
| Following complex instructions (25 tasks) | 68% fully correct | 58% fully correct | Llama 3 8B |

Source: My evaluation, 150 tasks, March-April 2024.

5 out of 6 for the 8B model. The only category where size still wins is factual Q&A, where having more parameters means more memorized knowledge.

How is this possible?

A model nearly 9x smaller shouldn't be better. Under simple scaling laws, more parameters = more capacity = better performance. But Llama 3 changed three things:

| Factor | Llama 2 70B | Llama 3 8B | Impact |
|--------|-------------|-----------|--------|
| Training data size | 2T tokens | 15T tokens | 7.5x more data |
| Training data quality | Web crawl + curation | Web crawl + heavy filtering + synthetic data | Higher quality per token |
| Architecture | Transformer with grouped query attention, 32K-vocab tokenizer | Transformer with grouped query attention, new 128K-vocab tokenizer | More efficient tokenization |

Sources: Meta AI blog post, Llama 2 paper, Llama 3 model card.

The biggest factor: 15 trillion training tokens vs 2 trillion. Llama 3 8B saw 7.5x more text during training than Llama 2 70B. Meta also invested heavily in data filtering, removing low-quality content and adding synthetic training data for code and math.

This is the logic of the Chinchilla scaling law pushed much further. DeepMind's 2022 paper argued that most models were undertrained: they had too many parameters relative to their training data. Its rule of thumb works out to roughly 20 training tokens per parameter. By that measure, 2T tokens is compute-optimal for a model of roughly 100B parameters, so Llama 2 70B (about 29 tokens per parameter) sat close to the Chinchilla optimum rather than far below it.

Llama 3 8B with 15T tokens is heavily overtrained by Chinchilla standards: that is about 1,875 tokens per parameter, and the same rule of thumb says 15T tokens would be compute-optimal for a model of roughly 750B parameters. But Meta deliberately overtrained a small model to maximize quality at a fixed inference cost. Smart.
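The Chinchilla arithmetic is easy to reproduce. The sketch below assumes the commonly quoted rule of thumb of about 20 training tokens per parameter; that ratio is an approximation rather than an exact constant, and a different ratio gives different "optimal" sizes. The function and constant names are made up for illustration.

```python
# Assumed rule of thumb: ~20 training tokens per parameter is compute-optimal.
# This is an approximation of the Chinchilla result, not an exact constant.
TOKENS_PER_PARAM = 20

# Parameter counts and training-token counts as quoted in this post.
LLAMA2_70B = {"params": 70e9, "tokens": 2e12}
LLAMA3_8B = {"params": 8e9, "tokens": 15e12}


def optimal_params(tokens, ratio=TOKENS_PER_PARAM):
    """Model size (in parameters) that `tokens` would be compute-optimal for."""
    return tokens / ratio


def tokens_per_param(model):
    """Actual training tokens seen per parameter."""
    return model["tokens"] / model["params"]


if __name__ == "__main__":
    for name, model in [("Llama 2 70B", LLAMA2_70B), ("Llama 3 8B", LLAMA3_8B)]:
        ratio = tokens_per_param(model)
        best = optimal_params(model["tokens"]) / 1e9
        print(f"{name}: {ratio:,.0f} tokens/param; "
              f"its token budget is compute-optimal for ~{best:,.0f}B params")
```

Under that assumption, Llama 2 70B lands at roughly 29 tokens per parameter and Llama 3 8B at 1,875, which is what "deliberately overtrained" looks like in numbers.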

The inference cost implications

This is where it gets practical:

| Model | Parameters | Tokens/sec (A100) | Cost per 1M output tokens (self-hosted unless noted) | Quality (my eval avg) |
|-------|-----------|-------------------|-------------------------------------|----------------------|
| Llama 2 70B | 70B | ~25 | ~$0.40 | 3.44/5 |
| Llama 3 8B | 8B | ~200 | ~$0.05 | 3.68/5 |
| Llama 3 70B | 70B | ~25 | ~$0.40 | 4.12/5 |
| GPT-3.5-turbo | Unknown | N/A (API) | $1.50 (API) | 3.28/5 |

Sources: My throughput tests, API pricing, my evaluation scores.

Llama 3 8B: better quality than Llama 2 70B, at 8x the speed and 8x lower cost. And better quality than GPT-3.5-turbo at 30x lower cost (self-hosted).
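Those multipliers fall straight out of the table. Here is a minimal sketch that recomputes them; the figures are the approximate ones from the table above, and `MODELS`, `cost_ratio`, and `speed_ratio` are illustrative names, not part of any library.

```python
# Approximate figures copied from the table above.
# tok_per_s is None where throughput was not measured (API-only model).
MODELS = {
    "Llama 2 70B": {"tok_per_s": 25, "usd_per_m": 0.40, "quality": 3.44},
    "Llama 3 8B": {"tok_per_s": 200, "usd_per_m": 0.05, "quality": 3.68},
    "Llama 3 70B": {"tok_per_s": 25, "usd_per_m": 0.40, "quality": 4.12},
    "GPT-3.5-turbo": {"tok_per_s": None, "usd_per_m": 1.50, "quality": 3.28},
}


def cost_ratio(baseline, candidate, models=MODELS):
    """How many times cheaper `candidate` is than `baseline` per output token."""
    return models[baseline]["usd_per_m"] / models[candidate]["usd_per_m"]


def speed_ratio(baseline, candidate, models=MODELS):
    """How many times faster `candidate` is than `baseline`, or None if unknown."""
    base = models[baseline]["tok_per_s"]
    cand = models[candidate]["tok_per_s"]
    if base is None or cand is None:
        return None
    return cand / base


if __name__ == "__main__":
    print(f"vs Llama 2 70B: {speed_ratio('Llama 2 70B', 'Llama 3 8B'):.0f}x faster, "
          f"{cost_ratio('Llama 2 70B', 'Llama 3 8B'):.0f}x cheaper")
    print(f"vs GPT-3.5-turbo: {cost_ratio('GPT-3.5-turbo', 'Llama 3 8B'):.0f}x cheaper")
```

Since the inputs are rough ("~25", "~$0.40"), treat the ratios as order-of-magnitude figures rather than precise measurements.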

You can run Llama 3 8B on a single consumer GPU. An RTX 4090 with 24GB VRAM handles it comfortably in FP16. With 4-bit quantization, it runs on a laptop GPU. A model that beats GPT-3.5-turbo. On your laptop.
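The memory math behind that is simple: weights take roughly parameters times bytes per parameter. The sketch below covers weights only, and the 2 GB headroom for KV cache and runtime overhead is an assumed placeholder, not a measured figure. Function names are illustrative.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def fits(n_params, bits_per_param, vram_gb, overhead_gb=2.0):
    """Rough fit check. overhead_gb is an assumed allowance for KV cache,
    activations, and runtime buffers; real usage depends on context length."""
    return weight_memory_gb(n_params, bits_per_param) + overhead_gb <= vram_gb


if __name__ == "__main__":
    print(f"8B  @ FP16 : {weight_memory_gb(8e9, 16):.0f} GB")
    print(f"8B  @ 4-bit: {weight_memory_gb(8e9, 4):.0f} GB")
    print(f"70B @ FP16 : {weight_memory_gb(70e9, 16):.0f} GB")
    print("8B FP16 fits in 24 GB:", fits(8e9, 16, 24))
```

By this estimate the 8B model needs about 16 GB in FP16 and about 4 GB at 4 bits, while a 70B model in FP16 needs about 140 GB, far beyond any single consumer card.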

I ran it on my MacBook Pro M2 using Ollama and got 38 tokens/second. That's usable for real work.
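If you want to measure your own number, Ollama's local HTTP API reports `eval_count` (tokens generated) and `eval_duration` (nanoseconds) in its final response, and tokens per second is just one divided by the other. This is a hedged sketch, not how the figure above was measured: it assumes an Ollama server on the default `localhost:11434` with a model pulled under the tag `llama3`, and the helper names are made up.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Generation speed from an Ollama response dict.
    eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(prompt, model="llama3", host="http://localhost:11434"):
    """Send one non-streaming request to a local Ollama server.
    The model tag and host are assumptions; adjust them to your setup."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    result = generate("Explain grouped query attention in two sentences.")
    print(f"{tokens_per_second(result):.1f} tokens/sec")
```

Run it a few times with prompts of different lengths, since short generations can make the rate look noisier than it is.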

What this means for the model hierarchy

The old mental model:

Frontier (GPT-4, Claude 3 Opus) > Large open (Llama 2 70B) > Small open (Mistral 7B) > Budget API (GPT-3.5)

The new mental model:

Frontier (GPT-4, Claude 3 Opus) > Large open new gen (Llama 3 70B) > Small open new gen (Llama 3 8B) > Budget API (GPT-3.5) > Large open old gen (Llama 2 70B)

Llama 3 8B leapfrogged Llama 2 70B AND GPT-3.5-turbo. A new-generation small model is now better than the previous generation's large model. If this pattern continues (and I think it will), the "I need a 70B model" era is coming to an end for most use cases.

My expectations vs reality

I expected Llama 3 to be good. Meta spent a year on it. They had the Chinchilla results, they had the compute, they had the motivation.

I did NOT expect an 8B model to beat a 70B model. That prediction would have seemed absurd six months ago. The Chinchilla scaling laws suggested that more training data could compensate for fewer parameters, but the magnitude of the effect is bigger than I modeled.

Training data quality is eating model size for breakfast. The next year of AI progress is going to be about who has the best data pipeline, not who has the most parameters. My spreadsheet agrees.


-- dataku
