Model ComparisonsSeptember 27, 20236 min read

Mistral 7B just beat Llama 2 13B. Small models are getting weird.

A 7B parameter model outperforming a 13B model shouldn't be possible under simple scaling laws. But Mistral did it. I compared the benchmarks and the architecture differences that explain how.

Mistral AI released their first model on September 27, and it broke my mental model of how scaling works.

Mistral 7B. Seven billion parameters. It outperforms Llama 2 13B on basically every benchmark. A model with half the parameters beating a model with double its size, from a startup that's only months old.

Let me show you the numbers.

Head-to-head benchmark comparison

| Benchmark | Mistral 7B | Llama 2 7B | Llama 2 13B | Mistral vs Llama 2 13B | |-----------|-----------|-----------|-------------|----------------------| | MMLU (5-shot) | 60.1% | 45.3% | 54.8% | Mistral wins (+5.3 pts) | | HellaSwag | 81.3% | 77.2% | 80.7% | Mistral wins (+0.6 pts) | | ARC Challenge | 55.5% | 53.1% | 59.4% | Llama 2 13B wins (+3.9 pts) | | WinoGrande | 75.3% | 70.1% | 73.0% | Mistral wins (+2.3 pts) | | TruthfulQA | 42.2% | 33.3% | 41.9% | Mistral wins (+0.3 pts) | | HumanEval (coding) | 30.5% | 12.8% | 18.3% | Mistral wins (+12.2 pts) | | GSM8K (math) | 35.4% | 14.6% | 24.3% | Mistral wins (+11.1 pts) |

Sources: Mistral AI blog post, Hugging Face evaluations, Llama 2 paper.

Mistral 7B wins on 6 of 7 benchmarks against Llama 2 13B. The only loss is ARC Challenge, and even there the gap is small.

On MMLU, Mistral 7B scores 60.1% vs Llama 2 13B's 54.8%. That's a 5.3-point advantage for a model that's half the size. On HumanEval (coding), the gap is massive: 30.5% vs 18.3%.

Something interesting is happening with small models.

How is this possible?

Under the original GPT-3 scaling laws (Kaplan et al.), more parameters = better performance. A 13B model should beat a 7B model, period.

But the Chinchilla scaling laws added an important caveat: the ratio of parameters to training data matters. Most models before Chinchilla were undertrained relative to their size. You can get better results by making the model smaller and training it on more data.

I don't know Mistral's training details (they haven't published a paper), but the likely explanation involves:

| Factor | Llama 2 13B | Mistral 7B (estimated) | |--------|-------------|----------------------| | Parameters | 13.0B | 7.2B | | Training tokens | 2.0T | Likely 2T+ | | Tokens/parameter ratio | 154 | ~278+ | | Architecture optimizations | Standard Transformer | Grouped-query attention, sliding window attention |

The tokens-per-parameter ratio tells the story. If Mistral trained on as many or more tokens than Llama 2 but with half the parameters, the resulting model is more "data-efficient." Every parameter encodes more information.

The architecture innovations

Mistral 7B uses two techniques that reduce compute costs without reducing quality:

Grouped-Query Attention (GQA): Instead of separate key-value heads for each attention head, GQA shares them across groups. This reduces memory usage and speeds up inference by roughly 1.4x with minimal quality loss.

Sliding Window Attention: Instead of attending to the full context at every layer, each layer only attends to a fixed window (4,096 tokens). But because the model has 32 layers, information from the full context can still propagate through the stack. The effective context length is much larger than the window size.

The result:

| Metric | Llama 2 7B | Mistral 7B | Improvement | |--------|-----------|-----------|-------------| | Inference speed (tokens/s, A100) | ~120 | ~150 | +25% | | Memory usage (FP16) | 14.2GB | 14.5GB | Similar | | Memory usage (4-bit quantized) | ~4.2GB | ~4.3GB | Similar | | Effective context length | 4K | 8K+ (sliding window) | 2x+ |

Source: Community benchmarks from Hugging Face, my testing on Together AI.

So Mistral 7B is faster to run and supports longer context than Llama 2 7B, while being significantly more capable. The inference cost per quality unit is better on every dimension.

What this means for the "small model" category

I used to think of 7B models as the "demo tier." Good for prototyping, not good enough for production. Mistral 7B changes that.

| Use case | Llama 2 13B quality | Mistral 7B quality | Can Mistral replace 13B? | |----------|--------------------|--------------------|--------------------------| | Chatbot | Good | Good+ | Yes | | Code generation | Weak | Moderate | Yes (and better) | | Summarization | Good | Good | Yes | | Math/reasoning | Weak | Moderate | Yes (and better) | | Translation | Good | Good | Yes |

For most production use cases, there's no longer a reason to run a 13B model when a 7B model does the same job (or better) at lower cost.

The cost implications are significant:

| Model | GPU memory needed | Can run on RTX 4090? | Inference cost (cloud) | |-------|------------------|---------------------|----------------------| | Llama 2 7B | ~14GB FP16 | Yes | ~$0.0004/1K tokens | | Mistral 7B | ~14.5GB FP16 | Yes | ~$0.0004/1K tokens | | Llama 2 13B | ~26GB FP16 | Barely (quantized) | ~$0.0006/1K tokens |

If Mistral 7B matches Llama 2 13B quality at 7B cost, you save roughly 33% on inference for the same output quality. At scale, that's real money.

The Mistral AI story

Mistral AI is a Paris-based startup founded in May 2023 by ex-Google DeepMind and ex-Meta researchers. They raised $113 million at seed stage, which was one of the largest seed rounds in European tech history.

From founding to first model release: 4 months. From $113M raise to a model that beats Llama 2 13B: 4 months. That's an extraordinary pace, and it suggests they had the training pipeline mostly figured out before the company was even formally created.

The way they released Mistral 7B was also unusual: a magnet link on Twitter. No blog post (that came later). No paper. No PR cycle. Just "here are the weights." The AI research community had never seen a company launch its first product via a torrent link.

My take

Mistral 7B is a signal that the relationship between model size and model quality is not what we assumed. Architecture, training data quality, and training duration matter as much as (or more than) raw parameter count.

If a 7B model can beat a 13B model, maybe a 13B model can beat a 70B model. Maybe a 70B model can beat a 175B model. The Chinchilla insight (more data, fewer parameters) combined with architecture innovations (GQA, sliding window) is compressing what "good" looks like into smaller packages.

I'm updating my mental model. The question is no longer "how big is the model?" It's "how well was it trained?"


If you found this interesting, you might also like:

-- dataku

More from dataku