Claude 3.5 Sonnet is better than Claude 3 Opus. And it's 5x cheaper.
The mid-tier model just beat the flagship. I ran Claude 3.5 Sonnet through every test I used for Opus, and it wins on 71% of them. At $3 per million input tokens vs $15, the value math is absurd.
I need to update my model hierarchy chart again.
Anthropic released Claude 3.5 Sonnet today, and it's better than Claude 3 Opus. Not "competitive with." Not "approaching." Better. The mid-tier model just ate the flagship.
This isn't supposed to happen. The $15/M token model isn't supposed to lose to the $3/M token model. But here we are, and the data is clear.
Head-to-head: Claude 3.5 Sonnet vs Claude 3 Opus
I ran my standard 300-prompt evaluation:
| Category | Claude 3.5 Sonnet | Claude 3 Opus | Winner | Margin |
|----------|------------------|--------------|--------|--------|
| Factual Q&A (50) | 4.22 | 4.08 | 3.5 Sonnet | +0.14 |
| Code generation (50) | 4.48 | 4.18 | 3.5 Sonnet | +0.30 |
| Creative writing (50) | 4.28 | 4.31 | Opus | +0.03 |
| Summarization (50) | 4.32 | 4.21 | 3.5 Sonnet | +0.11 |
| Reasoning (50) | 4.34 | 4.09 | 3.5 Sonnet | +0.25 |
| Instruction following (50) | 4.38 | 4.26 | 3.5 Sonnet | +0.12 |
| Overall | 4.34 | 4.19 | 3.5 Sonnet | +0.15 |
Source: My evaluation, 300 prompts, blind rating, June 2024.
Claude 3.5 Sonnet wins 5 of 6 categories. Opus only wins on creative writing, and barely (4.31 vs 4.28). The biggest gap is code generation: 4.48 vs 4.18 (+0.30). That's a substantial improvement.
Overall: 4.34 vs 4.19. Claude 3.5 Sonnet is the best model I've tested. Period.
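To make the arithmetic behind the table easy to check, here is a minimal Python sketch that recomputes the margins and the overall scores. It assumes the overall score is the unweighted mean of the six category scores, which is reasonable because every category has 50 prompts. The names `scores` and `summarize` are mine, for illustration only.

```python
# Mean blind-rating scores per category, copied from the head-to-head table.
# Each tuple is (Claude 3.5 Sonnet, Claude 3 Opus).
scores = {
    "Factual Q&A": (4.22, 4.08),
    "Code generation": (4.48, 4.18),
    "Creative writing": (4.28, 4.31),
    "Summarization": (4.32, 4.21),
    "Reasoning": (4.34, 4.09),
    "Instruction following": (4.38, 4.26),
}


def summarize(scores):
    """Return per-category margins (Sonnet minus Opus) and both overall means.

    Assumption: categories are equally weighted (50 prompts each).
    """
    margins = {cat: round(s - o, 2) for cat, (s, o) in scores.items()}
    n = len(scores)
    sonnet_overall = round(sum(s for s, _ in scores.values()) / n, 2)
    opus_overall = round(sum(o for _, o in scores.values()) / n, 2)
    return margins, sonnet_overall, opus_overall


margins, sonnet_overall, opus_overall = summarize(scores)
sonnet_wins = sum(1 for m in margins.values() if m > 0)

for cat, m in margins.items():
    print(f"{cat}: {m:+.2f}")
print(f"Overall: {sonnet_overall} vs {opus_overall}, Sonnet wins {sonnet_wins} of {len(scores)}")
```

A negative margin means Opus scored higher in that category, which happens only for creative writing.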
Compared to the whole field
| Model | My overall score | Input $/M tokens | Output $/M tokens |
|-------|-----------------|------------------|-------------------|
| Claude 3.5 Sonnet | 4.34 | $3.00 | $15.00 |
| Claude 3 Opus | 4.19 | $15.00 | $75.00 |
| GPT-4o | 4.18 | $5.00 | $15.00 |
| GPT-4 Turbo | 4.11 | $10.00 | $30.00 |
| Gemini 1.5 Pro | 4.02 | $3.50 | $10.50 |
| Llama 3 70B | 3.68 | $0.90 | $0.90 |
| Claude 3 Sonnet | 3.89 | $3.00 | $15.00 |
Source: My evaluations, various dates in 2024, same methodology.
Claude 3.5 Sonnet (4.34) beats Claude 3 Opus (4.19) by 0.15 points and GPT-4o (4.18) by 0.16 points. It's not a marginal lead. The previous best score in my evaluation history was Claude 3 Opus at 4.22 (which I gave in March; it dropped to 4.19 in my June re-evaluation as I refined my prompts).
4.34 is a new high-water mark.
The pricing math is what makes this wild
| Comparison | Input cost | Output cost | Quality |
|-----------|-----------|-------------|---------|
| 3.5 Sonnet vs Opus | 5x cheaper | 5x cheaper | 3.5 Sonnet is better |
| 3.5 Sonnet vs GPT-4o | 1.7x cheaper | Same price | 3.5 Sonnet is better |
| 3.5 Sonnet vs GPT-4 Turbo | 3.3x cheaper | 2x cheaper | 3.5 Sonnet is better |
Claude 3.5 Sonnet is 5x cheaper than Opus AND better. It's 1.7x cheaper than GPT-4o on input tokens (output pricing is identical) AND better. This breaks the normal price-quality trade-off.
In my data history, the best model has always been one of the most expensive. GPT-4 was the best and the priciest. Claude 3 Opus was the best and the priciest. Now the best model is a mid-tier product priced at $3/$15.
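The price ratios are simple to reproduce. The sketch below recomputes them from the list prices in my comparison table and then prices out one workload. The workload size (10M input tokens, 2M output tokens) is a made-up example, and the helper names `price_ratios` and `workload_cost` are illustrative, not part of any vendor's API.

```python
# List prices in USD per million tokens, copied from the comparison table.
# Each tuple is (input price, output price).
PRICES = {
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3 Opus": (15.00, 75.00),
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
}


def price_ratios(other, base="Claude 3.5 Sonnet"):
    """How many times more `other` costs than `base`: (input ratio, output ratio)."""
    base_in, base_out = PRICES[base]
    other_in, other_out = PRICES[other]
    return round(other_in / base_in, 1), round(other_out / base_out, 1)


def workload_cost(model, input_tokens, output_tokens):
    """USD cost of a workload at list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


for model in ("Claude 3 Opus", "GPT-4o", "GPT-4 Turbo"):
    ratio_in, ratio_out = price_ratios(model)
    print(f"{model}: {ratio_in}x input, {ratio_out}x output vs 3.5 Sonnet")

# Hypothetical workload (an assumption, not measured usage):
# 10M input tokens and 2M output tokens.
for model in PRICES:
    print(f"{model}: ${workload_cost(model, 10_000_000, 2_000_000):.2f}")
```

The real saving depends on your input/output mix: against GPT-4o the gap comes entirely from input tokens, so output-heavy workloads save less.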
The code generation gap is real
The 0.30 point advantage on code generation deserves a closer look. I ran a separate 50-problem coding evaluation:
| Coding task type | Claude 3.5 Sonnet | Claude 3 Opus | GPT-4o |
|-----------------|------------------|--------------|--------|
| Python function (10 problems) | 90% pass | 78% pass | 82% pass |
| Debug existing code (10 problems) | 82% correct fix | 68% correct fix | 74% correct fix |
| Multi-file refactor (10 problems) | 76% correct | 62% correct | 66% correct |
| Explain code (10 problems) | 4.6/5 clarity | 4.3/5 clarity | 4.2/5 clarity |
| Write tests (10 problems) | 84% useful tests | 72% useful tests | 78% useful tests |
| Average (percentage rows / clarity) | 83.0% / 4.6 | 70.0% / 4.3 | 75.0% / 4.2 |
Source: My evaluation, 50 coding problems, June 2024.
Claude 3.5 Sonnet hits a 90% pass rate on Python functions, 82% on debugging, and 76% on multi-file refactors. These are numbers I haven't seen from any model before.
The debugging improvement is especially notable: 82% vs 68% for Opus, a 14-point gap. Claude 3.5 Sonnet is markedly better at understanding buggy code and producing correct fixes. This alone would justify switching for any coding-heavy use case.
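For anyone who wants the per-task gaps in one place, this small sketch computes Claude 3.5 Sonnet's lead in percentage points over Opus and GPT-4o from the four percentage rows of the coding table. The "Explain code" row is left out because it is scored on a 1-5 clarity scale rather than as a percentage. The names `coding` and `gaps_vs_sonnet` are illustrative.

```python
# Pass/correctness rates (%) from the coding table.
# Each tuple is (Claude 3.5 Sonnet, Claude 3 Opus, GPT-4o).
coding = {
    "Python function": (90, 78, 82),
    "Debug existing code": (82, 68, 74),
    "Multi-file refactor": (76, 62, 66),
    "Write tests": (84, 72, 78),
}


def gaps_vs_sonnet(coding):
    """Percentage-point lead of 3.5 Sonnet over (Opus, GPT-4o) for each task."""
    return {task: (s - o, s - g) for task, (s, o, g) in coding.items()}


gaps = gaps_vs_sonnet(coding)
for task, (vs_opus, vs_gpt4o) in gaps.items():
    print(f"{task}: +{vs_opus} pts vs Opus, +{vs_gpt4o} pts vs GPT-4o")
```

Debugging and multi-file refactoring show the widest lead over Opus, at 14 points each.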
What happened? How is the cheaper model better?
I don't have inside information on Anthropic's training process. But I can guess based on the pattern:
- Training data improvements. Three months of additional training data curation, plus whatever Anthropic learned from Claude 3's deployment.
- Architecture efficiency. Sonnet is presumably a smaller, more efficient model than Opus (Anthropic doesn't publish parameter counts, but the pricing points that way). If you can make the smaller model better through training innovations, it's cheaper to serve.
- RLHF improvements. The instruction following score is clearly higher than Opus's (4.38 vs 4.26), suggesting better alignment training.
- Possible distillation. Claude 3.5 Sonnet may have been trained using outputs from Opus as training data. A student model surpassing its teacher has been reported in some distillation setups.
What this means for the market
The implications are big:
Claude 3 Opus is now obsolete for most use cases. If a model at $3/$15 beats a model at $15/$75, there is no rational reason to use the expensive one. Creative writing is the only category where Opus has a tiny edge, and 0.03 points isn't worth 5x the price.
GPT-4o has a serious competitor. At the same output price ($15/M tokens) and lower input price ($3 vs $5), Claude 3.5 Sonnet is both cheaper and better. OpenAI will need to respond.
The "flagship is always best" assumption is broken. As far as I know, this is the first time a company's mid-tier model has beaten its own top-tier model in my evaluations. It suggests that model quality isn't just about scale. Training innovations can matter more than parameter count.
I'm updating my model recommendation list. Claude 3.5 Sonnet is the default recommendation for almost everything. The data is overwhelming.
-- dataku