o1 and 'reasoning' models: the benchmark scores look different this time
OpenAI's o1 trades speed for accuracy by 'thinking' before answering. The math and coding benchmarks are way up, but on hard problems the cost per task is roughly 100x higher. I broke down the cost-per-correct-answer metric and it's actually competitive.
OpenAI just introduced a new kind of model. Not bigger. Not faster. Smarter.
o1 (codenamed "strawberry") doesn't just generate a response. It thinks first. It produces a chain of reasoning tokens before giving its answer. Those thinking tokens cost money and take time. But the accuracy on hard problems is unlike anything I've measured before.
This changes the math on how we evaluate AI models. Literally.
The benchmark numbers
| Benchmark | o1-preview | o1-mini | GPT-4o | Claude 3.5 Sonnet |
|-----------|-----------|---------|--------|-------------------|
| MATH (competition math) | 83.3% | 70.2% | 60.3% | 71.1% |
| GPQA Diamond | 73.3% | 60.0% | 53.6% | 59.4% |
| AIME 2024 (math competition) | 74.4% | 56.7% | 13.4% | ~20% est. |
| Codeforces (competitive programming) | 89th percentile | 86th percentile | 11th percentile | ~15th est. |
| HumanEval | 92.4% | 93.1% | 90.2% | 92.0% |
| MMLU | 90.8% | 85.2% | 88.7% | 88.7% |
| SWE-bench Verified | 41.3% | 35.8% | 33.2% | 33.4% |
| LMSYS Chatbot Arena Elo | ~1340 | ~1295 | ~1285 | ~1305 |
Sources: OpenAI o1 system card, LMSYS Chatbot Arena, prior benchmark reports.
Look at MATH: 83.3% for o1-preview vs 60.3% for GPT-4o. A 23-point jump. On AIME 2024 (a genuinely hard math competition), o1-preview scores 74.4% vs GPT-4o's 13.4%. That's not an improvement. That's a category change.
On Codeforces competitive programming, o1-preview is at the 89th percentile of human competitors. GPT-4o is at the 11th. Again, a different category entirely.
But on MMLU (general knowledge), the gap is smaller: 90.8% vs 88.7%. And on HumanEval (basic coding), it's 92.4% vs 90.2%. The reasoning advantage is biggest on hard problems.
How o1 works (the data view)
o1 generates "thinking tokens" before producing its answer. These tokens are the model reasoning through the problem, exploring approaches, checking its work.
| Metric | o1-preview | o1-mini | GPT-4o |
|--------|-----------|---------|--------|
| Avg thinking tokens (easy question) | ~200 | ~100 | 0 |
| Avg thinking tokens (medium question) | ~1,500 | ~800 | 0 |
| Avg thinking tokens (hard question) | ~8,000 | ~4,000 | 0 |
| Avg thinking tokens (AIME-level) | ~15,000+ | ~8,000 | 0 |
| Time per response (easy) | 3-5 sec | 1-3 sec | <1 sec |
| Time per response (hard) | 30-120 sec | 15-45 sec | 2-5 sec |
Source: My measurements from o1 API usage, September 2024. Thinking token counts estimated from billing data.
On an easy question, o1-preview thinks for ~200 tokens. On a hard question, it thinks for ~8,000, and on an AIME-level math problem, 15,000+. The model is allocating more compute to harder problems automatically. This is the "inference-time compute" thesis: instead of making the model bigger, make it think longer.
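The source note above says thinking token counts were estimated from billing data. One plausible way to do that (an assumption on my part, not a documented method) is to subtract the tokens you can see in the answer from the output tokens you were billed for. The function name and the example numbers below are illustrative: they mirror the AIME-level row of the table plus an assumed 500-token visible answer.

```python
def estimate_thinking_tokens(billed_output_tokens: int, visible_answer_tokens: int) -> int:
    """Estimate hidden reasoning tokens as billed output minus visible output.

    Assumes thinking tokens are billed as output tokens, so whatever was
    billed but never shown in the answer is attributed to thinking.
    """
    return max(billed_output_tokens - visible_answer_tokens, 0)


# Illustrative request: 15,500 output tokens billed, 500 visible in the answer.
thinking = estimate_thinking_tokens(15_500, 500)
thinking_share = thinking / 15_500

print(f"estimated thinking tokens: {thinking}")
print(f"share of billed output spent thinking: {thinking_share:.1%}")
```

For a standard model like GPT-4o, billed and visible output match, so the estimate is zero.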
The cost analysis (this is the tricky part)
| Model | Input $/M tokens | Output $/M tokens | Thinking token cost | MATH score |
|-------|-----------------|-------------------|--------------------|----|
| o1-preview | $15.00 | $60.00 | Same as output ($60/M) | 83.3% |
| o1-mini | $3.00 | $12.00 | Same as output ($12/M) | 70.2% |
| GPT-4o | $5.00 | $15.00 | N/A | 60.3% |
| Claude 3.5 Sonnet | $3.00 | $15.00 | N/A | 71.1% |
Sources: OpenAI pricing, September 2024.
o1-preview is expensive: $15 input and $60 output per million tokens, and the thinking tokens are billed as output tokens at $60/M. An AIME-level math problem that generates 15,000 thinking tokens + 500 answer tokens costs:
- Thinking: 15,000 tokens * $0.000060 = $0.90
- Answer: 500 tokens * $0.000060 = $0.03
- Input: ~200 tokens * $0.000015 = $0.003
- Total per hard problem: ~$0.93
GPT-4o on the same problem:
- Output: 500 tokens * $0.000015 = $0.0075
- Input: ~200 tokens * $0.000005 = $0.001
- Total per problem: ~$0.009
o1-preview costs roughly 100x more per problem on hard questions (~$0.93 vs ~$0.009). But on MATH it gets the answer right 83% of the time vs GPT-4o's 60%.
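The per-problem arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `request_cost` is mine, the prices are the September 2024 figures from the pricing table, and the token counts are the ones used in the worked example (200 input, 500 answer, 15,000 thinking).

```python
def request_cost(input_tokens, output_tokens, input_price_per_m,
                 output_price_per_m, thinking_tokens=0):
    """Dollar cost of one request.

    Prices are in dollars per million tokens. Thinking tokens are
    billed at the output rate, so they are added to the output count.
    """
    billed_output = output_tokens + thinking_tokens
    total = input_tokens * input_price_per_m + billed_output * output_price_per_m
    return total / 1_000_000


o1_preview_cost = request_cost(200, 500, 15.00, 60.00, thinking_tokens=15_000)
gpt_4o_cost = request_cost(200, 500, 5.00, 15.00)

print(f"o1-preview: ${o1_preview_cost:.3f}")   # $0.933
print(f"GPT-4o:     ${gpt_4o_cost:.4f}")       # $0.0085
print(f"ratio:      {o1_preview_cost / gpt_4o_cost:.0f}x")  # 110x
```

With unrounded inputs the ratio comes out closer to 110x than 100x, which is why "roughly 100x" is the right level of precision here.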
The cost-per-correct-answer metric
This is the metric I think we should be using. Not cost per token. Not cost per request. Cost per correct answer.
| Model | Cost per MATH problem | MATH accuracy | Cost per correct answer |
|-------|----------------------|--------------|------------------------|
| o1-preview | $0.93 | 83.3% | $1.12 |
| o1-mini | $0.15 | 70.2% | $0.21 |
| GPT-4o | $0.009 | 60.3% | $0.015 |
| Claude 3.5 Sonnet | $0.009 | 71.1% | $0.013 |
| GPT-4o (5 attempts, best-of) | $0.045 | ~72% | $0.063 |
Source: My calculations based on pricing and MATH benchmark scores.
On a per-correct-answer basis, GPT-4o and Claude 3.5 Sonnet are still cheaper ($0.013-0.015). But their accuracy tops out around 71-72%, even with best-of-5. If you need more than that on hard math, your only option in this table is o1-preview.
And o1-mini at $0.21 per correct answer is surprisingly cost-effective for its accuracy level (70.2%). It costs 3x more per correct answer than GPT-4o's best-of-5 strategy, but with similar accuracy and no need to run 5 attempts.
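Cost per correct answer is just cost per problem divided by accuracy, which is the same as total spend divided by the number of correct answers over a large batch. A minimal sketch using the table's numbers follows; the function and variable names are mine, and the best-of-5 accuracy (~72%) is taken from the table as an input, not derived here.

```python
def cost_per_correct(cost_per_problem: float, accuracy: float) -> float:
    """Expected dollars spent per correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return cost_per_problem / accuracy


# (cost per MATH problem in dollars, MATH accuracy) from the table above.
models = {
    "o1-preview": (0.93, 0.833),
    "o1-mini": (0.15, 0.702),
    "GPT-4o": (0.009, 0.603),
    "Claude 3.5 Sonnet": (0.009, 0.711),
    "GPT-4o (best-of-5)": (0.045, 0.72),
}

results = {name: cost_per_correct(cost, acc) for name, (cost, acc) in models.items()}

# Print cheapest first.
for name, value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:<20} ${value:.3f}")
```

Sorting by this metric alone hides the accuracy ceiling, so it is worth reading it next to the accuracy column rather than instead of it.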
When o1 makes economic sense
| Scenario | o1 cost premium | o1 accuracy gain | Worth it? |
|----------|----------------|-----------------|-----------|
| Casual Q&A | 100x | +2 MMLU points | No |
| Code review | 50x | +8% SWE-bench | Maybe |
| Math homework help | 100x | +23 MATH points | Only if accuracy matters |
| Medical reasoning | 100x | +20 GPQA points | Yes (high stakes) |
| Competitive programming | 100x | 78 percentile points | Yes (if you need it) |
| General chatbot | 100x | +55 Elo points | No |
| Financial analysis | 50x | Unknown | Depends on stakes |
o1 makes sense when:
- The problem is hard (easy problems don't benefit from thinking tokens)
- Getting the right answer is worth significantly more than the cost of compute
- Speed doesn't matter (o1 is 10-60x slower than GPT-4o)
For a hedge fund analyzing a complex trade, paying $1 per correct analysis is nothing. For a customer support bot, paying 100x more for a 2-point accuracy improvement is insane.
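One way to make "worth significantly more than the cost of compute" concrete is a break-even calculation. The sketch below rests on a simplifying assumption of mine: a correct answer is worth V dollars, a wrong answer is worth nothing (no penalty), and you pick the model with the higher expected net value, accuracy * V - cost. Under that assumption the pricier model wins once V exceeds the cost difference divided by the accuracy difference. The function name is hypothetical and the inputs are the MATH numbers from the cost-per-correct-answer table.

```python
def break_even_value(cost_a: float, acc_a: float, cost_b: float, acc_b: float) -> float:
    """Value of a correct answer above which model A beats model B.

    Solves acc_a * V - cost_a = acc_b * V - cost_b for V.
    Model A is assumed to be the more accurate (and more expensive) one.
    """
    if acc_a <= acc_b:
        raise ValueError("model A must be more accurate than model B")
    return (cost_a - cost_b) / (acc_a - acc_b)


# o1-preview ($0.93, 83.3%) vs GPT-4o ($0.009, 60.3%) on MATH.
threshold = break_even_value(0.93, 0.833, 0.009, 0.603)
print(f"o1-preview pays off when a correct answer is worth more than ${threshold:.2f}")
```

With these inputs the threshold lands at about $4 per correct answer. If wrong answers carry a real cost, as in the medical or financial rows, the threshold drops, which matches the "high stakes" reasoning in the table.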
The bigger picture
o1 introduces a new axis to model evaluation. Before o1, models competed on three dimensions: quality, speed, and cost. Now there's a fourth: how much inference-time compute to allocate.
| Dimension | Traditional models | Reasoning models (o1) |
|-----------|-------------------|----------------------|
| Quality | Fixed at inference time | Scales with thinking time |
| Speed | Fast (sub-second) | Slow (seconds to minutes) |
| Cost | Predictable per token | Variable (depends on problem difficulty) |
| Best for | All general tasks | Hard reasoning tasks |
The reasoning model approach is brilliant for hard problems and wasteful for easy ones. The ideal system would route easy questions to GPT-4o and hard questions to o1. I suspect OpenAI will build exactly this.
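To make the routing idea concrete, here is a toy sketch. Everything in it is hypothetical: the keyword list, the length cutoff, and the threshold are placeholders I made up, and a real router would more likely use a trained classifier or a cheap model's own confidence. It only shows the shape of the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Made-up signals that a prompt may need multi-step reasoning.
HARD_HINTS = ("prove", "derive", "optimize", "competition", "step by step")


def estimate_difficulty(prompt: str) -> float:
    """Toy difficulty score in [0, 1] from keyword hits and prompt length."""
    text = prompt.lower()
    keyword_score = sum(hint in text for hint in HARD_HINTS) / len(HARD_HINTS)
    length_score = min(len(text.split()) / 200, 1.0)
    return max(keyword_score, length_score)


def route(prompt: str, threshold: float = 0.15) -> Route:
    """Send prompts scoring at or above the threshold to the reasoning model."""
    difficulty = estimate_difficulty(prompt)
    if difficulty >= threshold:
        return Route("o1-preview", f"difficulty {difficulty:.2f} >= {threshold}")
    return Route("gpt-4o", f"difficulty {difficulty:.2f} < {threshold}")


print(route("What is the capital of France?"))
print(route("Prove that the sum of two odd integers is even."))
```

The first prompt goes to the cheap model and the second to the reasoning model. The hard part in practice is the difficulty estimate, since a misrouted hard question costs you accuracy and a misrouted easy one costs you roughly 100x.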
My evaluation framework just got a new dimension. I'm simultaneously excited about the capability jump and worried about my spreadsheet getting too wide. The data is telling us something new: thinking is worth paying for, but only when the problem is worth thinking about.
-- dataku