Pricing Watch · October 7, 2024 · 6 min read

The inference cost of reasoning models: o1 vs Claude 3.5 Sonnet per correct answer

Reasoning models use more tokens to think. But if they get the answer right more often, the cost per CORRECT answer might actually be lower. I ran the math on 500 coding problems. The results surprised me.

Last month I wrote about o1's benchmark scores. This month I went deeper: instead of "how accurate is it?", I asked "how much does accuracy cost?"

The standard way to compare models is cost per million tokens. But that's like comparing cars by fuel cost per mile without asking how often they get you to the right destination. For reasoning models, the right metric is cost per correct answer.

I ran 500 coding problems. Here's the full cost analysis.

The experiment

500 Python coding problems from a mix of LeetCode medium/hard, real GitHub issues, and my own test suite. Each problem has a verifiable correct answer (the code either passes tests or it doesn't).

I ran every problem through five models and logged: total tokens used (input + thinking + output), total cost, and pass/fail.

| Model | Problems solved | Solve rate | Avg tokens per problem | Avg cost per problem |
|-------|----------------|-----------|----------------------|---------------------|
| o1-preview | 341 | 68.2% | 12,400 | $0.78 |
| o1-mini | 298 | 59.6% | 5,800 | $0.08 |
| GPT-4o | 267 | 53.4% | 1,200 | $0.019 |
| Claude 3.5 Sonnet | 289 | 57.8% | 1,400 | $0.022 |
| Llama 3.1 70B | 218 | 43.6% | 1,100 | $0.001 |

Source: My experiment, 500 coding problems, October 2024. All costs include input, thinking (where applicable), and output tokens.

o1-preview uses 12,400 tokens per problem on average (mostly thinking tokens). GPT-4o uses 1,200. That's roughly a 10x difference in token consumption.

o1-preview costs $0.78 per problem vs GPT-4o's $0.019. A 41x cost difference per attempt.

But o1-preview solves 68.2% vs GPT-4o's 53.4%. That's a 14.8 percentage point accuracy advantage.

Cost per correct answer

Here's where the math gets interesting:

| Model | Cost per attempt | Solve rate | Cost per correct answer | Relative to cheapest |
|-------|-----------------|-----------|------------------------|---------------------|
| o1-preview | $0.78 | 68.2% | $1.14 | 475x |
| o1-mini | $0.08 | 59.6% | $0.13 | 54x |
| GPT-4o | $0.019 | 53.4% | $0.036 | 15x |
| Claude 3.5 Sonnet | $0.022 | 57.8% | $0.038 | 16x |
| Llama 3.1 70B | $0.001 | 43.6% | $0.0024 | 1x (baseline) |

Source: My calculations from the experiment data.

On a cost-per-correct-answer basis, Llama 3.1 70B is by far the cheapest at $0.0024. o1-preview at $1.14 is 475x more expensive per correct answer. Even o1-mini at $0.13 is 54x more expensive.
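The arithmetic behind that column is just cost per attempt divided by solve rate. Here is a minimal sketch of it, not the script behind the table. The `RESULTS` dictionary and the `cost_per_correct` name are my own labels for illustration, and the inputs are the rounded figures from the table, so the Llama 3.1 70B row comes out slightly below $0.0024 (presumably because its per-attempt cost is rounded to $0.001).

```python
# Per-attempt cost (USD) and solve rate, copied from the table above.
# These are rounded published figures, not raw experiment logs.
RESULTS = {
    "o1-preview": (0.78, 0.682),
    "o1-mini": (0.08, 0.596),
    "GPT-4o": (0.019, 0.534),
    "Claude 3.5 Sonnet": (0.022, 0.578),
    "Llama 3.1 70B": (0.001, 0.436),
}


def cost_per_correct(cost_per_attempt: float, solve_rate: float) -> float:
    """Average spend per solved problem: cost per attempt / solve rate."""
    if not 0 < solve_rate <= 1:
        raise ValueError("solve_rate must be in (0, 1]")
    return cost_per_attempt / solve_rate


for model, (cost, rate) in RESULTS.items():
    print(f"{model:<20} ${cost_per_correct(cost, rate):.4f} per correct answer")
```

Dividing by the solve rate is the same as dividing total spend by the number of problems solved, so a model that fails half the time effectively doubles its sticker price.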

But wait. What if you could retry the cheaper models until they get it right?

The retry strategy comparison

What if instead of paying $0.78 for one o1-preview attempt, you use that budget to run 41 GPT-4o attempts and take the best one?

| Strategy | Budget per problem | Attempts | P(at least 1 correct) | Cost per correct answer |
|----------|-------------------|----------|----------------------|------------------------|
| 1x o1-preview | $0.78 | 1 | 68.2% | $1.14 |
| 1x o1-mini | $0.08 | 1 | 59.6% | $0.13 |
| 1x GPT-4o | $0.019 | 1 | 53.4% | $0.036 |
| 5x GPT-4o (best of 5) | $0.095 | 5 | 97.8% | $0.097 |
| 10x GPT-4o (best of 10) | $0.19 | 10 | 99.95% | $0.19 |
| 1x Claude 3.5 Sonnet | $0.022 | 1 | 57.8% | $0.038 |
| 5x Claude 3.5 Sonnet | $0.11 | 5 | 98.7% | $0.111 |
| 41x GPT-4o (same budget as o1-preview) | $0.78 | 41 | >99.99% | $0.78 |

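Source: My calculations. P(at least 1 correct in N tries) = 1 - (1 - solve_rate)^N, which treats each attempt as independent. Cost per correct answer = budget / P(at least 1 correct).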

At the same budget ($0.78), running GPT-4o 41 times gives you, on paper, a better than 99.99% chance of at least one correct solution vs o1-preview's 68.2% on a single try.
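For anyone who wants to plug in their own numbers, here is a small sketch of the best-of-N formula quoted above. The function names are mine, and the big assumption is baked into the formula itself: every attempt is treated as an independent coin flip with the same success probability on every problem. Real failures tend to cluster on the hard problems, so treat the output as an upper bound.

```python
def p_at_least_one(solve_rate: float, n: int) -> float:
    """P(at least one correct in n tries), assuming independent attempts."""
    return 1.0 - (1.0 - solve_rate) ** n


def best_of_n(cost_per_attempt: float, solve_rate: float, n: int):
    """Return (budget, success probability, cost per correct answer).

    Assumes all n attempts are always paid for, with no early stopping.
    """
    budget = cost_per_attempt * n
    p = p_at_least_one(solve_rate, n)
    return budget, p, budget / p


# Per-attempt cost and solve rate for GPT-4o, taken from the tables above.
for n in (1, 5, 10, 41):
    budget, p, per_correct = best_of_n(0.019, 0.534, n)
    print(f"{n:>2}x GPT-4o: budget ${budget:.3f}, "
          f"P(success) {p:.4%}, ${per_correct:.3f} per correct")
```

Note that the cost per correct answer rises with N here only because this version pays for every attempt, even after one has already succeeded.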

But this comparison is misleading! The retry strategy requires you to know which answer is correct (you need a test suite). o1-preview gives you the right answer more often on the first try without needing verification infrastructure.
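To make the "verification infrastructure" point concrete, here is a rough sketch of what a retry loop needs. Everything in it is hypothetical: `generate` stands in for a model call, `make_verifier` builds a toy test suite for a function named `add`, and the two canned candidates replace real model output. The point is only that without something like `passes_tests`, the loop has no way to know when to stop.

```python
from typing import Callable, Optional, Tuple


def solve_with_retries(
    generate: Callable[[], str],
    passes_tests: Callable[[str], bool],
    max_attempts: int,
) -> Tuple[Optional[str], int]:
    """Sample candidates until one passes the tests or the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate()
        if passes_tests(candidate):
            return candidate, attempt
    return None, max_attempts


def make_verifier(cases):
    """Build a toy verifier that checks a candidate `add` function."""
    def passes_tests(source: str) -> bool:
        namespace = {}
        try:
            # Demo only: real model output should run in a sandbox.
            exec(source, namespace)
            return all(namespace["add"](*args) == want for args, want in cases)
        except Exception:
            return False
    return passes_tests


# Canned "model outputs": the first is wrong, the second is right.
candidates = ["def add(a, b): return a - b", "def add(a, b): return a + b"]
generate = iter(candidates).__next__
verifier = make_verifier([((1, 2), 3), ((0, 0), 0)])

solution, attempts_used = solve_with_retries(generate, verifier, max_attempts=5)
print(f"solved after {attempts_used} attempt(s)")
```

Because the loop stops at the first pass, the spend is `attempts_used` times the per-attempt cost rather than the full budget.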

When each model actually makes sense

| Scenario | Best choice | Why |
|----------|------------|-----|
| You have a test suite and can verify answers | GPT-4o (5 retries) | Cheapest path to roughly 98% accuracy |
| Single shot, must be right first time | o1-preview | 68.2% first-try accuracy is best |
| High volume, acceptable error rate | Claude 3.5 Sonnet or GPT-4o | Best cost/quality balance |
| Budget constrained, quality flexible | Llama 3.1 70B | $0.0024 per correct answer |
| Math-heavy or competition-style problems | o1-preview | MATH score advantage is enormous |
| Standard coding tasks | Claude 3.5 Sonnet | 57.8% solve rate at $0.022/attempt |

The key insight: o1's value proposition is about first-try accuracy on hard problems, not about cost efficiency. If you can afford to retry and verify, cheaper models are more economical. If you need the answer right on the first attempt (interactive use, real-time systems, situations where you can't verify), o1 justifies its premium.

Breaking it down by difficulty

| Difficulty | o1-preview solve rate | GPT-4o solve rate | o1 premium justified? |
|-----------|----------------------|--------------------|-----------------------|
| Easy (LeetCode Easy) | 94% | 89% | No (small gap, high cost) |
| Medium (LeetCode Medium) | 76% | 58% | Maybe (18 point gap) |
| Hard (LeetCode Hard) | 52% | 21% | Yes (31 point gap) |
| Very Hard (competition) | 38% | 6% | Absolutely (32 point gap) |

Source: My experiment, problems categorized by difficulty, October 2024.

On easy problems, o1's accuracy edge (94% vs 89%) doesn't justify the 41x cost premium. On hard problems, o1's edge (52% vs 21%) is enormous, and the alternative (running GPT-4o multiple times) is less likely to converge.

The rule is simple: use o1 for hard problems. Use GPT-4o or Claude 3.5 Sonnet for everything else.

My updated pricing intuition

I used to think about AI costs purely in terms of tokens. Now I think about them in terms of correct answers. The shift from "cost per token" to "cost per correct answer" changes which model looks expensive and which looks cheap.

Reasoning models aren't expensive if the problem is hard enough. They're expensive if the problem is easy and you're overpaying for accuracy you don't need.

The spreadsheet got a new column: "difficulty-adjusted cost efficiency." I think this is how we'll evaluate all models going forward. Not just how much they cost, but how much the right answer costs.
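One possible way to fill in a column like that, sketched under a loud assumption: the post only reports average cost per attempt, so this reuses the same $0.78 and $0.019 for every difficulty tier, even though harder problems almost certainly burn more tokens. The `difficulty_adjusted` function and the "premium" ratio are my own illustration, not a definition of the spreadsheet column.

```python
# Average cost per attempt (USD), assumed flat across difficulty tiers.
COST_PER_ATTEMPT = {"o1-preview": 0.78, "GPT-4o": 0.019}

# Solve rates by difficulty tier, from the difficulty table above.
SOLVE_RATES = {
    "Easy": {"o1-preview": 0.94, "GPT-4o": 0.89},
    "Medium": {"o1-preview": 0.76, "GPT-4o": 0.58},
    "Hard": {"o1-preview": 0.52, "GPT-4o": 0.21},
    "Very Hard": {"o1-preview": 0.38, "GPT-4o": 0.06},
}


def difficulty_adjusted(cost_per_attempt, solve_rates):
    """Cost per correct answer per tier, plus the o1-preview/GPT-4o ratio."""
    rows = {}
    for tier, rates in solve_rates.items():
        per_correct = {m: cost_per_attempt[m] / r for m, r in rates.items()}
        premium = per_correct["o1-preview"] / per_correct["GPT-4o"]
        rows[tier] = {**per_correct, "premium": premium}
    return rows


for tier, row in difficulty_adjusted(COST_PER_ATTEMPT, SOLVE_RATES).items():
    print(f"{tier:<10} o1-preview ${row['o1-preview']:.2f}  "
          f"GPT-4o ${row['GPT-4o']:.3f}  premium {row['premium']:.1f}x")
```

Under that flat-cost assumption, the o1-preview premium per correct answer shrinks as problems get harder, which is the same direction as the rule of thumb above.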


-- dataku
