Pricing WatchSeptember 22, 20254 min read

o3 and the reasoning model cost problem

OpenAI's o3 uses up to 10x the tokens of a standard model to "think." On hard math problems, a single o3 query can cost $2. I measured the token consumption across 100 problems and the variance is massive: 500 tokens to 50,000.

OpenAI's o3 is their most powerful reasoning model. It also has the most unpredictable cost profile of any model I've tested.

I measured token consumption across 100 problems. The variance is staggering.

Token consumption distribution

| Problem difficulty | Avg thinking tokens | Avg output tokens | Total avg | Min total | Max total | |-------------------|--------------------|--------------------|-----------|-----------|-----------| | Easy (25 problems) | 1,200 | 350 | 1,550 | 480 | 4,200 | | Medium (25 problems) | 4,800 | 520 | 5,320 | 1,800 | 12,400 | | Hard (25 problems) | 14,200 | 680 | 14,880 | 4,600 | 38,000 | | Very hard (25 problems) | 28,400 | 820 | 29,220 | 8,200 | 52,000 |

Sources: My measurements, 100 math and reasoning problems on o3, September 2025.

On easy problems, o3 uses about 1,200 thinking tokens. On very hard problems, it uses 28,400. A 24x range.

The max column is alarming. One very hard problem consumed 52,000 tokens. At o3 pricing, that single query cost about $3.12.

The cost per problem

| Difficulty | Avg cost per problem | Avg cost per correct answer | o3 accuracy | |-----------|---------------------|---------------------------|------------| | Easy | $0.093 | $0.097 | 96% | | Medium | $0.32 | $0.36 | 88% | | Hard | $0.89 | $1.13 | 79% | | Very hard | $1.75 | $2.92 | 60% |

Sources: OpenAI pricing, my measurements.

Easy problems: $0.09. Very hard problems: $1.75. At the "very hard" tier, the cost per correct answer jumps to $2.92 because o3 only gets 60% of them right. You're paying for wrong answers too.

Comparison with other reasoning models

| Model | Avg cost per hard problem | Accuracy on hard | Cost per correct (hard) | |-------|--------------------------|-----------------|----------------------| | o3 | $0.89 | 79% | $1.13 | | Claude Opus 4 (thinking) | $0.18 | 76% | $0.24 | | DeepSeek R1 | $0.024 | 74% | $0.032 | | Gemini 2.5 Pro (thinking) | $0.11 | 72% | $0.15 | | o3-mini | $0.08 | 68% | $0.12 |

Sources: OpenAI, Anthropic, DeepSeek pricing, my evaluation.

o3 has the highest accuracy on hard problems (79%). But it costs 35x more than DeepSeek R1 per correct answer ($1.13 vs $0.032).

The 5-point accuracy advantage of o3 over Claude Opus 4 thinking (79% vs 76%) costs 4.7x more. That premium buys you: 3 extra correct answers out of 100, at a cost of $89 vs $18.

The variance problem

The real issue isn't the average cost. It's the variance.

| Metric | o3 | DeepSeek R1 | Claude Opus 4 (thinking) | |--------|-----|-------------|-------------------------| | Avg tokens (hard problem) | 14,880 | 7,800 | 8,400 | | Std deviation | 11,200 | 2,400 | 3,100 | | Coefficient of variation | 75% | 31% | 37% | | Max tokens observed | 52,000 | 18,000 | 22,000 |

o3's coefficient of variation (75%) is more than double DeepSeek R1's (31%). This means o3's cost is roughly twice as unpredictable.

For budgeting purposes, this variance is a nightmare. You can't reliably predict your monthly o3 bill because the token consumption per query varies by 100x depending on problem difficulty.

When o3 makes sense

| Scenario | Use o3? | Why | |----------|---------|-----| | Research math at competition level | Yes | Highest accuracy, budget is secondary | | High-stakes reasoning (legal, medical) | Maybe | Accuracy premium matters, but verify outputs | | Production app with budget constraints | No | Variance makes budgeting impossible | | Coding agent tasks | No | Claude Opus 4 is better at coding and cheaper | | Cost-sensitive batch processing | No | DeepSeek R1 at 35x cheaper |

o3 is the model you use when accuracy on the hardest problems is worth any cost. For everything else, the reasoning model market offers better price-to-performance ratios.

My API bill for this 100-problem experiment: $62.40. That's more than I typically spend in a week. The cost problem isn't theoretical. It's the number on my invoice.


If you found this interesting, you might also like:

-- dataku

More from dataku