o3 and the reasoning model cost problem

OpenAI's o3 is their most powerful reasoning model. It also has the most unpredictable cost profile of any model I've tested.

I measured token consumption across 100 problems. The variance is staggering.

Token consumption distribution

| Problem difficulty | Avg thinking tokens | Avg output tokens | Total avg | Min total | Max total |
|-------------------|--------------------|--------------------|-----------|-----------|-----------|
| Easy (25 problems) | 1,200 | 350 | 1,550 | 480 | 4,200 |
| Medium (25 problems) | 4,800 | 520 | 5,320 | 1,800 | 12,400 |
| Hard (25 problems) | 14,200 | 680 | 14,880 | 4,600 | 38,000 |
| Very hard (25 problems) | 28,400 | 820 | 29,220 | 8,200 | 52,000 |

Sources: My measurements, 100 math and reasoning problems on o3, September 2025.

On easy problems, o3 uses about 1,200 thinking tokens. On very hard problems, it uses 28,400. A 24x range.

The max column is alarming. One very hard problem consumed 52,000 tokens. At o3 pricing, that single query cost about $3.12.
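The arithmetic behind that figure can be sketched in a few lines. The snippet assumes a flat rate of $60 per million tokens, which is inferred from the numbers in this post ($3.12 for 52,000 tokens) and is not a quoted price; the function and constant names are illustrative.

```python
# Assumption: a flat blended rate inferred from this post's own figures
# ($3.12 / 52,000 tokens). It is not taken from a price sheet.
ASSUMED_USD_PER_MILLION_TOKENS = 60.0


def query_cost_usd(total_tokens: int,
                   usd_per_million: float = ASSUMED_USD_PER_MILLION_TOKENS) -> float:
    """Cost of one query given its total token count (thinking + output)."""
    return total_tokens * usd_per_million / 1_000_000


# Average total tokens per difficulty tier, from the table above.
avg_total_tokens = {
    "easy": 1_550,
    "medium": 5_320,
    "hard": 14_880,
    "very hard": 29_220,
}

for tier, tokens in avg_total_tokens.items():
    print(f"{tier}: ${query_cost_usd(tokens):.3f} per query on average")

# The single most expensive query observed.
print(f"worst case: ${query_cost_usd(52_000):.2f}")
```

Under that assumed rate, the cheapest observed query (480 tokens) works out to about $0.03, so the spread between the cheapest and most expensive query is roughly two orders of magnitude.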

The cost per problem

| Difficulty | Avg cost per problem | Avg cost per correct answer | o3 accuracy |
|-----------|---------------------|---------------------------|------------|
| Easy | $0.093 | $0.097 | 96% |
| Medium | $0.32 | $0.36 | 88% |
| Hard | $0.89 | $1.13 | 79% |
| Very hard | $1.75 | $2.92 | 60% |

Sources: OpenAI pricing, my measurements.

Easy problems: $0.09. Very hard problems: $1.75. At the "very hard" tier, the cost per correct answer jumps to $2.92 because o3 only gets 60% of them right. You're paying for wrong answers too.
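Cost per correct answer is just average cost divided by accuracy. A minimal sketch using the figures from the table; the function name is mine, not part of any library.

```python
def cost_per_correct(avg_cost_usd: float, accuracy: float) -> float:
    """Average spend needed to obtain one correct answer.

    Every attempt is billed, but only a fraction `accuracy` of attempts
    is correct, so the effective cost is avg_cost / accuracy.
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    return avg_cost_usd / accuracy


# (average cost per problem in USD, accuracy) per tier, from the table above.
tiers = {
    "easy": (0.093, 0.96),
    "medium": (0.32, 0.88),
    "hard": (0.89, 0.79),
    "very hard": (1.75, 0.60),
}

for name, (cost, acc) in tiers.items():
    print(f"{name}: ${cost_per_correct(cost, acc):.3f} per correct answer")
```

The gap between the two cost columns widens as accuracy drops: at 96% the penalty is under half a cent, at 60% it is more than a dollar.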

Comparison with other reasoning models

| Model | Avg cost per hard problem | Accuracy on hard | Cost per correct (hard) |
|-------|--------------------------|-----------------|----------------------|
| o3 | $0.89 | 79% | $1.13 |
| Claude Opus 4 (thinking) | $0.18 | 76% | $0.24 |
| DeepSeek R1 | $0.024 | 74% | $0.032 |
| Gemini 2.5 Pro (thinking) | $0.11 | 72% | $0.15 |
| o3-mini | $0.08 | 68% | $0.12 |

Sources: OpenAI, Anthropic, DeepSeek pricing, my evaluation.

o3 has the highest accuracy on hard problems (79%). But it costs 35x more than DeepSeek R1 per correct answer ($1.13 vs $0.032).

The 3-point accuracy advantage of o3 over Claude Opus 4 thinking (79% vs 76%) costs 4.7x more per correct answer. That premium buys you 3 extra correct answers out of 100, at a cost of $89 vs $18.
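Another way to read that comparison is the marginal price of each additional correct answer. A rough sketch using the table's figures for hard problems; the helper name and the 100-problem batch size are illustrative assumptions.

```python
def marginal_cost_per_extra_correct(cost_a: float, acc_a: float,
                                    cost_b: float, acc_b: float,
                                    n_problems: int = 100) -> float:
    """Extra dollars spent per extra correct answer when choosing model A over B.

    cost_* are average USD per problem, acc_* are accuracies in [0, 1].
    """
    extra_spend = (cost_a - cost_b) * n_problems
    extra_correct = (acc_a - acc_b) * n_problems
    if extra_correct <= 0:
        raise ValueError("model A must be more accurate than model B")
    return extra_spend / extra_correct


# o3 vs Claude Opus 4 (thinking) on hard problems.
price = marginal_cost_per_extra_correct(0.89, 0.79, 0.18, 0.76)
print(f"${price:.2f} per additional correct answer")
```

On these numbers, each of the 3 extra correct answers costs about $23.67, far above o3's own average of $1.13 per correct answer.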

The variance problem

The real issue isn't the average cost. It's the variance.

| Metric | o3 | DeepSeek R1 | Claude Opus 4 (thinking) |
|--------|-----|-------------|-------------------------|
| Avg tokens (hard problem) | 14,880 | 7,800 | 8,400 |
| Std deviation | 11,200 | 2,400 | 3,100 |
| Coefficient of variation | 75% | 31% | 37% |
| Max tokens observed | 52,000 | 18,000 | 22,000 |

o3's coefficient of variation (75%) is more than double DeepSeek R1's (31%). This means o3's cost is roughly twice as unpredictable.
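The coefficient of variation is the standard deviation divided by the mean, which makes spread comparable across models with different average token counts. A small sketch that recomputes the table's row from its own mean and standard deviation figures; the function name is illustrative.

```python
def coefficient_of_variation(mean: float, std_dev: float) -> float:
    """Relative spread: standard deviation as a fraction of the mean."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    return std_dev / mean


# (mean tokens, standard deviation) on hard problems, from the table above.
models = {
    "o3": (14_880, 11_200),
    "DeepSeek R1": (7_800, 2_400),
    "Claude Opus 4 (thinking)": (8_400, 3_100),
}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in models.items()}
for name, value in cv.items():
    print(f"{name}: CV = {value:.0%}")

ratio = cv["o3"] / cv["DeepSeek R1"]
print(f"o3 is {ratio:.1f}x as variable as DeepSeek R1, relative to its mean")
```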

For budgeting purposes, this variance is a nightmare. You can't reliably predict your monthly o3 bill because token consumption per query varies by roughly 100x depending on problem difficulty, from 480 tokens on the cheapest easy problem to 52,000 on the most expensive very hard one.

When o3 makes sense

| Scenario | Use o3? | Why |
|----------|---------|-----|
| Research math at competition level | Yes | Highest accuracy, budget is secondary |
| High-stakes reasoning (legal, medical) | Maybe | Accuracy premium matters, but verify outputs |
| Production app with budget constraints | No | Variance makes budgeting impossible |
| Coding agent tasks | No | Claude Opus 4 is better at coding and cheaper |
| Cost-sensitive batch processing | No | DeepSeek R1 is about 35x cheaper per correct answer |

o3 is the model you use when accuracy on the hardest problems is worth any cost. For everything else, the reasoning model market offers better price-to-performance ratios.

My API bill for this 100-problem experiment: $62.40. That's more than I typically spend in a week. The cost problem isn't theoretical. It's the number on my invoice.

-- dataku
