Pricing Watch · March 3, 2025 · 5 min read

GPT-4.5 is the most expensive model ever released. Is it worth it?

$75 per million input tokens. That's 500x more than GPT-4o mini. I ran GPT-4.5 through my evaluation suite. It's good. Really good. But at this price, it only makes economic sense for a very narrow set of tasks.

$75 per million input tokens. $150 per million output tokens.

OpenAI released GPT-4.5 as a "research preview" and the pricing made my spreadsheet throw an error. I thought I had a typo.

I didn't.

The pricing in context

| Model | Input/M tokens | Output/M tokens | Cost for 1,000 queries (avg 500 tokens out) |
|-------|---------------|-----------------|-------------------------------------------|
| GPT-4o mini | $0.15 | $0.60 | $0.38 |
| Gemini 2.0 Flash | $0.10 | $0.40 | $0.25 |
| Claude 3.7 Sonnet | $3.00 | $15.00 | $8.50 |
| GPT-4o | $2.50 | $10.00 | $6.25 |
| Claude 3 Opus | $15.00 | $75.00 | $41.25 |
| GPT-4.5 | $75.00 | $150.00 | $112.50 |

Sources: OpenAI, Anthropic, Google pricing pages.

GPT-4.5 costs 500x more per input token than GPT-4o mini. Per input token it also costs 5x more than Claude 3 Opus (2x more on output), which was previously the most expensive major API model.

For 1,000 queries at typical length, you're looking at $112.50 on GPT-4.5 vs $0.38 on GPT-4o mini. A 296x cost difference.
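The per-1,000-query figures are simple enough to reproduce. Here is a minimal sketch, assuming 500 input tokens per query alongside the 500 output tokens in the table header. The table states only the output length, so the input length is an assumption, picked because it reproduces the GPT-4o mini and GPT-4.5 rows. `cost_for_queries` is a hypothetical helper name.

```python
# Prices in USD per million tokens, copied from the pricing table above.
PRICES = {
    "GPT-4o mini": {"input": 0.15, "output": 0.60},
    "GPT-4.5": {"input": 75.00, "output": 150.00},
}


def cost_for_queries(model, queries=1000, tokens_in=500, tokens_out=500):
    """USD cost of `queries` calls; tokens_in=500 is an assumed input length."""
    price = PRICES[model]
    per_query = (
        tokens_in * price["input"] + tokens_out * price["output"]
    ) / 1_000_000
    return queries * per_query


mini = cost_for_queries("GPT-4o mini")
gpt45 = cost_for_queries("GPT-4.5")

print(f"GPT-4o mini: ${mini:.3f}")
print(f"GPT-4.5:     ${gpt45:.2f}")
print(f"ratio:       {gpt45 / mini:.0f}x")
```

Without rounding, the ratio comes out at 300x. The 296x figure comes from dividing by the rounded $0.38.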

The benchmarks

Is it worth it? Let me show you the quality data.

| Benchmark | GPT-4.5 | Claude 3.7 Sonnet | GPT-4o | Gemini 2.5 Pro |
|-----------|---------|-------------------|--------|---------------|
| MMLU | 90.8% | 89.4% | 88.7% | 89.0% |
| GPQA Diamond | 65.0% | 62.1% | 53.6% | 59.8% |
| MATH (500) | 81.6% | 78.3% | 76.6% | 79.4% |
| HumanEval | 93.1% | 95.2% | 90.2% | 91.8% |
| IFEval | 88.7% | 88.3% | 85.4% | 86.2% |
| SimpleQA | 62.5% | 51.2% | 47.8% | 49.3% |
| Chatbot Arena Elo | 1268 | 1274 | 1261 | 1270 |

Sources: OpenAI GPT-4.5 technical report, Artificial Analysis, LMSYS Chatbot Arena.

GPT-4.5 posts the top score on five of the seven rows: MMLU (90.8%), GPQA Diamond (65.0%), MATH (81.6%), IFEval (88.7%), and SimpleQA (62.5%). Most of those leads are under three points. SimpleQA is the standout: 62.5% is the highest I've seen, beating Claude 3.7 Sonnet by 11 points. OpenAI explicitly optimized GPT-4.5 for factual accuracy, and it shows.

But it loses to Claude 3.7 Sonnet on HumanEval (93.1% vs 95.2%) and Chatbot Arena Elo (1268 vs 1274).

My 300-prompt evaluation

| Category | GPT-4.5 | Claude 3.7 Sonnet | Input price ratio (GPT-4.5 vs Claude) |
|----------|---------|-------------------|-------------------------------|
| Coding | 82% | 88% | 25x more expensive |
| Factual Q&A | 92% | 84% | 25x more expensive |
| Analysis | 86% | 84% | 25x more expensive |
| Creative writing | 78% | 86% | 25x more expensive |
| Instruction following | 84% | 86% | 25x more expensive |
| Math | 78% | 76% | 25x more expensive |

GPT-4.5 wins on factual Q&A (92% vs 84%, a big gap) and slightly leads on analysis (86% vs 84%) and math (78% vs 76%). Claude wins on coding, creative writing, and instruction following.

For 25x the input price (10x on output), GPT-4.5 is better at facts and slightly better at analysis and math. It's worse at coding and writing.
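To see how the six categories net out, here is a quick sketch that averages the scorecard above and lists the categories each model wins. It weights every category equally, which is an assumption: the per-category prompt counts aren't given here.

```python
from statistics import mean

# Category scores (%) from the 300-prompt table above:
# (GPT-4.5, Claude 3.7 Sonnet).
SCORES = {
    "Coding": (82, 88),
    "Factual Q&A": (92, 84),
    "Analysis": (86, 84),
    "Creative writing": (78, 86),
    "Instruction following": (84, 86),
    "Math": (78, 76),
}

# Unweighted means: each category counts the same (an assumption).
gpt45_avg = mean(g for g, _ in SCORES.values())
claude_avg = mean(c for _, c in SCORES.values())

gpt45_wins = [name for name, (g, c) in SCORES.items() if g > c]
claude_wins = [name for name, (g, c) in SCORES.items() if c > g]

print(f"GPT-4.5 average:           {gpt45_avg:.1f}%")
print(f"Claude 3.7 Sonnet average: {claude_avg:.1f}%")
print("GPT-4.5 wins:", ", ".join(gpt45_wins))
print("Claude wins: ", ", ".join(claude_wins))
```

Under equal weighting the two models land within a point of each other, with three category wins apiece.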

The "cost per quality point" analysis

| Model | Avg quality score | Price per query | Cost per quality point |
|-------|------------------|----------------|----------------------|
| GPT-4o mini | 68% | $0.00038 | $0.00056 |
| Gemini 2.0 Flash | 72% | $0.00025 | $0.00035 |
| GPT-4o | 80% | $0.00625 | $0.0078 |
| Claude 3.7 Sonnet | 84% | $0.0085 | $0.010 |
| GPT-4.5 | 83% | $0.1125 | $0.136 |

GPT-4.5's cost per quality point is about 13.6x Claude 3.7 Sonnet's ($0.136 vs $0.010). You're paying roughly 13x as much for each point of quality, and on aggregate you're getting slightly fewer of them (83% vs 84%).
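The "cost per quality point" column is consistent with dividing the price per query by the average score expressed as a fraction (0.84 rather than 84). A minimal sketch under that reading follows; `cost_per_quality_point` is a hypothetical helper name, and the inputs are copied from the table above.

```python
# (average score in %, price per query in USD), copied from the table above.
MODELS = {
    "GPT-4o mini": (68, 0.00038),
    "Gemini 2.0 Flash": (72, 0.00025),
    "GPT-4o": (80, 0.00625),
    "Claude 3.7 Sonnet": (84, 0.0085),
    "GPT-4.5": (83, 0.1125),
}


def cost_per_quality_point(score_pct, price_per_query):
    """Price per query divided by the score as a fraction (84% -> 0.84)."""
    return price_per_query / (score_pct / 100)


costs = {name: cost_per_quality_point(*row) for name, row in MODELS.items()}
for name, cost in costs.items():
    print(f"{name:<18} ${cost:.5f}")

ratio = costs["GPT-4.5"] / costs["Claude 3.7 Sonnet"]
print(f"GPT-4.5 vs Claude 3.7 Sonnet: {ratio:.1f}x")
```

On unrounded values the ratio is about 13.4x. The 13.6x in the text divides the rounded table entries ($0.136 / $0.010).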

Where GPT-4.5 makes sense

The one area: factual accuracy for high-stakes applications.

If you're building a medical information system where getting facts wrong has serious consequences, the 8-point advantage on factual Q&A might justify the price. If you're in legal, financial, or scientific domains where accuracy matters more than cost, GPT-4.5's SimpleQA dominance is meaningful.

For everything else? The math doesn't work.

| Use case | Recommendation |
|----------|---------------|
| High-stakes factual Q&A | GPT-4.5 (if budget allows) |
| Coding | Claude 3.7 Sonnet (better AND cheaper) |
| General chat | Claude 3.7 Sonnet or GPT-4o |
| Cost-sensitive applications | Gemini 2.0 Flash or GPT-4o mini |
| Reasoning/math | DeepSeek R1 or reasoning models |

OpenAI called this a "research preview," which I think is their way of saying "we know the pricing is wild, this isn't for production use." Fair enough. But it does tell us where OpenAI thinks the value is: factual precision at any cost.

My spreadsheet now has a cell that says "$75.00" and it still looks wrong every time I open the file. Some numbers take time to accept.


-- dataku
