Pricing Watch | March 20, 2023 | 6 min read

GPT-4 is 15x more expensive than GPT-3.5. Is it 15x better?

GPT-4 costs $0.03/1K input tokens vs $0.002 for GPT-3.5-turbo. That's a 15x price jump. I ran 500 real-world tasks on both and measured quality. The value proposition is... complicated.

GPT-4 launched on March 14 with impressive benchmarks and a price tag to match.

Here are the numbers that matter:

| Model | Input ($/1K tokens) | Output ($/1K tokens) | Context | Ratio vs GPT-3.5 |
|-------|---------------------|----------------------|---------|-------------------|
| GPT-3.5-turbo | $0.002 | $0.002 | 4K | 1x |
| GPT-4 (8K) | $0.030 | $0.060 | 8K | 15x input, 30x output |
| GPT-4 (32K) | $0.060 | $0.120 | 32K | 30x input, 60x output |

Source: OpenAI pricing page, March 2023.

15x more expensive on input. 30x on output. For the 32K context version, it's 60x more expensive on output tokens.
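Here's a minimal sketch of that ratio arithmetic, using the prices from the table. The dictionary keys are just labels for this example, not official API model identifiers.

```python
# Prices in USD per 1K tokens, copied from the pricing table (March 2023).
# The keys are illustrative labels, not official API model names.
PRICES = {
    "gpt-3.5-turbo": {"input": 0.002, "output": 0.002},
    "gpt-4-8k": {"input": 0.030, "output": 0.060},
    "gpt-4-32k": {"input": 0.060, "output": 0.120},
}


def price_ratio(model: str, kind: str, baseline: str = "gpt-3.5-turbo") -> float:
    """Return how many times more `model` costs than `baseline` for `kind` tokens."""
    return PRICES[model][kind] / PRICES[baseline][kind]


for model in ("gpt-4-8k", "gpt-4-32k"):
    for kind in ("input", "output"):
        print(f"{model} {kind}: {price_ratio(model, kind):.0f}x")
```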

The question I wanted to answer: is GPT-4 actually 15x better? 10x better? 5x better? Is there a number you can put on the quality improvement relative to cost?

My test setup

I collected 500 real-world prompts from my own API usage over the past two months. No synthetic benchmarks. These are prompts I actually send to language models for work:

| Task category | # prompts | Description |
|--------------|-----------|-------------|
| Email drafting | 75 | Writing professional emails from bullet points |
| Code review | 100 | Reviewing Python/JavaScript code for bugs |
| Summarization | 80 | Summarizing articles to 2-3 paragraphs |
| Data analysis | 60 | Interpreting CSV data and answering questions |
| Creative writing | 50 | Blog intros, product descriptions, taglines |
| Q&A (factual) | 85 | Answering specific factual questions |
| Reasoning | 50 | Logic puzzles, word problems, multi-step tasks |

I ran each prompt through both GPT-3.5-turbo and GPT-4, then rated the outputs on a 1-5 quality scale (blind, randomized order).
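The post doesn't spell out how the blinding was done, so the following is only one way to set it up: a sketch that hides which model wrote which output behind "A"/"B" labels and shuffles both the label assignment and the prompt order. The function names and data shapes are hypothetical.

```python
import random


def make_blind_pairs(outputs, seed=0):
    """Build a rating sheet that hides model identity.

    outputs: list of (prompt_id, gpt35_text, gpt4_text) tuples.
    Returns (sheet, key): the sheet shows only labels "A" and "B";
    the key records which model sits behind each label.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for prompt_id, out_35, out_4 in outputs:
        labelled = [("gpt-3.5-turbo", out_35), ("gpt-4", out_4)]
        rng.shuffle(labelled)  # randomize which model appears as "A"
        sheet.append({"prompt_id": prompt_id, "A": labelled[0][1], "B": labelled[1][1]})
        key[prompt_id] = {"A": labelled[0][0], "B": labelled[1][0]}
    rng.shuffle(sheet)  # randomize the order prompts are rated in
    return sheet, key


def unblind(ratings, key):
    """ratings: {prompt_id: {"A": score, "B": score}} -> {model: [scores]}."""
    scores = {"gpt-3.5-turbo": [], "gpt-4": []}
    for prompt_id, by_label in ratings.items():
        for label, score in by_label.items():
            scores[key[prompt_id][label]].append(score)
    return scores
```

The key stays out of sight until every output has a 1-5 score, and only then are scores mapped back to models with `unblind`.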

The results

| Task category | GPT-3.5 avg score | GPT-4 avg score | Improvement | Worth 15x? |
|--------------|-------------------|-----------------|-------------|------------|
| Email drafting | 3.8 | 4.1 | +7.9% | No |
| Code review | 3.2 | 4.4 | +37.5% | Probably |
| Summarization | 3.9 | 4.2 | +7.7% | No |
| Data analysis | 3.1 | 4.3 | +38.7% | Yes |
| Creative writing | 3.7 | 4.0 | +8.1% | No |
| Q&A (factual) | 3.6 | 4.1 | +13.9% | Maybe |
| Reasoning | 2.8 | 4.2 | +50.0% | Yes |
| Overall | 3.5 | 4.2 | +20.0% | Depends |

This is where it gets interesting.

GPT-4 is 20% better on average across all my tasks. But it costs 15x more (on input tokens) to 30x more (on output tokens). A 20% quality improvement for a 15-30x cost increase is a bad trade if you're optimizing for cost efficiency.
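For anyone who wants to reproduce the percentages, here's a small sketch. It assumes the "Overall" row is an average weighted by prompt count, which the post doesn't state explicitly; under that assumption the GPT-3.5 average comes out just under 3.5 before rounding.

```python
# (prompt count, GPT-3.5 avg score, GPT-4 avg score), copied from the tables above.
RESULTS = {
    "Email drafting": (75, 3.8, 4.1),
    "Code review": (100, 3.2, 4.4),
    "Summarization": (80, 3.9, 4.2),
    "Data analysis": (60, 3.1, 4.3),
    "Creative writing": (50, 3.7, 4.0),
    "Q&A (factual)": (85, 3.6, 4.1),
    "Reasoning": (50, 2.8, 4.2),
}


def improvement_pct(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


def weighted_avg(column: int) -> float:
    """Prompt-count-weighted average of a score column (1 = GPT-3.5, 2 = GPT-4)."""
    total = sum(row[0] for row in RESULTS.values())
    return sum(row[0] * row[column] for row in RESULTS.values()) / total


for name, (_, old, new) in RESULTS.items():
    print(f"{name}: +{improvement_pct(old, new):.1f}%")
print(f"Overall: {weighted_avg(1):.2f} -> {weighted_avg(2):.2f}")
```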

But look at the per-category breakdown. Reasoning improved by 50%. Data analysis by 38.7%. Code review by 37.5%. For those specific tasks, GPT-4 isn't a marginal improvement. It's a qualitative jump from "sometimes wrong" to "usually right."

On email drafting, summarization, and creative writing, the improvement is under 10%. Both models are already good at these tasks. GPT-4 is slightly more polished, slightly better at following detailed instructions. But not 15x-the-price better.

The cost math for a real application

Let me model what this looks like for an actual product. Say you're building a customer support chatbot that handles 10,000 conversations per day, averaging 500 tokens input and 200 tokens output per conversation.

| Model | Input cost/day | Output cost/day | Total/day | Total/month |
|-------|---------------|----------------|-----------|-------------|
| GPT-3.5-turbo | $10.00 | $4.00 | $14.00 | $420 |
| GPT-4 (8K) | $150.00 | $120.00 | $270.00 | $8,100 |

$420/month vs $8,100/month. Customer support wasn't one of my test categories, so I'm using email drafting as the closest proxy. There, GPT-3.5 already scores 3.8/5, and it's very hard to justify paying 19x more for a ~8% quality improvement.
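The cost table can be reproduced with a few lines. This sketch assumes a 30-day month (which is what the $14/day to $420/month figures imply) and the same token counts for every conversation; the function name is made up for the example.

```python
def monthly_cost(conversations_per_day, input_tokens, output_tokens,
                 input_price, output_price, days=30):
    """Monthly API cost in USD; prices are USD per 1K tokens.

    Assumes a 30-day month and identical token counts per conversation.
    """
    daily_input = conversations_per_day * input_tokens / 1000 * input_price
    daily_output = conversations_per_day * output_tokens / 1000 * output_price
    return (daily_input + daily_output) * days


gpt35 = monthly_cost(10_000, 500, 200, input_price=0.002, output_price=0.002)
gpt4 = monthly_cost(10_000, 500, 200, input_price=0.030, output_price=0.060)
print(f"GPT-3.5-turbo: ${gpt35:,.0f}/month")
print(f"GPT-4 (8K):    ${gpt4:,.0f}/month")
print(f"Ratio: {gpt4 / gpt35:.1f}x")
```

The blended ratio lands between 15x and 30x because it depends on the mix of input and output tokens: the more output-heavy the workload, the closer it gets to 30x.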

But say you're building a code analysis tool. Same volume.

| Model | Monthly cost | Quality score | Cost per quality point |
|-------|-------------|--------------|----------------------|
| GPT-3.5-turbo | $420 | 3.2 | $131 per point |
| GPT-4 (8K) | $8,100 | 4.4 | $1,841 per point |

Even for code tasks where GPT-4 shines, the cost per quality point is 14x higher. The absolute quality is better, but the efficiency is worse.
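"Cost per quality point" here is simply monthly cost divided by the average 1-5 score. A quick sketch of that division, using the code review scores and the monthly costs from the chatbot scenario:

```python
def cost_per_quality_point(monthly_cost: float, quality_score: float) -> float:
    """Monthly cost in USD divided by the average 1-5 quality score."""
    return monthly_cost / quality_score


gpt35_cpp = cost_per_quality_point(420, 3.2)
gpt4_cpp = cost_per_quality_point(8100, 4.4)
print(f"GPT-3.5-turbo: ${gpt35_cpp:,.0f} per point")
print(f"GPT-4 (8K):    ${gpt4_cpp:,.0f} per point")
print(f"Ratio: {gpt4_cpp / gpt35_cpp:.0f}x")
```

This is a crude metric: it treats the 1-5 scale as linear, so a point gained at the low end counts the same as a point gained near the top.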

Where GPT-4 is actually worth it

Based on my testing, GPT-4 passes the "worth 15x the price" bar in exactly two scenarios:

1. Tasks where correctness is binary.

If you need a correct answer to a math problem, a code snippet that compiles, or a factual claim that's true, GPT-3.5's lower accuracy means you need human review on more outputs. The cost of that human review can exceed the GPT-4 premium.

| Metric | GPT-3.5 | GPT-4 |
|---|---------|-------|
| Code compilation rate (my tests) | 67% | 89% |
| Factual accuracy on verifiable claims | 74% | 91% |
| Logic puzzle correct answers | 52% | 84% |

If each incorrect output costs you $0.50 in human review time, GPT-4 saves money on high-accuracy tasks despite the API premium.
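To make the review-cost argument concrete, here's a rough sketch of expected cost per request. It rests on assumptions that aren't in the test data: the token counts are borrowed from the chatbot scenario (500 in, 200 out), every incorrect output gets reviewed, and spotting the incorrect ones costs nothing.

```python
REVIEW_COST = 0.50  # USD per incorrect output, the figure used in the post

# Per-request API cost at 500 input + 200 output tokens (assumed, see above).
API_COST = {
    "gpt-3.5-turbo": 0.5 * 0.002 + 0.2 * 0.002,
    "gpt-4": 0.5 * 0.030 + 0.2 * 0.060,
}
# Code compilation rates from the table above.
SUCCESS_RATE = {"gpt-3.5-turbo": 0.67, "gpt-4": 0.89}


def expected_cost(model: str, review_cost: float = REVIEW_COST) -> float:
    """API cost plus the expected human review cost for one request."""
    return API_COST[model] + (1 - SUCCESS_RATE[model]) * review_cost


def break_even_review_cost() -> float:
    """Review cost per failure above which GPT-4 is cheaper overall."""
    premium = API_COST["gpt-4"] - API_COST["gpt-3.5-turbo"]
    fewer_failures = SUCCESS_RATE["gpt-4"] - SUCCESS_RATE["gpt-3.5-turbo"]
    return premium / fewer_failures


for model in API_COST:
    print(f"{model}: ${expected_cost(model):.4f} per request")
print(f"Break-even review cost: ${break_even_review_cost():.3f}")
```

Under these assumptions the break-even sits at roughly 12 cents of review time per failed output, well below the $0.50 figure, which is why GPT-4 comes out ahead on the compilation numbers.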

2. Tasks where you can't send it twice.

For one-shot applications (generating a report, writing a complex analysis, answering a question for a user in real time), you don't get to retry cheaply. The expected value of GPT-4's single-attempt quality is higher than GPT-3.5's "run it three times and pick the best."

My recommendation

Use GPT-3.5-turbo for 80% of your tasks. Route to GPT-4 for the 20% where correctness is critical or reasoning depth matters.

A routing approach where you classify prompts by difficulty and send easy ones to GPT-3.5 and hard ones to GPT-4 could capture 90% of GPT-4's quality benefit at 40% of the full GPT-4 cost. I haven't built this yet, but the math works out. Someone will build this as a product soon.
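As a rough illustration of the routing math, here's a sketch that routes whole task categories (not individual prompts) to GPT-4 and reports the blended cost and the share of the quality gain captured. The assumptions are heavy: traffic matches the mix of my 500 test prompts, every prompt costs the same as in the chatbot scenario, and the classifier is free and never wrong. The function name is hypothetical.

```python
# (prompt count, GPT-3.5 avg score, GPT-4 avg score), copied from the results table.
RESULTS = {
    "Email drafting": (75, 3.8, 4.1),
    "Code review": (100, 3.2, 4.4),
    "Summarization": (80, 3.9, 4.2),
    "Data analysis": (60, 3.1, 4.3),
    "Creative writing": (50, 3.7, 4.0),
    "Q&A (factual)": (85, 3.6, 4.1),
    "Reasoning": (50, 2.8, 4.2),
}
# Monthly cost if ALL traffic went to one model (chatbot scenario above).
MONTHLY_COST = {"gpt-3.5-turbo": 420.0, "gpt-4": 8100.0}


def route_by_category(gpt4_categories):
    """Return (share of prompts sent to GPT-4,
               blended cost as a fraction of all-GPT-4 cost,
               fraction of the all-GPT-4 quality gain captured)."""
    total = sum(n for n, _, _ in RESULTS.values())
    gpt4_share = sum(RESULTS[c][0] for c in gpt4_categories) / total
    blended = (gpt4_share * MONTHLY_COST["gpt-4"]
               + (1 - gpt4_share) * MONTHLY_COST["gpt-3.5-turbo"])
    full_gain = sum(n * (g4 - g35) for n, g35, g4 in RESULTS.values())
    captured = sum(RESULTS[c][0] * (RESULTS[c][2] - RESULTS[c][1])
                   for c in gpt4_categories)
    return gpt4_share, blended / MONTHLY_COST["gpt-4"], captured / full_gain


share, cost_frac, gain_frac = route_by_category(
    ["Code review", "Data analysis", "Reasoning"]
)
print(f"{share:.0%} of prompts to GPT-4, "
      f"{cost_frac:.0%} of full GPT-4 cost, "
      f"{gain_frac:.0%} of the quality gain")
```

Routing the three biggest-gain categories sends 42% of prompts to GPT-4 and, under these assumptions, captures roughly 72% of the gain for about 45% of the cost. Per-category routing is coarser than classifying each prompt by difficulty, so treat this as a rough illustration and not as a check on the 90%/40% estimate.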

The 15x premium isn't worth it across the board. But for the right tasks, it's a bargain.



-- dataku
