Wait, GPT-3 costs HOW much per token?
I spent a Saturday calculating the actual per-word cost of GPT-3's different engines. The price difference between Davinci and Ada is wild, and most people are using the wrong one.
I spent my entire Saturday doing something that might qualify as unhealthy.
I pulled every invoice from my OpenAI API account, matched each one to the engine I used, counted the tokens, and built a spreadsheet comparing the actual per-word cost of GPT-3's four engines. My partner asked what I was doing. I said "important research." She didn't believe me.
But here's the thing. The pricing gap between GPT-3's engines is genuinely wild, and I think most developers using the API are leaving money on the table because they default to Davinci for everything.
The four engines, priced out
OpenAI offers four GPT-3 engines, each a different size with different capabilities. Here's what they cost as of January 2021, per the OpenAI pricing page:
| Engine | Parameters | Price per 1K tokens | Relative cost |
|--------|-----------|---------------------|---------------|
| Davinci | 175B | $0.0600 | 1x (baseline) |
| Curie | 6.7B | $0.0060 | 10x cheaper |
| Babbage | 1.3B | $0.0012 | 50x cheaper |
| Ada | 350M | $0.0008 | 75x cheaper |
Read that last column again. Ada is 75 times cheaper than Davinci.
A rough token-to-word ratio is about 1.3 tokens per English word (this varies, but it's close enough for cost estimation). So 1,000 words is roughly 1,300 tokens, and generating that with Davinci costs about $0.078. With Ada? About $0.001.
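If you want to redo that arithmetic for your own word counts, here's a minimal sketch. It assumes the January 2021 prices from the table and the rough 1.3 tokens-per-word ratio; the function name `cost_for_words` is just something I made up for illustration.

```python
# Prices per 1K tokens, copied from the table above (January 2021).
PRICE_PER_1K_TOKENS = {
    "davinci": 0.0600,
    "curie": 0.0060,
    "babbage": 0.0012,
    "ada": 0.0008,
}

# Rough average for English text. This is an estimate, not a constant.
TOKENS_PER_WORD = 1.3


def cost_for_words(engine: str, words: int) -> float:
    """Estimated dollar cost of processing `words` English words."""
    tokens = words * TOKENS_PER_WORD
    return tokens / 1000 * PRICE_PER_1K_TOKENS[engine]


# Print the estimated cost of 1,000 words on each engine.
for engine in PRICE_PER_1K_TOKENS:
    print(f"{engine:8s} ${cost_for_words(engine, 1000):.5f} per 1,000 words")
```

Swap in your own word count, or change `TOKENS_PER_WORD` if your text is heavy on jargon or non-English content.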
So why does everyone use Davinci?
I asked around in a few Discord servers and checked the OpenAI developer forums. The answer is simple: the examples in OpenAI's docs default to Davinci. The playground defaults to Davinci. When you copy-paste the quickstart code, it uses Davinci.
And look, Davinci IS the best model. It handles complex reasoning, subtle writing, and multi-step logic better than the others. That's not the debate.
The debate is whether you need that for every task.
Where the smaller engines actually win
I ran 200 test prompts across all four engines and scored the outputs on a 1-5 scale for "task completion." Here's what I found:
| Task type | Davinci (175B) | Curie (6.7B) | Babbage (1.3B) | Ada (350M) |
|-----------|---------------|--------------|----------------|------------|
| Classification (sentiment, topic) | 4.8 | 4.6 | 4.1 | 3.9 |
| Simple Q&A (factual) | 4.7 | 4.3 | 3.2 | 2.8 |
| Summarization | 4.6 | 4.2 | 3.0 | 2.1 |
| Creative writing | 4.9 | 3.5 | 2.1 | 1.4 |
| Code generation | 4.5 | 3.1 | 1.8 | 1.2 |
| Text parsing/extraction | 4.3 | 4.1 | 3.8 | 3.6 |
Look at classification. Curie scores 4.6 out of 5 on sentiment analysis and topic classification, against Davinci's 4.8. That's about 96% of Davinci's score at 10% of the cost. For text parsing and extraction, even Ada holds up at 3.6 against Davinci's 4.3.
If you're building a product that classifies customer feedback into categories? Curie. If you're parsing structured data from messy text? Babbage might be enough. If you're writing poetry or generating code, yes, pay for Davinci.
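One way to turn those scores into a routing rule is to pick the cheapest engine that stays within some fraction of Davinci's score for the task. This is a sketch of that idea, not the method I used for the table: the `min_ratio` threshold and the `cheapest_engine` helper are illustrative assumptions, and the scores are just my 200-prompt numbers from above.

```python
# Prices per 1K tokens (January 2021), from the pricing table.
PRICE_PER_1K_TOKENS = {
    "ada": 0.0008,
    "babbage": 0.0012,
    "curie": 0.0060,
    "davinci": 0.0600,
}

# Task-completion scores (1-5 scale) from the benchmark table above.
SCORES = {
    "classification":   {"davinci": 4.8, "curie": 4.6, "babbage": 4.1, "ada": 3.9},
    "simple_qa":        {"davinci": 4.7, "curie": 4.3, "babbage": 3.2, "ada": 2.8},
    "summarization":    {"davinci": 4.6, "curie": 4.2, "babbage": 3.0, "ada": 2.1},
    "creative_writing": {"davinci": 4.9, "curie": 3.5, "babbage": 2.1, "ada": 1.4},
    "code_generation":  {"davinci": 4.5, "curie": 3.1, "babbage": 1.8, "ada": 1.2},
    "parsing":          {"davinci": 4.3, "curie": 4.1, "babbage": 3.8, "ada": 3.6},
}


def cheapest_engine(task: str, min_ratio: float = 0.95) -> str:
    """Cheapest engine scoring at least `min_ratio` of Davinci's score.

    `min_ratio` is an arbitrary quality bar chosen for illustration.
    """
    scores = SCORES[task]
    baseline = scores["davinci"]
    # Walk the engines from cheapest to most expensive.
    for engine in sorted(PRICE_PER_1K_TOKENS, key=PRICE_PER_1K_TOKENS.get):
        if scores[engine] / baseline >= min_ratio:
            return engine
    return "davinci"


for task in SCORES:
    print(f"{task:18s} -> {cheapest_engine(task)}")
```

With the default 0.95 bar, classification and parsing land on Curie and everything else stays on Davinci. Loosen the bar to 0.85 and both drop to Babbage, so how much quality you're willing to trade is the real decision here.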
My actual monthly cost difference
I went back through my December usage. I'd been using Davinci for everything like a fool. Here's the before-and-after calculation:
| | All Davinci | Optimized engine selection |
|---|-----------|--------------------------|
| Classification tasks (40% of calls) | $48.00 | $4.80 (Curie) |
| Parsing tasks (25% of calls) | $30.00 | $0.60 (Babbage) |
| Creative/complex (35% of calls) | $42.00 | $42.00 (Davinci) |
| Monthly total | $120.00 | $47.40 |
That's a 60.5% cost reduction. For a side project. For a startup processing millions of API calls, the savings would be in the thousands of dollars.
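The before-and-after numbers are easy to reproduce. This sketch assumes the token volume per task stays the same when you switch engines, so each line item simply scales by the price ratio; the variable and function names are mine, for illustration only.

```python
# Prices per 1K tokens (January 2021), from the pricing table.
PRICE_PER_1K_TOKENS = {
    "davinci": 0.0600,
    "curie": 0.0060,
    "babbage": 0.0012,
    "ada": 0.0008,
}

# December spend when everything ran on Davinci, by task type.
davinci_spend = {"classification": 48.00, "parsing": 30.00, "creative": 42.00}

# Engine chosen for each task type after the switch.
engine_for_task = {
    "classification": "curie",
    "parsing": "babbage",
    "creative": "davinci",
}


def reprice(amount: float, engine: str) -> float:
    """Scale a Davinci bill to another engine, assuming identical token usage."""
    return amount * PRICE_PER_1K_TOKENS[engine] / PRICE_PER_1K_TOKENS["davinci"]


optimized_spend = {
    task: reprice(amount, engine_for_task[task])
    for task, amount in davinci_spend.items()
}

before = sum(davinci_spend.values())
after = sum(optimized_spend.values())
reduction = (before - after) / before

print(f"before: ${before:.2f}  after: ${after:.2f}  saved: {reduction:.1%}")
```

Plug in your own spend per task type and engine choices to see what your bill would look like.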
The kaizen of token counting
One more thing I found while deep in the spreadsheet rabbit hole (my ikigai, apparently): token counting is not intuitive. "Hello world" is 2 tokens. "Unconstitutional" is 4 tokens. A single Japanese character can be 2-3 tokens.
This matters because you pay per token, not per word. If your prompts contain lots of uncommon words, technical jargon, or non-English text, your actual costs will be higher than naive word-count estimates.
I built a tiny Python script that uses OpenAI's tokenizer to pre-calculate costs before sending prompts. It's saved me from a few surprise bills. The OpenAI API documentation has a tokenizer tool buried in there if you want to check your own text.
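For a feel of what a pre-flight cost check can look like, here's a minimal sketch. It is not my actual script: the function names are hypothetical, and the default token counter is just the rough 1.3 tokens-per-word heuristic. For real numbers you'd pass in a proper tokenizer (GPT-3 uses the same byte-pair encoding as GPT-2, so a GPT-2 tokenizer is one option). The estimate counts prompt tokens plus the maximum completion length, since both are billed.

```python
import math
from typing import Callable

# Prices per 1K tokens (January 2021), from the pricing table.
PRICE_PER_1K_TOKENS = {
    "davinci": 0.0600,
    "curie": 0.0060,
    "babbage": 0.0012,
    "ada": 0.0008,
}


def rough_token_count(text: str) -> int:
    """Fallback estimate: about 1.3 tokens per whitespace-separated word.

    This undercounts jargon and non-English text. Pass a real tokenizer
    to estimate_cost() when accuracy matters.
    """
    return math.ceil(len(text.split()) * 1.3)


def estimate_cost(
    prompt: str,
    engine: str,
    max_output_tokens: int = 0,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> float:
    """Upper-bound dollar cost of one request: prompt plus max completion."""
    tokens = count_tokens(prompt) + max_output_tokens
    return tokens / 1000 * PRICE_PER_1K_TOKENS[engine]


prompt = "Classify the sentiment of this review: the battery died in a day."
for engine in PRICE_PER_1K_TOKENS:
    cost = estimate_cost(prompt, engine, max_output_tokens=16)
    print(f"{engine:8s} ${cost:.6f}")
```

Keeping the tokenizer as a plug-in argument means the same check works whether you trust the heuristic or want exact counts.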
The bottom line, in data
75x price difference between the cheapest and most expensive GPT-3 engine. Most classification tasks work fine on Curie. Most developers don't know this because the defaults point to Davinci.
Check your use case. Run a quick benchmark. You might be paying 10x more than you need to.
I spent a Saturday finding this out so you don't have to. You're welcome.
-- dataku