Codex and the cost of code generation: my first pricing analysis
OpenAI's Codex is in private beta and I got access. I ran 500 code generation requests and tracked the token costs. Generating a Python function costs about $0.01 at Davinci rates.
I got Codex access two weeks ago and immediately did what any reasonable person would do: I ran 500 code generation requests and built a spreadsheet tracking the token costs of every single one.
My friends think I have a problem. They might be right. But the data is here now, and it's interesting.
What Codex actually costs
Codex is in private beta, and OpenAI is offering it for free during the beta period. But it runs on the same token-based system as GPT-3, and the model itself is GPT-3 fine-tuned on code. So we can estimate what it'll cost when pricing kicks in.
I tracked input tokens (the prompt/context) and output tokens (the generated code) for each request:
| Task type | Avg input tokens | Avg output tokens | Total tokens | Est. cost (Davinci rate) |
|-----------|------------------|-------------------|--------------|--------------------------|
| Simple function (1 docstring) | 45 | 62 | 107 | $0.006 |
| Function with context (imports + description) | 120 | 85 | 205 | $0.012 |
| Bug fix (code + error message) | 180 | 94 | 274 | $0.016 |
| Multi-function generation | 95 | 210 | 305 | $0.018 |
| Code completion (partial function) | 65 | 48 | 113 | $0.007 |
At Davinci rates ($0.06/1K tokens), the average code generation request costs about $0.012. But here's the thing: OpenAI's pricing page already hints that Codex may get its own pricing tier, likely cheaper than Davinci since it's specialized.
If they price it at Curie rates ($0.006/1K tokens), the average drops to about $0.001 per request. A tenth of a cent to generate a function. That's wild.
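For anyone who wants to redo the arithmetic, here is a minimal sketch of the estimate. It assumes input and output tokens are billed at the same per-1K rate and takes a plain unweighted mean across the five task types (the per-type request counts aren't listed here, so a weighted average could differ slightly). The function and variable names are mine, and the Curie rate is only a stand-in for a possible Codex price.

```python
# Average token counts per task type, copied from the table above.
# Each value is (avg input tokens, avg output tokens).
TASKS = {
    "Simple function": (45, 62),
    "Function with context": (120, 85),
    "Bug fix": (180, 94),
    "Multi-function generation": (95, 210),
    "Code completion": (65, 48),
}

DAVINCI_RATE = 0.06   # USD per 1K tokens
CURIE_RATE = 0.006    # USD per 1K tokens (assumed stand-in for a Codex tier)


def request_cost(input_tokens, output_tokens, rate_per_1k):
    """Cost of one request, billing input and output at the same rate."""
    return (input_tokens + output_tokens) * rate_per_1k / 1000


def average_cost(tasks, rate_per_1k):
    """Unweighted mean cost across task types."""
    costs = [request_cost(i, o, rate_per_1k) for i, o in tasks.values()]
    return sum(costs) / len(costs)


for name, (inp, out) in TASKS.items():
    print(f"{name}: ${request_cost(inp, out, DAVINCI_RATE):.3f}")

print(f"Average at Davinci rate: ${average_cost(TASKS, DAVINCI_RATE):.3f}")
print(f"Average at Curie rate:   ${average_cost(TASKS, CURIE_RATE):.4f}")
```

Swapping in a different rate is a one-line change, which is handy once real Codex pricing exists.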
The 500-request breakdown
I organized my 500 requests by programming language:
| Language | Requests | Avg quality (1-5) | Avg output tokens | Success rate |
|----------|----------|-------------------|-------------------|--------------|
| Python | 200 | 4.1 | 78 | 87% |
| JavaScript | 120 | 3.8 | 82 | 81% |
| TypeScript | 60 | 3.5 | 91 | 74% |
| SQL | 50 | 3.9 | 54 | 83% |
| Bash | 40 | 3.2 | 41 | 72% |
| Go | 30 | 2.9 | 96 | 61% |
"Success rate" means the code ran without errors on the first try (for Python, JS, and TS, I actually executed it). Python is clearly the strongest language, which tracks with the training data. Go is the weakest in my sample, though 30 requests is a tiny sample.
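To put a rough number on how much the sample size matters, here is a small sketch that computes an approximate 95% margin of error for each success rate. This is an addition for illustration, not part of the original spreadsheet. It uses the simple normal approximation for a proportion, which is crude at n=30 (a Wilson interval would be a better choice for a serious analysis), and the names are made up for this example.

```python
import math

# (number of requests, success rate) per language, from the table above.
SAMPLES = {
    "Python": (200, 0.87),
    "JavaScript": (120, 0.81),
    "TypeScript": (60, 0.74),
    "SQL": (50, 0.83),
    "Bash": (40, 0.72),
    "Go": (30, 0.61),
}


def margin_of_error(n, p, z=1.96):
    """Approximate 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


for lang, (n, p) in SAMPLES.items():
    moe = margin_of_error(n, p)
    print(f"{lang}: {p:.0%} +/- {moe:.0%} (n={n})")
```

Under this approximation the Go figure carries a margin of about 17 percentage points, compared with about 5 for Python, so the language ranking at the bottom of the table should be read loosely.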
Quality scoring was my subjective judgment: 5 means "I'd use this as-is," 3 means "right idea but needs edits," 1 means "this is wrong."
Cost per useful output
Here's the metric I care about most. Not cost per request, but cost per usable code output:
| Language | Cost per request | Success rate | Cost per useful output |
|----------|------------------|--------------|------------------------|
| Python | $0.010 | 87% | $0.011 |
| JavaScript | $0.011 | 81% | $0.014 |
| SQL | $0.008 | 83% | $0.010 |
| Bash | $0.007 | 72% | $0.010 |
For Python, you pay about 1.1 cents per usable function. That's really cheap. For comparison, the GitHub Copilot announcement mentioned they'd be pricing the service at roughly $10/month. If a developer uses it 100 times a day, that's about 3,000 suggestions a month, or roughly $0.003 per suggestion. So Copilot's effective price per suggestion is likely below Davinci rates, which is consistent with (though it doesn't prove) my guess about a cheaper Codex tier.
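The metric itself is just cost per request divided by success rate, on the assumption that a failed request is thrown away and retried at the same price. A minimal sketch, with helper names invented for this example and a 30-day month assumed for the flat-rate comparison:

```python
# (cost per request in USD, success rate) per language, from the table above.
ROWS = {
    "Python": (0.010, 0.87),
    "JavaScript": (0.011, 0.81),
    "SQL": (0.008, 0.83),
    "Bash": (0.007, 0.72),
}


def cost_per_useful_output(cost_per_request, success_rate):
    """Expected spend per usable result if every failure is simply retried."""
    return cost_per_request / success_rate


def flat_rate_cost_per_use(monthly_price, uses_per_day, days_per_month=30):
    """Effective per-use price of a flat monthly subscription."""
    return monthly_price / (uses_per_day * days_per_month)


for lang, (cost, rate) in ROWS.items():
    print(f"{lang}: ${cost_per_useful_output(cost, rate):.3f} per useful output")

# A $10/month subscription used 100 times a day (both figures are assumptions).
print(f"Flat rate: ${flat_rate_cost_per_use(10, 100):.4f} per use")
```

The flat-rate number is very sensitive to the usage assumption: at 10 uses a day instead of 100, the effective price per use is ten times higher.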
What surprised me
The thing I didn't expect: prompt engineering matters enormously for code generation. A vague docstring like "sort the list" produces mediocre code. But "sort a list of dictionaries by the 'date' key in descending order, handling missing keys by placing them last" produces almost perfect code.
I tested this explicitly with 50 paired requests (vague prompt vs detailed prompt, same task):
| Prompt specificity | Avg quality | Avg output tokens | Success rate |
|--------------------|-------------|-------------------|--------------|
| Vague (1 line) | 2.8 | 55 | 58% |
| Detailed (2-3 lines) | 4.2 | 72 | 89% |
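The detailed prompts cost about 30% more in output tokens (72 vs 55) but produce usable code 53% more often in relative terms (89% vs 58% success). Because the success rate rises faster than the token count, the cost per useful output is actually lower with longer, more specific prompts. More tokens in, fewer wasted generations.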
This feels like an important principle for code generation economics: investing tokens in prompt quality has better ROI than generating more attempts with cheap prompts.
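Here is a rough sketch of that trade-off using the paired-test numbers. Two assumptions are baked in: output tokens stand in for total cost (input token counts for the paired test aren't listed in the table), and retries are independent, so the expected number of attempts per success is 1 / success rate. The names are illustrative only.

```python
# Paired-test averages from the table above.
VAGUE = {"output_tokens": 55, "success_rate": 0.58}
DETAILED = {"output_tokens": 72, "success_rate": 0.89}


def expected_attempts(success_rate):
    """Mean number of tries until the first success, assuming independent tries."""
    return 1 / success_rate


def tokens_per_useful_output(prompt_stats):
    """Expected output tokens spent to get one working result."""
    return prompt_stats["output_tokens"] * expected_attempts(
        prompt_stats["success_rate"]
    )


vague_cost = tokens_per_useful_output(VAGUE)
detailed_cost = tokens_per_useful_output(DETAILED)

print(f"Vague:    {vague_cost:.1f} tokens per useful output")
print(f"Detailed: {detailed_cost:.1f} tokens per useful output")
print(f"Detailed / vague ratio: {detailed_cost / vague_cost:.2f}")
```

Under these assumptions the detailed prompts come out around 15% cheaper per working result. Counting the extra input tokens in the longer prompts would narrow that gap somewhat, so treat the ratio as an upper bound on the savings.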
A small prediction
Code generation is going to get very cheap very fast. Right now, at Davinci rates, it costs about a penny per useful Python function. When OpenAI releases a dedicated Codex pricing tier (and they will), I expect that to drop to a fraction of a cent.
At that point, the question isn't "can you afford to use AI for coding?" It's "can you afford not to?"
I'll track the pricing when it launches. The spreadsheet is ready. It's always ready.
-- dataku