The cost per correct answer: a new way to compare models
Raw benchmark scores ignore cost. I calculated "cost per correct answer" across 500 questions for 10 models. The cheapest correct answer comes from Gemini 2.5 Flash at $0.0003. The most expensive is GPT-4.5 at $0.14. A 467x difference.
Benchmark leaderboards rank models by accuracy. Pricing pages rank them by cost. Nobody combines the two.
I think "cost per correct answer" is the metric that matters most for production use. So I calculated it for 10 models across 500 questions.
The methodology
500 questions from 5 categories (100 each): general knowledge, coding, math, analysis, and creative tasks. Each model gets one attempt. I record: correct/incorrect, tokens consumed, cost.
Cost per correct answer = total cost / number of correct answers.
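A minimal sketch of that formula in Python, assuming each attempt is logged as a correct/incorrect flag plus its cost. The `Result` record and the `cost_per_correct` function are hypothetical names for illustration, not the actual evaluation harness:

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One model attempt on one question (hypothetical record layout)."""
    correct: bool
    cost_usd: float


def cost_per_correct(results: list[Result]) -> float:
    """Total spend across all attempts divided by the number of correct answers."""
    total_cost = sum(r.cost_usd for r in results)
    n_correct = sum(r.correct for r in results)
    if n_correct == 0:
        return float("inf")  # no correct answers: the metric is unbounded
    return total_cost / n_correct


# Toy run: 4 questions at $0.001 each, 3 answered correctly.
toy = [
    Result(True, 0.001),
    Result(True, 0.001),
    Result(False, 0.001),
    Result(True, 0.001),
]
print(f"${cost_per_correct(toy):.6f} per correct answer")
```

Wrong answers still cost money, so their spend is charged to the correct ones. That makes the metric equivalent to average cost per query divided by accuracy.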
The results
| Model | Accuracy | Avg cost per query | Cost per correct answer | Rank |
|-------|----------|-------------------|----------------------|------|
| Gemini 2.5 Flash | 78.4% | $0.00024 | $0.00031 | 1st (cheapest) |
| GPT-4o mini | 76.2% | $0.00034 | $0.00045 | 2nd |
| DeepSeek V3 | 82.6% | $0.0012 | $0.0015 | 3rd |
| Qwen3 (via API) | 81.0% | $0.0016 | $0.0020 | 4th |
| Gemini 2.5 Pro | 87.4% | $0.0068 | $0.0078 | 5th |
| GPT-4o | 84.8% | $0.0072 | $0.0085 | 6th |
| Claude 4 Sonnet | 88.2% | $0.0091 | $0.0103 | 7th |
| Claude Opus 4 | 91.6% | $0.062 | $0.068 | 8th |
| Grok 3 | 85.4% | $0.082 | $0.096 | 9th |
| GPT-4.5 | 89.8% | $0.126 | $0.140 | 10th (most expensive) |
Sources: Anthropic, OpenAI, Google, DeepSeek, Artificial Analysis, my 500-question evaluation.
Gemini 2.5 Flash: $0.0003 per correct answer. GPT-4.5: $0.14 per correct answer. A 467x difference.
And Gemini Flash's accuracy (78.4%) is only 11 points below GPT-4.5's (89.8%). You pay 467x more for 11 percentage points of accuracy.
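The headline ratio is easy to recompute from the results table. A quick sketch with the values copied from the table (the variable names are mine):

```python
# Values copied from the results table.
flash_cpc = 0.00031   # Gemini 2.5 Flash, cost per correct answer (USD)
gpt45_cpc = 0.140     # GPT-4.5, cost per correct answer (USD)
flash_acc = 78.4      # accuracy, percent
gpt45_acc = 89.8      # accuracy, percent

ratio_table = gpt45_cpc / flash_cpc   # using the table's precision
ratio_rounded = 0.14 / 0.0003         # using the rounded headline figures
gap_points = gpt45_acc - flash_acc    # accuracy gap in percentage points

print(f"table precision: {ratio_table:.0f}x")
print(f"rounded figures: {ratio_rounded:.0f}x")
print(f"accuracy gap:    {gap_points:.1f} points")
```

With the table's precision the ratio comes out near 452x. The 467x headline follows from rounding Flash's figure down to $0.0003. Either way, the gap is well over 400x for about 11 points of accuracy.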
The cost-accuracy frontier
The interesting question: which models offer the best accuracy for their price tier?
| Price tier | Best model | Accuracy | Cost/correct |
|-----------|-----------|----------|-------------|
| Ultra-cheap (<$0.001/query) | Gemini 2.5 Flash | 78.4% | $0.00031 |
| Cheap ($0.001-0.005) | DeepSeek V3 | 82.6% | $0.0015 |
| Mid-range ($0.005-0.02) | Claude 4 Sonnet | 88.2% | $0.0103 |
| Premium ($0.02+) | Claude Opus 4 | 91.6% | $0.068 |
Each tier represents a jump in quality. The question is whether you need that jump.
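The tier winners can be reproduced from the results table by bucketing models on average cost per query and taking the most accurate model in each bucket. A sketch, assuming each tier includes its lower bound and excludes its upper bound (the tables don't say how boundaries are treated); `RESULTS`, `TIERS`, and `best_per_tier` are illustrative names:

```python
# (model, accuracy %, average cost per query in USD), copied from the results table.
RESULTS = [
    ("Gemini 2.5 Flash", 78.4, 0.00024),
    ("GPT-4o mini", 76.2, 0.00034),
    ("DeepSeek V3", 82.6, 0.0012),
    ("Qwen3 (via API)", 81.0, 0.0016),
    ("Gemini 2.5 Pro", 87.4, 0.0068),
    ("GPT-4o", 84.8, 0.0072),
    ("Claude 4 Sonnet", 88.2, 0.0091),
    ("Claude Opus 4", 91.6, 0.062),
    ("Grok 3", 85.4, 0.082),
    ("GPT-4.5", 89.8, 0.126),
]

# (label, lower bound inclusive, upper bound exclusive); boundary handling is an assumption.
TIERS = [
    ("Ultra-cheap", 0.0, 0.001),
    ("Cheap", 0.001, 0.005),
    ("Mid-range", 0.005, 0.02),
    ("Premium", 0.02, float("inf")),
]


def best_per_tier(results, tiers):
    """Return the most accurate model within each price tier."""
    best = {}
    for label, low, high in tiers:
        in_tier = [r for r in results if low <= r[2] < high]
        if in_tier:
            best[label] = max(in_tier, key=lambda r: r[1])[0]
    return best


for tier, model in best_per_tier(RESULTS, TIERS).items():
    print(f"{tier}: {model}")
```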
Going from Gemini Flash (78%) to DeepSeek V3 (83%) costs about 5x more per correct answer. Going from DeepSeek V3 to Claude 4 Sonnet (88%) costs about 7x more. Going from Sonnet to Opus (92%) costs about 7x more again.
Each step up buys roughly 3 to 6 points of accuracy for 5-7x the cost per correct answer. The curve is remarkably consistent.
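Here is a small sketch that recomputes those step multipliers from the tier table, using the published cost-per-correct figures. The `frontier` list and `steps` variable are names I made up for the example:

```python
# (model, accuracy %, cost per correct answer in USD), copied from the tier table.
frontier = [
    ("Gemini 2.5 Flash", 78.4, 0.00031),
    ("DeepSeek V3", 82.6, 0.0015),
    ("Claude 4 Sonnet", 88.2, 0.0103),
    ("Claude Opus 4", 91.6, 0.068),
]

# Compare each tier winner with the next one up.
steps = []
for (a_name, a_acc, a_cpc), (b_name, b_acc, b_cpc) in zip(frontier, frontier[1:]):
    gain_points = b_acc - a_acc      # accuracy gained, in percentage points
    multiplier = b_cpc / a_cpc       # how many times more each correct answer costs
    steps.append((a_name, b_name, gain_points, multiplier))

for a_name, b_name, gain_points, multiplier in steps:
    print(f"{a_name} -> {b_name}: +{gain_points:.1f} points for {multiplier:.1f}x the cost")
```

The accuracy gains vary from step to step while the cost multiplier stays in a narrow band, which is what makes the curve look so regular.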
By category
| Category | Cheapest correct answer | Most accurate model | Accuracy gap (points) |
|----------|----------------------|-------------------|-------------|
| General knowledge | Gemini Flash ($0.00028) | GPT-4.5 (94%) | 16 |
| Coding | GPT-4o mini ($0.00051) | Claude Opus 4 (95%) | 19 |
| Math | DeepSeek V3 ($0.0018) | Claude Opus 4 (93%) | 14 |
| Analysis | Gemini Flash ($0.00030) | Claude 4 Sonnet (90%) | 12 |
| Creative | Gemini Flash ($0.00032) | Claude 4 Sonnet (91%) | 15 |
Coding has the largest gap between cheapest and best (19 points). This is the one category where paying for premium models shows the clearest quality difference.
For general knowledge and analysis, the cheap models are "good enough" for most applications (78%+ accuracy).
Why this metric matters
Traditional benchmark comparisons say: "Claude Opus 4 is the best model." True. It has 91.6% accuracy.
Cost-per-correct-answer says: "Gemini 2.5 Flash gives you 467x more correct answers per dollar than GPT-4.5, and roughly 220x more than Claude Opus 4." Also true.
| Decision | Use accuracy | Use cost-per-correct |
|----------|-------------|---------------------|
| "Which model is smartest?" | Accuracy ranking | N/A |
| "Which model should I deploy at scale?" | N/A | Cost-per-correct ranking |
| "Where should I invest in quality?" | By category, accuracy | By category, cost-per-correct |
For production systems processing millions of queries, cost-per-correct-answer is the metric that ties most directly to business economics. A model that's 5 points more accurate but 20x more expensive is a bad deal for classification, where a wrong answer costs little. It might be a great deal for surgery, where a wrong answer costs far more than the query.
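One way to make that trade-off concrete is to put a dollar price on a wrong answer and find the point where the premium model pays for itself. This is a rough sketch under a strong assumption that my evaluation did not measure: every wrong answer costs the same fixed amount. The functions and the `error_cost` parameter are hypothetical:

```python
def expected_cost(cost_per_query: float, accuracy: float, error_cost: float) -> float:
    """Expected total cost of one query: model spend plus the expected cost of a wrong answer."""
    return cost_per_query + (1.0 - accuracy) * error_cost


def break_even_error_cost(cheap: tuple[float, float], premium: tuple[float, float]) -> float:
    """Error cost at which both models have the same expected cost per query.

    Each argument is (average cost per query in USD, accuracy as a fraction).
    """
    cheap_cost, cheap_acc = cheap
    premium_cost, premium_acc = premium
    if premium_acc <= cheap_acc:
        raise ValueError("premium model must be more accurate than the cheap one")
    return (premium_cost - cheap_cost) / (premium_acc - cheap_acc)


# Figures from the results table.
flash = (0.00024, 0.784)   # Gemini 2.5 Flash
opus = (0.062, 0.916)      # Claude Opus 4

threshold = break_even_error_cost(flash, opus)
print(f"Opus 4 is the cheaper choice once a wrong answer costs more than ${threshold:.2f}")
```

Below that threshold the cheap model wins on expected cost. Above it, the extra accuracy is worth paying for.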
My spreadsheet now has a "cost per correct answer" column for every model. It changed how I think about model selection. I hope it changes how you think about it too.
-- dataku