The cost per correct answer: a new way to compare models

Benchmark leaderboards rank models by accuracy. Pricing pages rank them by cost. Nobody combines the two.

I think "cost per correct answer" is the metric that matters most for production use. So I calculated it for 10 models across 500 questions.

The methodology

500 questions from 5 categories (100 each): general knowledge, coding, math, analysis, and creative tasks. Each model gets one attempt. I record: correct/incorrect, tokens consumed, cost.

Cost per correct answer = total cost / number of correct answers.
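As a minimal sketch of that formula, assume each attempt is logged as a record with a correctness flag and a dollar cost. The `Attempt` type and its field names are illustrative assumptions, not the actual logging format behind the evaluation.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model attempt at one question (hypothetical record layout)."""
    correct: bool
    cost_usd: float


def cost_per_correct(attempts: list[Attempt]) -> float:
    """Total spend across all attempts divided by the number of correct answers."""
    total_cost = sum(a.cost_usd for a in attempts)
    n_correct = sum(a.correct for a in attempts)
    if n_correct == 0:
        # No correct answers: the ratio is undefined, so treat it as infinitely expensive.
        return float("inf")
    return total_cost / n_correct


# Toy run: 4 attempts at $0.001 each, 3 of them correct.
toy = [
    Attempt(True, 0.001),
    Attempt(True, 0.001),
    Attempt(False, 0.001),
    Attempt(True, 0.001),
]
toy_result = cost_per_correct(toy)
print(f"${toy_result:.5f} per correct answer")
```

Because wrong answers are paid for too, this is the same as average cost per query divided by accuracy, which is how the last column of the results table relates to the two columns before it.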

The results

| Model | Accuracy | Avg cost per query | Cost per correct answer | Rank |
|-------|----------|-------------------|----------------------|------|
| Gemini 2.5 Flash | 78.4% | $0.00024 | $0.00031 | 1st (cheapest) |
| GPT-4o mini | 76.2% | $0.00034 | $0.00045 | 2nd |
| DeepSeek V3 | 82.6% | $0.0012 | $0.0015 | 3rd |
| Qwen3 (via API) | 81.0% | $0.0016 | $0.0020 | 4th |
| Gemini 2.5 Pro | 87.4% | $0.0068 | $0.0078 | 5th |
| GPT-4o | 84.8% | $0.0072 | $0.0085 | 6th |
| Claude 4 Sonnet | 88.2% | $0.0091 | $0.0103 | 7th |
| Claude Opus 4 | 91.6% | $0.062 | $0.068 | 8th |
| Grok 3 | 85.4% | $0.082 | $0.096 | 9th |
| GPT-4.5 | 89.8% | $0.126 | $0.140 | 10th (most expensive) |

Sources: Anthropic, OpenAI, Google, DeepSeek, Artificial Analysis, my 500-question evaluation.

Gemini 2.5 Flash: $0.00031 per correct answer. GPT-4.5: $0.14 per correct answer. That is roughly a 450x difference (467x if you round Flash's figure down to $0.0003).

And Gemini Flash's accuracy (78.4%) is only about 11 points below GPT-4.5's (89.8%). You pay roughly 450x more per correct answer for 11 percentage points of accuracy.
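The ranking and the headline multiple can be re-derived from the accuracy and average-cost columns of the results table. The sketch below copies those two columns and recomputes the rest. The exact multiple shifts by a few percent depending on how much the per-answer costs are rounded, so treat the printed figure as approximate.

```python
# (model, accuracy in %, average cost per query in USD), copied from the results table.
RESULTS = [
    ("Gemini 2.5 Flash", 78.4, 0.00024),
    ("GPT-4o mini", 76.2, 0.00034),
    ("DeepSeek V3", 82.6, 0.0012),
    ("Qwen3 (via API)", 81.0, 0.0016),
    ("Gemini 2.5 Pro", 87.4, 0.0068),
    ("GPT-4o", 84.8, 0.0072),
    ("Claude 4 Sonnet", 88.2, 0.0091),
    ("Claude Opus 4", 91.6, 0.062),
    ("Grok 3", 85.4, 0.082),
    ("GPT-4.5", 89.8, 0.126),
]


def cost_per_correct(accuracy_pct: float, avg_cost: float) -> float:
    """Average cost per query divided by the fraction of queries answered correctly."""
    return avg_cost / (accuracy_pct / 100)


# Rank from cheapest to most expensive correct answer.
ranked = sorted(RESULTS, key=lambda r: cost_per_correct(r[1], r[2]))
for rank, (model, acc, cost) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<18} ${cost_per_correct(acc, cost):.5f}")

cheapest, priciest = ranked[0], ranked[-1]
multiple = cost_per_correct(priciest[1], priciest[2]) / cost_per_correct(cheapest[1], cheapest[2])
gap_points = priciest[1] - cheapest[1]
print(f"{priciest[0]} vs {cheapest[0]}: {multiple:.0f}x the cost, +{gap_points:.1f} points")
```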

The cost-accuracy frontier

The interesting question: which models offer the best accuracy for their price tier?

| Price tier | Best model | Accuracy | Cost/correct |
|-----------|-----------|----------|-------------|
| Ultra-cheap (<$0.001/query) | Gemini 2.5 Flash | 78.4% | $0.00031 |
| Cheap ($0.001-0.005/query) | DeepSeek V3 | 82.6% | $0.0015 |
| Mid-range ($0.005-0.02/query) | Claude 4 Sonnet | 88.2% | $0.0103 |
| Premium ($0.02+/query) | Claude Opus 4 | 91.6% | $0.068 |

Each tier represents a jump in quality. The question is whether you need that jump.
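The tier table can be reproduced by bucketing each model on its average cost per query and keeping the most accurate model in each bucket. This is a sketch using the tier boundaries above. How a cost that lands exactly on a boundary is treated is an assumption here (it goes to the higher tier), since the table does not say.

```python
# (model, accuracy in %, average cost per query in USD), copied from the results table.
RESULTS = [
    ("Gemini 2.5 Flash", 78.4, 0.00024),
    ("GPT-4o mini", 76.2, 0.00034),
    ("DeepSeek V3", 82.6, 0.0012),
    ("Qwen3 (via API)", 81.0, 0.0016),
    ("Gemini 2.5 Pro", 87.4, 0.0068),
    ("GPT-4o", 84.8, 0.0072),
    ("Claude 4 Sonnet", 88.2, 0.0091),
    ("Claude Opus 4", 91.6, 0.062),
    ("Grok 3", 85.4, 0.082),
    ("GPT-4.5", 89.8, 0.126),
]

# Exclusive upper bound (USD per query) of each tier; the last tier is open-ended.
TIERS = [
    ("Ultra-cheap", 0.001),
    ("Cheap", 0.005),
    ("Mid-range", 0.02),
    ("Premium", float("inf")),
]


def tier_of(avg_cost: float) -> str:
    """Return the first tier whose upper bound is above the given cost."""
    for name, upper in TIERS:
        if avg_cost < upper:
            return name
    raise ValueError(f"no tier for cost {avg_cost}")


# Keep the highest-accuracy model seen so far in each tier.
best_in_tier = {}
for model, acc, cost in RESULTS:
    tier = tier_of(cost)
    if tier not in best_in_tier or acc > best_in_tier[tier][1]:
        best_in_tier[tier] = (model, acc, cost)

for name, _ in TIERS:
    model, acc, cost = best_in_tier[name]
    print(f"{name:<12} {model:<18} {acc:.1f}%")
```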

Going from Gemini Flash (78%) to DeepSeek V3 (83%) costs about 5x more per correct answer. Going from DeepSeek V3 to Claude 4 Sonnet (88%) costs about 7x more. Going from Sonnet to Opus (92%) costs about 7x more again.

Each step up buys roughly 3 to 6 points of accuracy and costs roughly 5-7x more per correct answer. The curve is remarkably consistent.
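The step-by-step multiples come straight from the frontier table. A short sketch that walks the four tier leaders in order and reports what each step buys and what it costs:

```python
# Tier leaders from the frontier table: (model, accuracy in %, USD per correct answer).
LEADERS = [
    ("Gemini 2.5 Flash", 78.4, 0.00031),
    ("DeepSeek V3", 82.6, 0.0015),
    ("Claude 4 Sonnet", 88.2, 0.0103),
    ("Claude Opus 4", 91.6, 0.068),
]

# Pair each leader with the next one up and record the accuracy gain and cost multiple.
steps = []
for (prev, prev_acc, prev_cost), (nxt, nxt_acc, nxt_cost) in zip(LEADERS, LEADERS[1:]):
    steps.append((prev, nxt, nxt_acc - prev_acc, nxt_cost / prev_cost))

for prev, nxt, gain, multiple in steps:
    print(f"{prev} -> {nxt}: +{gain:.1f} points for {multiple:.1f}x the cost per correct answer")
```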

By category

| Category | Cheapest correct answer | Most accurate model | Accuracy gap (points) |
|----------|----------------------|-------------------|-------------|
| General knowledge | Gemini Flash ($0.00028) | GPT-4.5 (94%) | 16 |
| Coding | GPT-4o mini ($0.00051) | Claude Opus 4 (95%) | 19 |
| Math | DeepSeek V3 ($0.0018) | Claude Opus 4 (93%) | 14 |
| Analysis | Gemini Flash ($0.00030) | Claude 4 Sonnet (90%) | 12 |
| Creative | Gemini Flash ($0.00032) | Claude 4 Sonnet (91%) | 15 |

Coding has the largest gap between cheapest and best (19 points). This is the one category where paying for premium models shows the clearest quality difference.

For general knowledge and analysis, the cheap models are "good enough" for most applications (78%+ accuracy).

Why this metric matters

Traditional benchmark comparisons say: "Claude Opus 4 is the best model." True. It has 91.6% accuracy.

Cost-per-correct-answer says: "Gemini 2.5 Flash gives you over 450x more correct answers per dollar than GPT-4.5, and over 200x more than Claude Opus 4." Also true.

| Decision | Use accuracy | Use cost-per-correct |
|----------|-------------|---------------------|
| "Which model is smartest?" | Accuracy ranking | N/A |
| "Which model should I deploy at scale?" | N/A | Cost-per-correct ranking |
| "Where should I invest in quality?" | By category, accuracy | By category, cost-per-correct |

For production systems processing millions of queries, cost-per-correct-answer is the only metric that ties to business economics. A model that's 5 points more accurate but 20x more expensive is a bad deal for classification. It might be a great deal for surgery.
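One way to make the classification-versus-surgery point concrete is to add a price for being wrong. The sketch below is an illustration under stated assumptions, not part of the 500-question evaluation: the two models, their costs, and the `error_cost` values are all hypothetical, and the per-query cost model (spend plus expected cost of an error) is a simplification.

```python
def expected_cost_per_query(query_cost: float, accuracy: float, error_cost: float) -> float:
    """Model spend plus the expected downstream cost of a wrong answer."""
    return query_cost + (1.0 - accuracy) * error_cost


# Hypothetical pair: B is 5 points more accurate and 20x more expensive than A.
MODEL_A = {"cost": 0.001, "acc": 0.85}
MODEL_B = {"cost": 0.020, "acc": 0.90}


def cheaper_option(error_cost: float) -> str:
    """Return which hypothetical model has the lower expected cost per query."""
    cost_a = expected_cost_per_query(MODEL_A["cost"], MODEL_A["acc"], error_cost)
    cost_b = expected_cost_per_query(MODEL_B["cost"], MODEL_B["acc"], error_cost)
    return "A" if cost_a <= cost_b else "B"


# Error cost at which the extra spend on B exactly pays for the errors it avoids.
break_even = (MODEL_B["cost"] - MODEL_A["cost"]) / (MODEL_B["acc"] - MODEL_A["acc"])

print(f"break-even error cost: ${break_even:.2f}")
print("cheap mistakes ($0.01 each):", cheaper_option(0.01))
print("costly mistakes ($100 each):", cheaper_option(100.0))
```

With these made-up numbers, the cheap model wins whenever a wrong answer costs less than the break-even value, and the expensive model wins above it.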

My spreadsheet now has a "cost per correct answer" column for every model. It changed how I think about model selection. I hope it changes how you think about it too.

-- dataku
