o4-mini vs Claude 4 Sonnet vs Gemini 2.5 Flash: the speed tier showdown

Flagship models get the attention. Speed-tier models get the deployments.

OpenAI's o4-mini, Anthropic's Claude 4 Sonnet, and Google's Gemini 2.5 Flash are the three models that most production applications actually use. I compared them on what matters: speed, cost, and quality at scale.

The specs

| Spec | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|------|---------|----------------|-----------------|
| Input price (per 1M tokens) | $1.10 | $2.50 | $0.05 |
| Output price (per 1M tokens) | $4.40 | $12.50 | $0.20 |
| Context window (tokens) | 128K | 200K | 1M |
| Output speed (tokens/sec) | 120 | 95 | 340 |
| Time to first token (TTFT) | 150 ms | 220 ms | 120 ms |

Sources: OpenAI, Anthropic, Google pricing and performance data, Artificial Analysis.

Gemini Flash is 50x cheaper than Claude Sonnet on input tokens ($0.05 vs $2.50 per million) and 62.5x cheaper on output tokens ($0.20 vs $12.50). o4-mini falls in the middle at $1.10 and $4.40.

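Speed: at 340 t/s, Gemini Flash generates output 3.6x as fast as Claude Sonnet (95 t/s) and 2.8x as fast as o4-mini (120 t/s). For latency-sensitive applications this is visible to users: a 500-token response streams in about 1.5 seconds on Flash versus about 5.3 seconds on Sonnet.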
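The ratios above come straight from the spec table. A minimal Python sketch that reproduces them (the `SPECS` layout and the `ratio` helper are illustrative names, not part of any vendor SDK):

```python
# Spec values copied from the table above (USD per 1M tokens, tokens/sec).
SPECS = {
    "o4-mini": {"input": 1.10, "output": 4.40, "tps": 120},
    "Claude 4 Sonnet": {"input": 2.50, "output": 12.50, "tps": 95},
    "Gemini 2.5 Flash": {"input": 0.05, "output": 0.20, "tps": 340},
}


def ratio(metric, numerator, denominator):
    """How many times larger `metric` is for one model than for another."""
    return SPECS[numerator][metric] / SPECS[denominator][metric]


for metric in ("input", "output"):
    r = ratio(metric, "Claude 4 Sonnet", "Gemini 2.5 Flash")
    print(f"{metric} price: Sonnet costs {r:.1f}x what Flash costs")

r = ratio("tps", "Gemini 2.5 Flash", "Claude 4 Sonnet")
print(f"output speed: Flash is {r:.1f}x as fast as Sonnet")
```

Swapping in current list prices is a one-line change per model, which is worth doing before any real decision since vendor pricing moves often.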

200-task evaluation

| Category (40 tasks each) | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|--------------------------|---------|----------------|-----------------|
| Classification | 88% | 90% | 87% |
| Extraction | 86% | 92% | 84% |
| Code generation | 84% | 90% | 78% |
| Summarization | 82% | 88% | 86% |
| Reasoning | 80% | 86% | 76% |
| Overall | 84.0% | 89.2% | 82.2% |

Claude 4 Sonnet leads on quality in all 5 categories. o4-mini is second overall at 84.0% and Gemini Flash is third at 82.2%, although Flash beats o4-mini on summarization (86% vs 82%).

The gap: Claude is 5.2 points above o4-mini and 7 points above Gemini Flash.
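Because every category has the same number of tasks, the overall score is just the unweighted mean of the five category scores. A short sketch of that check, with the scores copied from the table (the variable names are mine):

```python
from statistics import mean

# Per-category accuracy (%) in the order: classification, extraction,
# code generation, summarization, reasoning.
SCORES = {
    "o4-mini": [88, 86, 84, 82, 80],
    "Claude 4 Sonnet": [90, 92, 90, 88, 86],
    "Gemini 2.5 Flash": [87, 84, 78, 86, 76],
}

# Equal-sized categories (40 tasks each), so a plain mean equals
# the accuracy over all 200 tasks.
overall = {model: mean(scores) for model, scores in SCORES.items()}
leader = max(overall, key=overall.get)

for model, score in overall.items():
    gap = overall[leader] - score
    print(f"{model}: {score:.1f}% overall (gap to {leader}: {gap:.1f} points)")
```

If the categories had different sizes, the mean would need to be weighted by task count instead.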

Cost per 1,000 queries

| Task type | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|-----------|---------|----------------|-----------------|
| Classification (short) | $0.22 | $0.48 | $0.008 |
| Extraction (medium) | $0.44 | $0.95 | $0.016 |
| Code generation (long) | $1.32 | $3.75 | $0.060 |
| Summarization (medium) | $0.66 | $1.56 | $0.024 |
| Reasoning (long) | $1.54 | $4.38 | $0.068 |

Gemini Flash is more than an order of magnitude cheaper on every task: 22x to 28x cheaper than o4-mini and 59x to 65x cheaper than Claude Sonnet. 1,000 classification queries cost less than a penny ($0.008).
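The per-1,000-query figures depend on how many tokens each task consumes, which the table does not list. As a rough sketch of the underlying arithmetic, here is the calculation with hypothetical token counts (100 input and 25 output tokens are placeholders, not the workloads behind the table, so the output will not match it exactly):

```python
# USD per 1M tokens (input, output), copied from the spec table.
PRICES = {
    "o4-mini": (1.10, 4.40),
    "Claude 4 Sonnet": (2.50, 12.50),
    "Gemini 2.5 Flash": (0.05, 0.20),
}


def cost_per_1k_queries(model, input_tokens, output_tokens):
    """Cost in USD of 1,000 queries with the given per-query token counts."""
    input_price, output_price = PRICES[model]
    per_query = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_query * 1_000


# Hypothetical short classification request: 100 tokens in, 25 tokens out.
for model in PRICES:
    cost = cost_per_1k_queries(model, input_tokens=100, output_tokens=25)
    print(f"{model}: ${cost:.4f} per 1,000 queries")
```

Real token counts differ between models because each uses its own tokenizer and produces responses of different lengths, so it is worth measuring them on your own prompts.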

Cost per correct answer

| Category | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|----------|---------|----------------|-----------------|
| Classification | $0.00025 | $0.00053 | $0.0000092 |
| Extraction | $0.00051 | $0.0010 | $0.000019 |
| Code generation | $0.0016 | $0.0042 | $0.000077 |
| Summarization | $0.00080 | $0.0018 | $0.000028 |
| Reasoning | $0.0019 | $0.0051 | $0.000089 |

Each cell is the cost per query (the cost per 1,000 queries divided by 1,000) divided by the category accuracy.

Even adjusting for accuracy, Gemini Flash wins cost-per-correct-answer in every category. Its lower accuracy (82.2% vs 89.2%) doesn't offset the 50x price advantage.
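Cost per correct answer is the cost of one query divided by the fraction of queries answered correctly. A minimal sketch of that formula (the function name is mine), using the o4-mini classification figures from the tables:

```python
def cost_per_correct_answer(cost_per_1k_queries, accuracy):
    """Expected USD spent per correct answer.

    cost_per_1k_queries: USD for 1,000 queries.
    accuracy: fraction of answers that are correct, in (0, 1].
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_query = cost_per_1k_queries / 1_000
    # On average only `accuracy` of the queries yield a correct answer,
    # so each correct answer carries the cost of 1 / accuracy queries.
    return cost_per_query / accuracy


# o4-mini classification: $0.22 per 1,000 queries at 88% accuracy.
print(f"${cost_per_correct_answer(0.22, 0.88):.5f}")  # $0.00025
```

This metric treats a wrong answer as merely wasted spend. If a wrong answer has its own downstream cost (a support ticket, a bad code merge), that cost has to be added separately.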

The speed test

| Metric | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|--------|---------|----------------|-----------------|
| 100-query batch time | 28s | 42s | 12s |
| P95 latency (single query) | 2.1s | 3.8s | 1.2s |
| Queries per minute (sustained) | 210 | 140 | 480 |

Gemini Flash sustains 480 queries per minute vs o4-mini's 210 and Claude Sonnet's 140. For high-throughput applications (real-time classification, live data processing), Flash is the clear choice.
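To make the throughput numbers concrete, this sketch converts the sustained rates from the table into wall-clock time for a backlog. The 1M-query backlog is a hypothetical workload, and the calculation assumes the measured rate holds for the whole run, with no rate limits or added parallelism:

```python
# Sustained throughput (queries per minute), copied from the table above.
SUSTAINED_QPM = {
    "o4-mini": 210,
    "Claude 4 Sonnet": 140,
    "Gemini 2.5 Flash": 480,
}


def hours_to_process(total_queries, queries_per_minute):
    """Wall-clock hours to work through a backlog at a fixed sustained rate."""
    if queries_per_minute <= 0:
        raise ValueError("queries_per_minute must be positive")
    return total_queries / queries_per_minute / 60


BACKLOG = 1_000_000  # hypothetical number of queued queries

for model, qpm in SUSTAINED_QPM.items():
    print(f"{model}: {hours_to_process(BACKLOG, qpm):.1f} hours for {BACKLOG:,} queries")
```

Under those assumptions the backlog takes roughly a day and a half on Flash and about five days on Sonnet.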

Decision framework

| Your priority | Best choice | Why |
|--------------|------------|-----|
| Maximum quality (speed tier) | Claude 4 Sonnet | +5-7 points over alternatives |
| Minimum cost | Gemini 2.5 Flash | 50x cheaper, 82%+ quality |
| Best coding | Claude 4 Sonnet | 90% vs 84% vs 78% |
| Highest throughput | Gemini 2.5 Flash | 3.4x faster than Sonnet |
| Balanced | o4-mini | Mid-range on all metrics |
| Classification at scale | Gemini 2.5 Flash | $0.008 per 1K queries |
| Customer-facing chat | Claude 4 Sonnet | Best instruction following |

For most production deployments processing millions of queries, Gemini Flash wins. The 7-point quality gap to Claude Sonnet rarely matters for classification, extraction, or summarization at scale.
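Here is a sketch of what "at scale" means in dollars and errors, using the classification row from the cost and accuracy tables. The 10M queries per month is a hypothetical volume, and the projection ignores volume discounts, retries, and caching:

```python
# Cost per 1,000 queries (USD) and accuracy for classification,
# copied from the tables above.
CLASSIFICATION = {
    "o4-mini": {"cost_per_1k": 0.22, "accuracy": 0.88},
    "Claude 4 Sonnet": {"cost_per_1k": 0.48, "accuracy": 0.90},
    "Gemini 2.5 Flash": {"cost_per_1k": 0.008, "accuracy": 0.87},
}


def monthly_bill(cost_per_1k, queries_per_month):
    """Projected monthly spend in USD at a flat per-query price."""
    return cost_per_1k * queries_per_month / 1_000


def expected_errors(accuracy, queries_per_month):
    """Expected number of wrong answers per month at the measured accuracy."""
    return round((1 - accuracy) * queries_per_month)


QUERIES_PER_MONTH = 10_000_000  # hypothetical workload

for model, row in CLASSIFICATION.items():
    bill = monthly_bill(row["cost_per_1k"], QUERIES_PER_MONTH)
    errors = expected_errors(row["accuracy"], QUERIES_PER_MONTH)
    print(f"{model}: ${bill:,.0f}/month, about {errors:,} wrong answers")
```

At that volume the choice between Flash and Sonnet comes down to whether about 300,000 fewer wrong classifications per month are worth an extra $4,720.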

For coding and customer-facing applications where quality directly impacts user experience, Claude Sonnet justifies its premium.

o4-mini feels like the "compromise" option that nobody is excited about but everybody considers.

The speed tier is where the actual money gets spent in AI. And the winner depends entirely on whether you're optimizing for cost or quality. There's no model that wins both.

-- dataku
