Model Comparisons · March 10, 2026 · 4 min read

o4-mini vs Claude 4 Sonnet vs Gemini 2.5 Flash: the speed tier showdown

The "fast and cheap" tier is where the real competition is. I compared the three on 200 tasks, optimizing for speed and cost rather than peak quality. Gemini Flash wins on price and speed. Claude Sonnet wins on quality, including coding. o4-mini lands in between.

Flagship models get the attention. Speed-tier models get the deployments.

OpenAI's o4-mini, Anthropic's Claude 4 Sonnet, and Google's Gemini 2.5 Flash are the three models that most production applications actually use. I compared them on what matters: speed, cost, and quality at scale.

The specs

| Spec | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|------|---------|----------------|-----------------|
| Input price (per 1M tokens) | $1.10 | $2.50 | $0.05 |
| Output price (per 1M tokens) | $4.40 | $12.50 | $0.20 |
| Context window | 128K | 200K | 1M |
| Output speed | 120 t/s | 95 t/s | 340 t/s |
| Time to first token (TTFT) | 150ms | 220ms | 120ms |

Sources: OpenAI, Anthropic, Google pricing and performance data, Artificial Analysis.

Gemini Flash is 50x cheaper than Claude Sonnet on input tokens and 62.5x cheaper on output tokens. o4-mini falls in the middle.

Speed: Gemini Flash at 340 t/s is 3.6x faster than Claude Sonnet (95 t/s). For latency-sensitive applications, this matters a lot.
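The ratios quoted here fall straight out of the spec table. A minimal Python sketch, with the table's figures hard-coded; the dictionary layout and the `ratio` helper are illustrative names, not part of any vendor SDK:

```python
# List prices (USD per million tokens) and output speed (tokens/sec),
# copied from the specs table in this article.
SPECS = {
    "o4-mini": {"input": 1.10, "output": 4.40, "tps": 120},
    "Claude 4 Sonnet": {"input": 2.50, "output": 12.50, "tps": 95},
    "Gemini 2.5 Flash": {"input": 0.05, "output": 0.20, "tps": 340},
}


def ratio(metric: str, numerator: str, denominator: str) -> float:
    """Return how many times larger `metric` is for one model than another."""
    return SPECS[numerator][metric] / SPECS[denominator][metric]


if __name__ == "__main__":
    for metric in ("input", "output"):
        r = ratio(metric, "Claude 4 Sonnet", "Gemini 2.5 Flash")
        print(f"Sonnet vs Flash, {metric} price: {r:.1f}x")
    speed = ratio("tps", "Gemini 2.5 Flash", "Claude 4 Sonnet")
    print(f"Flash vs Sonnet, output speed: {speed:.1f}x")
```

Swapping in o4-mini as either argument gives the middle-of-the-pack numbers for the same comparison.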

200-task evaluation

| Category (40 tasks each) | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|--------------------------|---------|----------------|-----------------|
| Classification | 88% | 90% | 87% |
| Extraction | 86% | 92% | 84% |
| Code generation | 84% | 90% | 78% |
| Summarization | 82% | 88% | 86% |
| Reasoning | 80% | 86% | 76% |
| Overall | 84.0% | 89.2% | 82.2% |

Claude 4 Sonnet leads quality across all 5 categories. o4-mini is second at 84%. Gemini Flash is third at 82.2%.

The gap: Claude is 5.2 points above o4-mini and 7 points above Gemini Flash.
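Because every category has the same number of tasks (40), the overall score is just the unweighted mean of the five category scores. A small sketch of that arithmetic, with the scores copied from the table; the names `SCORES` and `overall` are illustrative:

```python
from statistics import mean

# Per-category accuracy (%), in the order: classification, extraction,
# code generation, summarization, reasoning. Copied from the table.
SCORES = {
    "o4-mini": [88, 86, 84, 82, 80],
    "Claude 4 Sonnet": [90, 92, 90, 88, 86],
    "Gemini 2.5 Flash": [87, 84, 78, 86, 76],
}


def overall(model: str) -> float:
    """Unweighted mean across categories (valid because each has 40 tasks)."""
    return float(mean(SCORES[model]))


if __name__ == "__main__":
    leader = "Claude 4 Sonnet"
    for model in SCORES:
        gap = overall(leader) - overall(model)
        print(f"{model}: {overall(model):.1f}% (gap to leader: {gap:.1f} points)")
```

If the categories had different sizes, the mean would need to be weighted by task count instead.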

Cost per 1,000 queries

| Task type | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|-----------|---------|----------------|-----------------|
| Classification (short) | $0.22 | $0.48 | $0.008 |
| Extraction (medium) | $0.44 | $0.95 | $0.016 |
| Code generation (long) | $1.32 | $3.75 | $0.060 |
| Summarization (medium) | $0.66 | $1.56 | $0.024 |
| Reasoning (long) | $1.54 | $4.38 | $0.068 |

Gemini Flash is an order of magnitude cheaper on every task. 1,000 classification queries cost less than a penny ($0.008).
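Per-1,000-query cost comes from per-token prices multiplied by how many tokens each query consumes. The article does not publish token counts per task, so the 100 input and 25 output tokens below are purely hypothetical placeholders, and the results will not necessarily reproduce the table. A rough sketch of the calculation:

```python
def cost_per_1k_queries(
    input_price_per_m: float,
    output_price_per_m: float,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """USD cost of 1,000 queries, given per-million-token prices
    and the token counts of a single query."""
    per_query = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    return per_query * 1_000


if __name__ == "__main__":
    # Hypothetical short classification query: 100 tokens in, 25 tokens out.
    # These token counts are an assumption, not figures from the evaluation.
    prices = {
        "o4-mini": (1.10, 4.40),
        "Claude 4 Sonnet": (2.50, 12.50),
        "Gemini 2.5 Flash": (0.05, 0.20),
    }
    for model, (p_in, p_out) in prices.items():
        print(f"{model}: ${cost_per_1k_queries(p_in, p_out, 100, 25):.4f}")
```

Plugging in token counts measured from your own workload gives a more useful estimate than any published table.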

Cost per correct answer

| Category | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|----------|---------|----------------|-----------------|
| Classification | $0.00025 | $0.00053 | $0.0000092 |
| Extraction | $0.00051 | $0.0010 | $0.000019 |
| Code generation | $0.0016 | $0.0042 | $0.000077 |
| Summarization | $0.0008 | $0.0018 | $0.000028 |
| Reasoning | $0.0019 | $0.0051 | $0.000089 |

Even adjusting for accuracy, Gemini Flash wins cost-per-correct-answer in every category. Its lower accuracy (82.2% vs 89.2%) doesn't offset the 50x price advantage.
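The accuracy adjustment is a single division: per-query cost over the fraction of answers that are correct. A minimal sketch using the classification row (cost per 1,000 queries and accuracy) as input; the function name is illustrative:

```python
def cost_per_correct(cost_per_1k: float, accuracy: float) -> float:
    """USD spent per correct answer.

    cost_per_1k: cost of 1,000 queries in USD.
    accuracy: fraction of queries answered correctly, in (0, 1].
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in the interval (0, 1]")
    return cost_per_1k / 1_000 / accuracy


if __name__ == "__main__":
    # (cost per 1K queries, accuracy) for the classification category.
    classification = {
        "o4-mini": (0.22, 0.88),
        "Claude 4 Sonnet": (0.48, 0.90),
        "Gemini 2.5 Flash": (0.008, 0.87),
    }
    for model, (cost, acc) in classification.items():
        print(f"{model}: ${cost_per_correct(cost, acc):.7f} per correct answer")
```

This treats a wrong answer as pure waste. If a wrong answer also triggers a retry or a human review, the real cost of low accuracy is higher than this figure suggests.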

The speed test

| Metric | o4-mini | Claude 4 Sonnet | Gemini 2.5 Flash |
|--------|---------|----------------|-----------------|
| 100 queries batch time | 28s | 42s | 12s |
| P95 latency (single query) | 2.1s | 3.8s | 1.2s |
| Queries per minute (sustained) | 210 | 140 | 480 |

Gemini Flash processes 480 queries per minute vs Claude Sonnet's 140, about 3.4x the throughput. For high-throughput applications (real-time classification, live data processing), Flash is the clear choice.
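As a sanity check, the batch times can be converted into an implied queries-per-minute rate. The implied rates come out slightly above the sustained figures in the table (500 vs 480 for Flash, for example). A short sketch, with an illustrative function name:

```python
def queries_per_minute(num_queries: int, batch_seconds: float) -> float:
    """Throughput implied by finishing `num_queries` in `batch_seconds`."""
    if batch_seconds <= 0:
        raise ValueError("batch_seconds must be positive")
    return num_queries / batch_seconds * 60


if __name__ == "__main__":
    # Batch times for 100 queries, copied from the speed table.
    batch_times = {
        "o4-mini": 28,
        "Claude 4 Sonnet": 42,
        "Gemini 2.5 Flash": 12,
    }
    for model, seconds in batch_times.items():
        print(f"{model}: {queries_per_minute(100, seconds):.0f} queries/min implied")
```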

Decision framework

| Your priority | Best choice | Why |
|--------------|------------|-----|
| Maximum quality (speed tier) | Claude 4 Sonnet | +5-7 points over alternatives |
| Minimum cost | Gemini 2.5 Flash | 50x cheaper, 82%+ quality |
| Best coding | Claude 4 Sonnet | 90% vs 84% vs 78% |
| Highest throughput | Gemini 2.5 Flash | 3.4x faster than Sonnet |
| Balanced | o4-mini | Mid-range on all metrics |
| Classification at scale | Gemini 2.5 Flash | $0.008 per 1K queries |
| Customer-facing chat | Claude 4 Sonnet | Best instruction following |

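For most production deployments processing millions of queries, Gemini Flash wins. Its overall gap to Claude Sonnet is 7 points, but the per-category gap is smaller where volume is highest: 3 points on classification and 2 on summarization. Extraction is the exception at 8 points, so test it on your own data before committing.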
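To see what "millions of queries" means in dollars, the per-1,000-query costs can be scaled to a target volume. A small sketch using the classification costs reported earlier; the one-million-query volume is just an example figure, not a number from the evaluation:

```python
def cost_at_volume(cost_per_1k: float, num_queries: int) -> float:
    """USD cost of `num_queries`, given the cost of 1,000 queries."""
    return cost_per_1k * num_queries / 1_000


if __name__ == "__main__":
    # Classification cost per 1K queries, from the cost table.
    classification_cost_per_1k = {
        "o4-mini": 0.22,
        "Claude 4 Sonnet": 0.48,
        "Gemini 2.5 Flash": 0.008,
    }
    volume = 1_000_000  # example volume only
    for model, cost in classification_cost_per_1k.items():
        print(f"{model}: ${cost_at_volume(cost, volume):,.2f} per {volume:,} queries")
```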

For coding and customer-facing applications where quality directly impacts user experience, Claude Sonnet justifies its premium.

o4-mini feels like the "compromise" option that nobody is excited about but everybody considers.

The speed tier is where the actual money gets spent in AI. And the winner depends entirely on whether you're optimizing for cost or quality. There's no model that wins both.


-- dataku
