Google Gemini 2.0 Flash: the speed-to-quality ratio is unprecedented

Google just made the model I didn't know I was waiting for.

Gemini 2.0 Flash launched on December 11th, 2024. The pitch: multimodal (text, vision, audio, video), fast, and cheap. That's a pitch I've heard before from Google. This time, the data backs it up.

The standard evaluation

| Category | Gemini 2.0 Flash | GPT-4o | Claude 3.5 Sonnet (Oct) | Gemini 1.5 Flash |
|----------|-----------------|--------|------------------------|-----------------|
| Factual Q&A (50) | 4.14 | 4.18 | 4.34 | 3.72 |
| Code generation (50) | 4.22 | 4.28 | 4.62 | 3.64 |
| Creative writing (50) | 3.92 | 4.02 | 4.36 | 3.48 |
| Summarization (50) | 4.18 | 4.14 | 4.42 | 3.82 |
| Reasoning (50) | 4.16 | 4.22 | 4.48 | 3.58 |
| Instruction following (50) | 4.22 | 4.24 | 4.46 | 3.74 |
| Overall | 4.14 | 4.18 | 4.45 | 3.66 |

Source: My evaluation, 300 prompts, blind rating, December 2024.

Gemini 2.0 Flash scores 4.14 overall. GPT-4o scores 4.18. A gap of 0.04 points.

That's within noise for my evaluation methodology. With 50 prompts per category (300 overall) on a 5-point scale, a 0.04 difference is not statistically significant. Gemini 2.0 Flash and GPT-4o are, by my measurements, essentially the same quality.
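A rough sketch of why a 0.04 gap is hard to separate from noise at these sample sizes. The post does not report the spread of individual ratings, so `ASSUMED_SD` is a hypothetical value, and the two models are treated as independent samples for simplicity (a paired test on per-prompt ratings would be the more precise check).

```python
import math


def z_for_gap(gap: float, sd: float, n: int) -> float:
    """z statistic for a difference between two independent means,
    assuming both samples have size n and the same standard deviation."""
    standard_error = sd * math.sqrt(2 / n)
    return gap / standard_error


ASSUMED_SD = 0.8  # hypothetical rating spread; not measured in this post

z_overall = z_for_gap(0.04, ASSUMED_SD, 300)  # all 300 prompts
z_category = z_for_gap(0.04, ASSUMED_SD, 50)  # a single 50-prompt category

print(f"overall z = {z_overall:.2f}, per-category z = {z_category:.2f}")
# A two-sided 5% test needs |z| >= 1.96; under these assumptions both fall short.
```

A smaller assumed spread raises both z values, so the conclusion depends on the real rating variance.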

But look at the jump from Gemini 1.5 Flash: 3.66 to 4.14. A 0.48 point improvement in one generation. Google's Flash line went from "budget model" to "nearly flagship" in a single release.
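The overall row can be reproduced from the category rows. A minimal sketch, assuming the overall score is the unweighted mean of the six categories (each has 50 prompts, so a weighted mean would give the same result):

```python
from statistics import mean

# Per-category scores copied from the evaluation table (5-point scale).
scores = {
    "Gemini 2.0 Flash": [4.14, 4.22, 3.92, 4.18, 4.16, 4.22],
    "GPT-4o": [4.18, 4.28, 4.02, 4.14, 4.22, 4.24],
    "Claude 3.5 Sonnet (Oct)": [4.34, 4.62, 4.36, 4.42, 4.48, 4.46],
    "Gemini 1.5 Flash": [3.72, 3.64, 3.48, 3.82, 3.58, 3.74],
}

# Equal category sizes, so the overall score is the plain mean of the rows.
overall = {model: mean(values) for model, values in scores.items()}

for model, value in overall.items():
    print(f"{model}: {value:.2f}")

gap_vs_gpt4o = overall["GPT-4o"] - overall["Gemini 2.0 Flash"]
jump_vs_15 = overall["Gemini 2.0 Flash"] - overall["Gemini 1.5 Flash"]
print(f"GPT-4o lead: {gap_vs_gpt4o:.2f}, gain over 1.5 Flash: {jump_vs_15:.2f}")
```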

Speed comparison

I ran the same 100-request speed test I use for all providers:

| Model | Provider | Median TTFT | Median tok/sec | p99 TTFT |
|-------|----------|------------|---------------|---------|
| Gemini 2.0 Flash | Google | 142ms | 198 | 412ms |
| GPT-4o | OpenAI | 289ms | 78 | 890ms |
| Claude 3.5 Sonnet (Oct) | Anthropic | 324ms | 82 | 1,240ms |
| GPT-4o mini | OpenAI | 178ms | 128 | 534ms |
| Gemini 1.5 Flash | Google | 198ms | 145 | 678ms |

Source: My speed tests, 100 requests per model, December 2024.

Gemini 2.0 Flash at 198 tokens/second is 2.5x faster than GPT-4o (78 tok/sec) and 2.4x faster than Claude 3.5 Sonnet (82 tok/sec). It's even faster than GPT-4o mini (128 tok/sec).

And the TTFT (time to first token) at 142ms is about half of GPT-4o's 289ms. The model starts responding noticeably faster.
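For readers who want to run a similar test, here is a provider-agnostic sketch of how TTFT, throughput, and p99 can be measured from a streamed response. It is not the harness used for the numbers above: `fake_stream` is a hypothetical stand-in for a real streaming API call, the nearest-rank percentile is one of several common definitions, and streamed chunks only approximate tokens.

```python
import math
import time
from statistics import median
from typing import Iterable, Iterator, List, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[float, float]:
    """Return (ttft_seconds, chunks_per_second) for one streamed response."""
    start = time.perf_counter()
    first_at = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = time.perf_counter()  # first chunk arrived
        count += 1
    if first_at is None:
        raise ValueError("stream produced no chunks")
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid dividing by zero
    return first_at - start, count / elapsed


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def fake_stream(n_chunks: int, delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    for i in range(n_chunks):
        time.sleep(delay)
        yield f"tok{i}"


runs = [time_stream(fake_stream(20, 0.001)) for _ in range(10)]
ttfts = [ttft for ttft, _ in runs]
rates = [rate for _, rate in runs]
print(f"median TTFT {median(ttfts) * 1000:.1f} ms")
print(f"median rate {median(rates):.0f} chunks/sec")
print(f"p99 TTFT    {percentile(ttfts, 99) * 1000:.1f} ms")
```

To measure a real provider, replace `fake_stream(...)` with the iterator returned by that provider's streaming client and convert chunks to tokens with the provider's tokenizer if exact tok/sec matters.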

The price-quality-speed triangle

Here's where it gets compelling. I don't normally get to show a model winning on all three dimensions:

| Model | My eval score | Output $/M tokens | Tokens/sec | Score per dollar | Score per speed unit |
|-------|-------------|-------------------|-----------|-----------------|---------------------|
| Gemini 2.0 Flash | 4.14 | ~$0.30 est. | 198 | 13.80 | 0.021 |
| GPT-4o | 4.18 | $15.00 | 78 | 0.28 | 0.054 |
| Claude 3.5 Sonnet | 4.45 | $15.00 | 82 | 0.30 | 0.054 |
| GPT-4o mini | 3.84 | $0.60 | 128 | 6.40 | 0.030 |
| Gemini 1.5 Flash | 3.66 | $0.30 | 145 | 12.20 | 0.025 |

Source: My evaluation data, provider pricing, speed tests, December 2024.

Score per dollar: Gemini 2.0 Flash delivers 13.80 quality points per dollar. GPT-4o delivers 0.28. That's a 49x value advantage.

And it's not just cheap. It's also fast. 198 tokens/second puts it in the speed tier just below the specialized inference chips (Groq, Cerebras).

The only model that clearly beats Gemini 2.0 Flash on absolute quality is Claude 3.5 Sonnet (4.45 vs 4.14); GPT-4o's 0.04 edge is within noise. But Claude 3.5 Sonnet costs 50x as much per output token, and Gemini 2.0 Flash generates tokens 2.4x faster.
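The derived columns and ratios in this section are simple divisions. A short sketch using the table's inputs (the Gemini 2.0 Flash price is the post's estimate, not a confirmed list price):

```python
# (eval score, output $ per 1M tokens, median tokens/sec) from the tables above.
# The Gemini 2.0 Flash price is the post's estimate.
models = {
    "Gemini 2.0 Flash": (4.14, 0.30, 198),
    "GPT-4o": (4.18, 15.00, 78),
    "Claude 3.5 Sonnet": (4.45, 15.00, 82),
    "GPT-4o mini": (3.84, 0.60, 128),
    "Gemini 1.5 Flash": (3.66, 0.30, 145),
}


def score_per_dollar(name: str) -> float:
    """Eval score divided by output price per 1M tokens."""
    score, price, _ = models[name]
    return score / price


def score_per_speed_unit(name: str) -> float:
    """Eval score divided by tokens/sec, as defined in the table."""
    score, _, tok_per_sec = models[name]
    return score / tok_per_sec


for name in models:
    print(f"{name}: {score_per_dollar(name):.2f} per $, "
          f"{score_per_speed_unit(name):.3f} per tok/sec")

value_ratio = score_per_dollar("Gemini 2.0 Flash") / score_per_dollar("GPT-4o")
price_ratio = models["Claude 3.5 Sonnet"][1] / models["Gemini 2.0 Flash"][1]
speed_ratio = models["Gemini 2.0 Flash"][2] / models["Claude 3.5 Sonnet"][2]
print(f"value vs GPT-4o: {value_ratio:.1f}x")
print(f"Claude price premium: {price_ratio:.0f}x, Flash speed edge: {speed_ratio:.1f}x")
```

With unrounded inputs the value ratio comes out at about 49.5x; the 49x in the text comes from dividing the rounded column values (13.80 / 0.28).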

Multimodal quality

Gemini 2.0 Flash is natively multimodal. I tested image understanding:

| Image task | Gemini 2.0 Flash | GPT-4o | Claude 3.5 Sonnet |
|-----------|-----------------|--------|-------------------|
| Describe scene (20 images) | 4.08/5 | 4.18/5 | 4.02/5 |
| Read text in image (20 images) | 4.24/5 | 4.12/5 | 3.88/5 |
| Chart/graph interpretation (20 images) | 4.32/5 | 4.28/5 | 4.14/5 |
| Object counting (20 images) | 3.84/5 | 3.92/5 | 3.72/5 |
| Image average | 4.12 | 4.13 | 3.94 |

Source: My evaluation, 80 image tasks, December 2024.

Essentially tied with GPT-4o on image understanding (4.12 vs 4.13). Beats Claude 3.5 Sonnet (3.94). And notably strong on text-in-image reading (4.24, best of the three), which matters for document processing and OCR-like tasks.

What Google got right this time

Google's previous AI model releases had a pattern: impressive benchmarks on paper, underwhelming in practice. Gemini 1.0 Pro was mediocre. Gemini 1.0 Ultra was good but slow and expensive. Gemini 1.5 Pro was solid but not best-in-class.

Gemini 2.0 Flash breaks the pattern. Here's what changed:

| Previous Google pattern | Gemini 2.0 Flash |
|------------------------|------------------|
| Good benchmarks, weak in practice | Benchmarks match real-world quality |
| Slow API, high latency | 198 tok/sec, 142ms TTFT (competitive with Groq tier) |
| Confusing pricing tiers | Simple: cheap (Flash pricing) |
| Multimodal as afterthought | Multimodal from day one, genuinely good |

Google has massive infrastructure advantages (custom TPUs, global data centers, enormous bandwidth). Previous Gemini models didn't fully exploit those advantages. Gemini 2.0 Flash does.

Updated model recommendations

| Use case | Previous recommendation | New recommendation |
|----------|----------------------|-------------------|
| Best quality, cost no object | Claude 3.5 Sonnet | Claude 3.5 Sonnet (unchanged) |
| Best value, general use | GPT-4o mini | Gemini 2.0 Flash |
| Fastest API | Groq (Llama 3.1 70B) | Gemini 2.0 Flash or Groq |
| Image/document processing | GPT-4o | Gemini 2.0 Flash |
| Budget chatbot | Gemini 1.5 Flash | Gemini 2.0 Flash |

Gemini 2.0 Flash just became my default recommendation for most use cases. The combination of GPT-4o-level quality at Flash-level pricing with 198 tok/sec speed is hard to beat.

Claude 3.5 Sonnet is still the best model overall (4.45 vs 4.14). If you need those last 0.31 points of quality, it's worth the 50x price premium. For most other use cases, Gemini 2.0 Flash is the answer.
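One way to turn that trade-off into a rule: pick the cheapest model that clears a quality floor, breaking price ties by speed. This is an illustrative sketch, not the method behind the recommendation table; the `Model` type, the `pick` helper, and the example thresholds are hypothetical, while the numbers come from the tables above.

```python
from typing import List, NamedTuple, Optional


class Model(NamedTuple):
    name: str
    score: float      # overall eval score (5-point scale)
    price: float      # output $ per 1M tokens (2.0 Flash price is an estimate)
    tok_per_sec: int  # median tokens/sec


MODELS = [
    Model("Gemini 2.0 Flash", 4.14, 0.30, 198),
    Model("GPT-4o", 4.18, 15.00, 78),
    Model("Claude 3.5 Sonnet", 4.45, 15.00, 82),
    Model("GPT-4o mini", 3.84, 0.60, 128),
    Model("Gemini 1.5 Flash", 3.66, 0.30, 145),
]


def pick(models: List[Model], min_score: float, min_speed: int = 0) -> Optional[Model]:
    """Cheapest model meeting both floors; faster model wins a price tie."""
    eligible = [m for m in models
                if m.score >= min_score and m.tok_per_sec >= min_speed]
    if not eligible:
        return None
    return min(eligible, key=lambda m: (m.price, -m.tok_per_sec))


for floor in (3.5, 4.0, 4.3):  # hypothetical quality floors
    choice = pick(MODELS, min_score=floor)
    print(f"score >= {floor}: {choice.name if choice else 'no model qualifies'}")
```

Under these inputs, Claude 3.5 Sonnet is selected only once the quality floor rises above what Gemini 2.0 Flash and GPT-4o score.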

Google finally built the model that matches their infrastructure. My spreadsheet is smiling.

-- dataku
