Claude Opus 4 vs GPT-4o vs Gemini 2.5 Pro: the definitive Q4 comparison

500 prompts. 8 categories. 3 human raters (including me). Three models.

This is my most thorough frontier model comparison. I wanted to answer, definitively, which model is "best" in November 2025.

The answer: it depends. But let me show you exactly how it depends.

The models

| Model | Provider | Input ($/M tokens) | Output ($/M tokens) | Context | Released |
|-------|----------|--------------------|---------------------|---------|----------|
| Claude Opus 4 | Anthropic | $15.00 | $75.00 | 200K | Apr 2025 |
| GPT-4o (Nov update) | OpenAI | $2.50 | $10.00 | 128K | Nov 2025 |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | 2M | Feb 2025 |

Results by category (500 prompts, 8 categories)

| Category | Claude Opus 4 | GPT-4o | Gemini 2.5 Pro | Winner |
|----------|---------------|--------|----------------|--------|
| Coding (Python, 75 prompts) | 92% | 84% | 88% | Claude |
| Coding (other langs, 50 prompts) | 88% | 82% | 84% | Claude |
| Analysis/reasoning (75 prompts) | 86% | 80% | 84% | Claude |
| Creative writing (50 prompts) | 88% | 78% | 76% | Claude |
| Factual Q&A (50 prompts) | 86% | 90% | 88% | GPT-4o |
| Multimodal/vision (50 prompts) | 84% | 82% | 86% | Gemini |
| Long context (50 prompts) | 80% | 72% | 92% | Gemini |
| Instruction following (100 prompts) | 90% | 86% | 84% | Claude |
| Overall (500 prompts) | 87.4% | 82.4% | 85.0% | Claude |

Source: my evaluation with 3 human raters. Scores are adjusted for inter-rater agreement.

Claude Opus 4 wins 5 of 8 categories and takes the overall crown at 87.4%. But look at where it loses: factual Q&A (GPT-4o wins), multimodal (Gemini wins), and long context (Gemini wins by 12 points).
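
For anyone who wants to check the overall row, here is a minimal sketch that recomputes it as a prompt-weighted average of the category scores. The weighting scheme is my assumption about how the overall figure combines the categories, and the names `RESULTS`, `weighted_overall`, and `category_wins` are just illustrative. Because the category scores in the table are rounded to whole percentages, the recomputed figures should only be expected to land within a few tenths of a point of the published ones.

```python
# Prompt counts and scores (%) copied from the results table.
# Score order: Claude Opus 4, GPT-4o, Gemini 2.5 Pro.
MODELS = ("Claude Opus 4", "GPT-4o", "Gemini 2.5 Pro")
RESULTS = {
    "Coding (Python)": (75, (92, 84, 88)),
    "Coding (other langs)": (50, (88, 82, 84)),
    "Analysis/reasoning": (75, (86, 80, 84)),
    "Creative writing": (50, (88, 78, 76)),
    "Factual Q&A": (50, (86, 90, 88)),
    "Multimodal/vision": (50, (84, 82, 86)),
    "Long context": (50, (80, 72, 92)),
    "Instruction following": (100, (90, 86, 84)),
}


def weighted_overall(results):
    """Prompt-weighted mean score per model, in percent (assumed method)."""
    total = sum(count for count, _ in results.values())
    return {
        model: sum(count * scores[i] for count, scores in results.values()) / total
        for i, model in enumerate(MODELS)
    }


def category_wins(results):
    """Count how many categories each model tops."""
    wins = {model: 0 for model in MODELS}
    for _, scores in results.values():
        wins[MODELS[scores.index(max(scores))]] += 1
    return wins


overall = weighted_overall(RESULTS)
wins = category_wins(RESULTS)
for model in MODELS:
    print(f"{model}: {overall[model]:.1f}% overall, {wins[model]} category wins")
```

With the rounded inputs this gives roughly 87.3%, 82.2%, and 85.2%, each within about 0.2 points of the table, and the same 5/1/2 split of category wins.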

The category that matters most to you

| If you care about... | Best model | Gap to 2nd (points) |
|----------------------|------------|---------------------|
| Python coding | Claude Opus 4 | +4 over Gemini |
| Factual accuracy | GPT-4o | +2 over Gemini |
| Long documents | Gemini 2.5 Pro | +12 over Claude |
| Image understanding | Gemini 2.5 Pro | +2 over Claude |
| Creative writing | Claude Opus 4 | +10 over GPT-4o |
| Instruction following | Claude Opus 4 | +4 over GPT-4o |
| All-around | Claude Opus 4 | +2.4 over Gemini |
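
The "gap to 2nd" column is just the winner's score minus the runner-up's. A small sketch of that calculation, using scores from the results table (the helper name `gap_to_second` is hypothetical):

```python
def gap_to_second(scores):
    """Return (winner, runner_up, gap_in_points) for a {model: score} mapping."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (winner, best), (runner_up, second) = ranked[0], ranked[1]
    return winner, runner_up, round(best - second, 1)


# Scores (%) copied from the results table.
long_context = {"Claude Opus 4": 80, "GPT-4o": 72, "Gemini 2.5 Pro": 92}
overall = {"Claude Opus 4": 87.4, "GPT-4o": 82.4, "Gemini 2.5 Pro": 85.0}

long_context_gap = gap_to_second(long_context)
overall_gap = gap_to_second(overall)
print(long_context_gap)
print(overall_gap)
```

Run on your own category of interest to see whether the lead is large enough to matter for your use case.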

Speed comparison

| Metric | Claude Opus 4 | GPT-4o | Gemini 2.5 Pro |
|--------|---------------|--------|----------------|
| Time to first token | 310ms | 195ms | 180ms |
| Tokens per second | 72 t/s | 92 t/s | 105 t/s |
| Avg response time (300 tokens) | 4.5s | 3.5s | 3.1s |

Source: my latency measurements, US East, November 2025.

Gemini is fastest. GPT-4o is second. Claude is slowest.

For interactive applications, the ~1.4 second gap between Claude (4.5s) and Gemini (3.1s) is noticeable. For batch processing, it matters less.
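
If you want to estimate response times for other output lengths, a rough model is time to first token plus output tokens divided by throughput. This is a simplification I am assuming here, not how the averages in the table were measured, and `estimated_response_time` is an illustrative name. For 300 tokens it lands within about 0.1s of the measured averages.

```python
# (time to first token in ms, tokens per second), copied from the speed table.
SPEED = {
    "Claude Opus 4": (310, 72),
    "GPT-4o": (195, 92),
    "Gemini 2.5 Pro": (180, 105),
}


def estimated_response_time(ttft_ms, tokens_per_second, output_tokens=300):
    """Approximate end-to-end seconds: first-token latency plus streaming time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_second


estimates = {
    model: estimated_response_time(ttft, tps) for model, (ttft, tps) in SPEED.items()
}
for model, seconds in estimates.items():
    print(f"{model}: ~{seconds:.2f}s for 300 tokens")
```

Pass a different `output_tokens` value to see how the gap grows with longer responses, since throughput dominates once outputs get long.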

Cost per quality point

| Model | Avg quality | Cost per query | Cost per quality point |
|-------|-------------|----------------|------------------------|
| Claude Opus 4 | 87.4% | $0.068 | $0.078 |
| GPT-4o | 82.4% | $0.0072 | $0.0087 |
| Gemini 2.5 Pro | 85.0% | $0.0068 | $0.0080 |

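Claude scores highest overall. It's also about 9x more expensive per quality point than GPT-4o ($0.078 vs $0.0087), where cost per quality point is the cost per query divided by the quality score expressed as a fraction. The 5-point quality advantage costs a lot.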

Gemini 2.5 Pro offers nearly Claude-level quality (2.4 points less) at GPT-4o-like cost per quality point ($0.0080 vs $0.0087). It's the best value of the three for "pretty good at everything."
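
The cost-per-quality-point figures are consistent with dividing cost per query by the quality score as a fraction (so 87.4% becomes 0.874). A minimal sketch of that arithmetic using the table values; the function and variable names are illustrative:

```python
# (average quality in %, cost per query in USD), copied from the cost table.
COST_TABLE = {
    "Claude Opus 4": (87.4, 0.068),
    "GPT-4o": (82.4, 0.0072),
    "Gemini 2.5 Pro": (85.0, 0.0068),
}


def cost_per_quality_point(quality_pct, cost_per_query):
    """Cost per query divided by quality expressed as a fraction of 1."""
    return cost_per_query / (quality_pct / 100)


cpq = {
    model: cost_per_quality_point(quality, cost)
    for model, (quality, cost) in COST_TABLE.items()
}
for model, value in cpq.items():
    print(f"{model}: ${value:.4f} per quality point")

ratio = cpq["Claude Opus 4"] / cpq["GPT-4o"]
print(f"Claude vs GPT-4o: {ratio:.1f}x")
```

The ratio comes out just under 9, which matches the 9x figure once rounded.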

The verdict

| Decision framework | Choose |
|--------------------|--------|
| Budget is unlimited, quality is paramount | Claude Opus 4 |
| Need the fastest responses | Gemini 2.5 Pro |
| Need the best factual accuracy | GPT-4o |
| Working with very long documents | Gemini 2.5 Pro |
| Writing code all day | Claude Opus 4 |
| Need the best value (quality per dollar) | GPT-4o or Gemini 2.5 Pro |
| Want one model for everything | Claude Opus 4 (if you can afford it) |

There is no "best model" in November 2025. There are three excellent models, each with clear strengths.

The era of "just use GPT-4 for everything" ended sometime in 2024. We're now in the era of model routing: use the right model for the right task. The companies that figure this out will spend 3-5x less than those that don't.
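
To make "model routing" concrete, here is a toy sketch, not a production router. It picks the cheapest model whose score is within a chosen tolerance of the best score for a task category, using numbers from my tables. The category keys, the `tolerance` parameter, and the `route` function are all hypothetical, and the model names are plain labels rather than real API identifiers.

```python
# Category scores (%) and cost per query (USD), copied from the tables above.
SCORES = {
    "python_coding": {"Claude Opus 4": 92, "GPT-4o": 84, "Gemini 2.5 Pro": 88},
    "creative_writing": {"Claude Opus 4": 88, "GPT-4o": 78, "Gemini 2.5 Pro": 76},
    "factual_qa": {"Claude Opus 4": 86, "GPT-4o": 90, "Gemini 2.5 Pro": 88},
    "long_context": {"Claude Opus 4": 80, "GPT-4o": 72, "Gemini 2.5 Pro": 92},
}
COST_PER_QUERY = {"Claude Opus 4": 0.068, "GPT-4o": 0.0072, "Gemini 2.5 Pro": 0.0068}


def route(category, tolerance=0.0):
    """Cheapest model scoring within `tolerance` points of the category's best."""
    scores = SCORES[category]
    best = max(scores.values())
    candidates = [model for model, score in scores.items() if score >= best - tolerance]
    return min(candidates, key=lambda model: COST_PER_QUERY[model])


print(route("python_coding"))               # strict: only the top scorer qualifies
print(route("python_coding", tolerance=4))  # accept a 4-point drop for lower cost
print(route("creative_writing", tolerance=4))
print(route("long_context"))
```

The tolerance is the knob that trades quality for cost: at 0 you always get the category winner, and as you loosen it the router drifts toward the cheaper models wherever the gap is small.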

This was the most time-intensive evaluation I've done. Three weeks, 500 prompts, 1,500 ratings. My raters are exhausted. My spreadsheet is beautiful.

If you found this interesting, you might also like:

Claude 4 Sonnet vs GPT-4o vs Gemini 2.5 Flash: the mid-tier model war
Claude vs GPT-4: my first head-to-head data comparison
Mistral Large vs GPT-4 vs Claude 3 Opus: the three-way benchmark
[Gemini 2.5 Pro and ](/blog/gemini-2-5-pro-thinking-models-google-answer-o1)
GPT-3 vs GPT-J: the first real open source challenger, in data

-- dataku
