Model Comparisons · April 10, 2023 · 7 min read

Claude vs GPT-4: my first head-to-head data comparison

Anthropic's Claude is in beta and I got access. I ran both models through 300 prompts across coding, writing, and reasoning. Claude wins on length and nuance. GPT-4 wins on accuracy. The data is tight.

I've been on the Anthropic Claude waitlist since December. Access came through in late March, and I've spent two weeks running it through the same battery of tests I use for every model.

The short version: this is a real competition. Claude isn't a GPT-4 clone and it isn't a distant second place. It's a genuinely different model with different strengths, and the data is closer than I expected.

Test setup

300 prompts, split across 6 categories. I ran every prompt on both Claude v1.3 and GPT-4 (8K context). Blind evaluation: I read both outputs without knowing which model produced which, and rated each on a 1-5 scale for quality.

| Category | # prompts | What I tested |
|----------|-----------|---------------|
| Coding (Python) | 50 | Bug fixes, function writing, code explanation |
| Creative writing | 50 | Short stories, essays, marketing copy |
| Factual Q&A | 50 | Specific, verifiable questions |
| Summarization | 50 | Condense 2000-word articles to 3 paragraphs |
| Reasoning | 50 | Logic puzzles, multi-step word problems |
| Instruction following | 50 | Complex multi-part instructions with constraints |
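A minimal sketch of how the blinding step could be done, assuming each prompt's two outputs are shuffled under neutral labels and the answer key is kept aside until rating is finished. The function name `blind_pair` and the A/B labels are illustrative, not taken from the actual test harness.

```python
import random

def blind_pair(prompt_id, claude_output, gpt4_output, rng):
    """Return both outputs under neutral labels, plus a hidden answer key."""
    pair = [("claude", claude_output), ("gpt4", gpt4_output)]
    rng.shuffle(pair)  # random order, so position gives nothing away
    shown = {"prompt_id": prompt_id, "A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}  # only consulted after rating
    return shown, key

rng = random.Random(42)  # fixed seed keeps the assignment reproducible
shown, key = blind_pair(1, "output from Claude", "output from GPT-4", rng)
```

The rater only ever sees `shown`; `key` is joined back in after all 300 ratings are recorded.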

Overall results

| Metric | Claude v1.3 | GPT-4 | Winner |
|--------|-------------|-------|--------|
| Avg quality (1-5 scale) | 3.82 | 4.11 | GPT-4 |
| Win rate (head-to-head) | 31.3% | 44.0% | GPT-4 |
| Tie rate | 24.7% | 24.7% | -- |
| Avg response length (tokens) | 487 | 312 | Claude (longer) |
| Refusal rate | 6.0% | 3.3% | GPT-4 (fewer refusals) |

GPT-4 wins overall. But 31.3% vs 44.0% win rate means Claude is taking nearly a third of the matchups. This isn't a blowout.
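For anyone who wants to compute the same kind of summary from their own ratings, here is a minimal sketch. It assumes a "win" means a strictly higher 1-5 rating on the same prompt and a "tie" means equal ratings; the post does not spell out the rule, so treat that as an assumption. The ratings in `example` are made up for illustration.

```python
def head_to_head(pairs):
    """Summarize paired ratings given as (claude_score, gpt4_score) tuples."""
    n = len(pairs)
    claude_wins = sum(1 for c, g in pairs if c > g)
    gpt4_wins = sum(1 for c, g in pairs if g > c)
    ties = n - claude_wins - gpt4_wins
    return {
        "claude_win_rate": claude_wins / n,
        "gpt4_win_rate": gpt4_wins / n,
        "tie_rate": ties / n,
        "claude_avg": sum(c for c, _ in pairs) / n,
        "gpt4_avg": sum(g for _, g in pairs) / n,
    }

# Made-up ratings for illustration only, not the real data.
example = [(4, 5), (3, 3), (5, 4), (2, 4)]
summary = head_to_head(example)
```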

Per-category breakdown

This is where it gets interesting.

| Category | Claude avg | GPT-4 avg | Claude win % | GPT-4 win % | Tie % |
|----------|------------|-----------|--------------|-------------|-------|
| Coding | 3.4 | 4.3 | 18% | 60% | 22% |
| Creative writing | 4.1 | 3.9 | 42% | 30% | 28% |
| Factual Q&A | 3.7 | 4.2 | 24% | 50% | 26% |
| Summarization | 4.2 | 4.0 | 40% | 28% | 32% |
| Reasoning | 3.5 | 4.3 | 16% | 58% | 26% |
| Instruction following | 4.0 | 3.9 | 38% | 34% | 28% |

Look at those numbers carefully. Claude beats GPT-4 on three categories: creative writing, summarization, and instruction following. GPT-4 beats Claude on coding, factual Q&A, and reasoning.

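Claude's wins in creative writing and summarization aren't marginal. A 42% win rate on creative writing versus GPT-4's 30% is a 12-point gap, and summarization shows the same spread (40% vs 28%). Instruction following is much closer at 38% vs 34%. Claude produces longer, more detailed, more thoughtfully structured writing. It builds paragraphs with better rhythm. It follows stylistic instructions more faithfully.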

GPT-4's wins are bigger in magnitude though. That 60% win rate on coding, against Claude's 18%, is decisive. GPT-4 writes code that runs correctly more often, catches edge cases better, and provides more accurate explanations.
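With 50 prompts per category, one way to gauge how much weight a win-rate gap can carry is an exact two-sided sign test on the non-tied matchups. This is a sketch of one common choice; the post does not name a statistical test, so it is not necessarily the method used here. The win counts are derived from the percentages in the table (for example, 42% of 50 is 21).

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test p-value; ties are dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided if each side wins 50/50.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Win counts implied by the per-category percentages, 50 prompts each.
creative = sign_test_p(21, 15)  # Claude 42%, GPT-4 30%
coding = sign_test_p(9, 30)     # Claude 18%, GPT-4 60%
print(f"creative writing p = {creative:.3f}, coding p = {coding:.5f}")
```

A smaller p-value means the split is harder to explain by chance alone, which is a useful sanity check next to the raw win percentages.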

The response length difference

I didn't expect this to matter as much as it did.

| Category | Claude avg tokens | GPT-4 avg tokens | Ratio |
|----------|-------------------|------------------|-------|
| Coding | 392 | 284 | 1.38x |
| Creative writing | 621 | 347 | 1.79x |
| Factual Q&A | 418 | 256 | 1.63x |
| Summarization | 389 | 312 | 1.25x |
| Reasoning | 487 | 298 | 1.63x |
| Instruction following | 614 | 378 | 1.62x |

Claude gives longer answers across the board. 1.25x to 1.79x longer. Some people will love this. Some will hate it.
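The Ratio column is just Claude's average token count divided by GPT-4's. A short sketch that reproduces it from the numbers in the table (the variable names are illustrative):

```python
# (Claude avg tokens, GPT-4 avg tokens) per category, copied from the table.
avg_tokens = {
    "Coding": (392, 284),
    "Creative writing": (621, 347),
    "Factual Q&A": (418, 256),
    "Summarization": (389, 312),
    "Reasoning": (487, 298),
    "Instruction following": (614, 378),
}

ratios = {cat: round(c / g, 2) for cat, (c, g) in avg_tokens.items()}
overall = round(487 / 312, 2)  # overall averages from the results table

for cat, ratio in ratios.items():
    print(f"{cat}: {ratio}x")
```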

For creative writing and complex instructions, the extra length is genuinely useful. Claude fills in details, provides more examples, and develops ideas more fully. For factual Q&A, the extra length is sometimes unnecessary padding. GPT-4's concise answers are often better.

The pricing implication is real too. If Claude produces about 1.6x as many output tokens on average, and you're paying per token, your effective cost is higher than the per-token price suggests. Even if Claude's per-token price is lower, the total bill might be comparable.

Where Claude surprised me

Instruction following with constraints. I gave both models prompts like "Write a 200-word product description using exactly 3 bullet points, no adjectives, in second person." Claude hit the constraints more consistently. GPT-4 often ignored one or two constraints while producing higher-quality text.

| Constraint adherence | Claude | GPT-4 |
|----------------------|--------|-------|
| Word count within 10% | 72% | 54% |
| Format constraints (bullets, headers) | 88% | 76% |
| Style constraints (no adjectives, etc.) | 64% | 48% |
| All constraints met | 52% | 38% |

This matters for production applications where you need predictable output format. If your prompt says "return JSON with these fields," Claude is more likely to give you exactly that.
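Checks like these are easy to automate. Below is a minimal sketch, assuming a word is any run of letters, digits, or apostrophes and a bullet is a line starting with `-`, `*`, or `•`. The helper names `check_description` and `has_json_fields` are hypothetical, and style constraints such as "no adjectives" are left out because they need more than string matching.

```python
import json
import re

def check_description(text, target_words=200, tolerance=0.10, bullets=3):
    """Check word count (within tolerance) and the exact number of bullets."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    bullet_lines = [
        line for line in text.splitlines()
        if line.lstrip().startswith(("-", "*", "•"))
    ]
    return {
        "word_count_ok": abs(len(words) - target_words) <= tolerance * target_words,
        "bullet_count_ok": len(bullet_lines) == bullets,
    }

def has_json_fields(text, required):
    """True if text parses as a JSON object containing every required field."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and set(required).issubset(data)
```

Running `check_description` over every response and averaging each flag gives adherence percentages of the kind shown in the table.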

Admitting uncertainty. I noticed this qualitatively and then went back to count it. When asked factual questions where the answer is genuinely uncertain or unknown:

| Behavior | Claude | GPT-4 |
|----------|--------|-------|
| Says "I'm not sure" or equivalent | 34% | 18% |
| Gives confident but wrong answer | 12% | 22% |
| Gives correct answer | 48% | 56% |
| Refuses to answer | 6% | 4% |

Claude admits uncertainty almost twice as often as GPT-4. GPT-4 gives more correct answers (56% vs 48%) but also gives more confidently wrong answers (22% vs 12%). Depending on your use case, you might prefer Claude's honesty over GPT-4's higher hit rate.
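One way to put a number on that tradeoff is to look only at the questions where each model committed to an answer and ask how often that answer was right. This is a derived metric computed from the table, not something measured separately, and the function name is illustrative.

```python
# Percentages copied from the table above.
results = {
    "Claude": {"unsure": 34, "wrong": 12, "correct": 48, "refused": 6},
    "GPT-4": {"unsure": 18, "wrong": 22, "correct": 56, "refused": 4},
}

def accuracy_when_answering(row):
    """Share of committed answers (correct or wrong) that were correct."""
    answered = row["correct"] + row["wrong"]
    return row["correct"] / answered

for model, row in results.items():
    print(f"{model}: {accuracy_when_answering(row):.1%}")
```

On these figures Claude is right about 80% of the time when it commits to an answer, versus roughly 72% for GPT-4, even though GPT-4 answers more questions correctly in total.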

The pricing comparison at the time of testing

| Model | Input ($/1K tokens) | Output ($/1K tokens) | My avg cost per prompt |
|-------|---------------------|----------------------|------------------------|
| Claude v1.3 | $0.008 | $0.024 | $0.016 |
| GPT-4 (8K) | $0.030 | $0.060 | $0.026 |

Sources: Anthropic pricing, OpenAI pricing, April 2023.

Claude is about 40% cheaper per prompt in my testing. Combined with competitive quality on writing and summarization tasks, it's genuinely a viable alternative for non-coding use cases.
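To estimate the per-prompt cost for your own workload, multiply token counts by the per-1K prices. A minimal sketch: the prices and average output lengths come from the tables in this post, but the 300-token input length is a hypothetical placeholder (average input length isn't reported here), so the results will not match the "avg cost per prompt" column exactly.

```python
# $ per 1K tokens, from the pricing table above.
PRICES = {
    "Claude v1.3": {"input": 0.008, "output": 0.024},
    "GPT-4 (8K)": {"input": 0.030, "output": 0.060},
}

def cost_per_prompt(model, input_tokens, output_tokens):
    """Dollar cost of one request at the listed per-1K-token prices."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000

# Output lengths are the overall averages reported earlier;
# the 300-token input is an assumed value, not measured data.
claude_cost = cost_per_prompt("Claude v1.3", 300, 487)
gpt4_cost = cost_per_prompt("GPT-4 (8K)", 300, 312)
print(f"Claude: ${claude_cost:.4f}  GPT-4: ${gpt4_cost:.4f}")
```

Because output tokens cost more than input tokens for both models, the longer responses eat into Claude's price advantage, but at these prices they don't erase it.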

My overall read

GPT-4 is still the best model overall. But "overall" hides a lot of variation. If I were building a writing tool, I'd seriously consider Claude. If I were building a coding tool, GPT-4, no contest.

The most interesting thing about this comparison isn't who wins. It's that there IS a real comparison to make. When GPT-4 launched in March, it had no serious competitor. Now there's Claude in beta, LLaMA derivatives improving weekly, and Google's PaLM 2 coming soon.

The LMSYS Chatbot Arena Elo ratings confirm what my data shows: Claude is in the conversation. And the Hugging Face Open LLM Leaderboard is filling up with new contenders every week.

Competition is good. For users, this means better models and lower prices. For my spreadsheets, it means more work. I'm okay with both.



-- dataku
