Benchmark Analysis · June 9, 2025 · 5 min read

The SWE-bench Verified leaderboard: who's actually solving real bugs?

SWE-bench Verified filters out the easy problems. I compared scores on full SWE-bench vs Verified for 12 models. The biggest drops are around 20 points. The gap reveals who's gaming the benchmark vs who's actually good at coding.

SWE-bench Verified is the coding benchmark I trust most right now. It filters out the easy, pattern-matchable problems from the full SWE-bench and keeps only the ones that require genuine understanding of real codebases.

I compared 12 models on both the full SWE-bench and SWE-bench Verified. The drop-off tells you who's actually good at coding and who's been optimizing for easy test cases.

Full SWE-bench vs Verified

| Model | SWE-bench (full) | SWE-bench Verified | Drop (points) | Drop % |
|-------|------------------|--------------------|---------------|--------|
| Claude Opus 4 | 71.2% | 58.7% | -12.5 | -18% |
| Claude 4 Sonnet | 64.8% | 52.4% | -12.4 | -19% |
| Devin (Cognition) | 68.3% | 48.2% | -20.1 | -29% |
| GPT-4o | 52.8% | 33.2% | -19.6 | -37% |
| Factory AI Drafter | 61.4% | 44.1% | -17.3 | -28% |
| DeepSeek V3 | 56.1% | 42.0% | -14.1 | -25% |
| DeepSeek R1 | 62.3% | 49.2% | -13.1 | -21% |
| Gemini 2.5 Pro | 55.6% | 41.3% | -14.3 | -26% |
| Qwen3 235B | 50.2% | 35.8% | -14.4 | -29% |
| Llama 4 Maverick | 44.1% | 28.6% | -15.5 | -35% |
| Grok 3 | 49.8% | 39.8% | -10.0 | -20% |
| o3 | 64.1% | 51.8% | -12.3 | -19% |

Sources: SWE-bench leaderboard, Anthropic, OpenAI, Cognition Labs, model provider reports.

The "drop percentage" column is the key

The drop percentage is the point drop divided by the full SWE-bench score. A low drop percentage means the model's score holds up on the harder problems. A high drop percentage means its full score was inflated by easy problems.

| Drop category | Models | Interpretation |
|---------------|--------|----------------|
| Low drop (up to 20%) | Claude Opus 4, Claude 4 Sonnet, o3, Grok 3 | Genuinely good at coding |
| Medium drop (21-28%) | DeepSeek R1, DeepSeek V3, Gemini 2.5 Pro, Factory AI Drafter | Good, with some easy-problem inflation |
| High drop (29%+) | Devin, GPT-4o, Llama 4 Maverick, Qwen3 235B | Significant easy-problem inflation |

Claude Opus 4 drops only 18%. It scores well on the hard problems, not just the easy ones. o3 also has a low drop (19%), suggesting its reasoning capability genuinely helps with hard bugs.

GPT-4o drops 37%. Its full SWE-bench score (52.8%) looks respectable, but the Verified score (33.2%) tells the real story. On the problems that actually require understanding codebases, it struggles.

Devin drops 29%. Its full score of 68.3% is impressive, but the 48.2% Verified score suggests significant optimization for the easier test cases in the benchmark.
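If you want to check the drop figures yourself, here is a minimal Python sketch. It takes the scores from the comparison table as given and recomputes both the point drop and the percentage drop. The names `SCORES`, `drop_stats`, and `ranked_by_drop` are mine, purely for illustration.

```python
# Scores copied from the comparison table: (full SWE-bench %, Verified %).
SCORES = {
    "Claude Opus 4": (71.2, 58.7),
    "Claude 4 Sonnet": (64.8, 52.4),
    "Devin (Cognition)": (68.3, 48.2),
    "GPT-4o": (52.8, 33.2),
    "Factory AI Drafter": (61.4, 44.1),
    "DeepSeek V3": (56.1, 42.0),
    "DeepSeek R1": (62.3, 49.2),
    "Gemini 2.5 Pro": (55.6, 41.3),
    "Qwen3 235B": (50.2, 35.8),
    "Llama 4 Maverick": (44.1, 28.6),
    "Grok 3": (49.8, 39.8),
    "o3": (64.1, 51.8),
}


def drop_stats(full, verified):
    """Return (drop in points, drop as a percentage of the full score)."""
    drop_points = full - verified
    drop_pct = 100.0 * drop_points / full
    return drop_points, drop_pct


def ranked_by_drop(scores):
    """List (model, drop_points, drop_pct), smallest relative drop first."""
    rows = [(name, *drop_stats(full, ver)) for name, (full, ver) in scores.items()]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    for name, points, pct in ranked_by_drop(SCORES):
        print(f"{name:<20} -{points:4.1f} pts  -{pct:3.0f}%")
```

I rank by the relative drop rather than the point drop because the two can disagree. Grok 3 has the smallest point drop in the table (10.0), but it starts from a lower full score, so its percentage drop is not the smallest.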

What makes "Verified" problems harder?

| Characteristic | Full SWE-bench | Verified subset |
|----------------|----------------|-----------------|
| Avg lines of code to change | 12 | 34 |
| Files typically modified | 1-2 | 2-5 |
| Requires understanding test suite | Sometimes | Almost always |
| Requires understanding module interactions | Rarely | Usually |
| Can be solved by pattern matching | Often | Rarely |

Sources: SWE-bench analysis, benchmark paper on arXiv.

The Verified problems require modifying more code across more files. They require understanding how different parts of the codebase interact. You can't solve them by recognizing the bug pattern and applying a template fix.

The agent scaffolding matters

| Model + Scaffold | SWE-bench Verified |
|------------------|--------------------|
| Claude Opus 4 (with Anthropic's scaffold) | 58.7% |
| Claude Opus 4 (bare API, simple loop) | 42.1% |
| o3 (with OpenAI's scaffold) | 51.8% |
| o3 (bare API) | 38.4% |

The same model scores roughly 13 to 17 points higher with good scaffolding (16.6 points for Claude Opus 4, 13.4 for o3). The agent framework, tool configuration, and prompting strategy matter almost as much as the base model.

This is why comparing SWE-bench scores across providers is tricky. You're not just comparing models. You're comparing the entire agent stack.
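To put a number on how much of a headline score belongs to the scaffold, here is a small Python sketch using the four scores from the scaffold table as given. `SCAFFOLD_SCORES`, `scaffold_gain`, and `scaffold_share` are hypothetical names I picked for this example. The "share" is just one way to frame it, not an established metric.

```python
# Verified scores from the scaffold table: (with provider scaffold %, bare API %).
SCAFFOLD_SCORES = {
    "Claude Opus 4": (58.7, 42.1),
    "o3": (51.8, 38.4),
}


def scaffold_gain(with_scaffold, bare):
    """Points added by the scaffold on top of the bare-API score."""
    return with_scaffold - bare


def scaffold_share(with_scaffold, bare):
    """Percentage of the scaffolded score that disappears without the scaffold."""
    return 100.0 * (with_scaffold - bare) / with_scaffold


if __name__ == "__main__":
    for model, (scaffolded, bare) in SCAFFOLD_SCORES.items():
        gain = scaffold_gain(scaffolded, bare)
        share = scaffold_share(scaffolded, bare)
        print(f"{model:<15} +{gain:.1f} pts from scaffold ({share:.0f}% of the headline score)")
```

On these numbers, more than a quarter of each headline score goes away when you strip the scaffold, so a cross-provider comparison only means something if the scaffolds are comparable.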

My recommendation

| If you care about... | Trust this benchmark |
|----------------------|----------------------|
| Real coding ability | SWE-bench Verified |
| Competitive programming | Codeforces rating |
| Basic code generation | HumanEval (with caveats) |
| Overall model quality | Chatbot Arena (coding category) |

SWE-bench Verified isn't perfect. It's still a fixed set of repos and bugs, which means its problems will eventually leak into training data. But right now, it's the best signal we have for "can this model actually fix real bugs in real codebases?"

I trust the Verified leaderboard more than any other coding benchmark. The drop from full to Verified is the honesty check.



-- dataku
