The SWE-bench problem: are coding benchmarks measuring the right thing?
Every new model touts its SWE-bench score. I analyzed the test cases and found that about 23% of them are trivial one-line fixes, and that a simple regex patcher with no AI resolves nearly 5%. The benchmark isn't wrong exactly, but it's not measuring what you think.
Every new AI model announcement in 2024 includes a SWE-bench score. "Our model resolves 43% of real GitHub issues!" "New SOTA on SWE-bench Verified!"
I spent two weeks analyzing the actual SWE-bench test cases. What I found makes me question what these scores really mean.
What SWE-bench is (and isn't)
SWE-bench is a benchmark created by Princeton researchers. It takes real GitHub issues from popular Python repos and asks models to generate patches that resolve them. The evaluation is binary: either the patched repo passes the tests tied to the issue or it doesn't.
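To make the binary criterion concrete, here is a minimal sketch of the resolve check. SWE-bench instances list two groups of tests: `FAIL_TO_PASS` (tests the fix should turn green) and `PASS_TO_PASS` (tests that must stay green). The `is_resolved` helper and the `outcomes` dictionary are hypothetical names for illustration, not the official harness.

```python
def is_resolved(test_outcomes, fail_to_pass, pass_to_pass):
    """Return True only if every required test passed after applying the patch.

    test_outcomes: dict mapping test name -> True (passed) / False (failed).
    A test missing from the results counts as a failure (an assumption).
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_outcomes.get(name, False) for name in required)


# Hypothetical example: the patch fixes the target test but breaks another one.
outcomes = {"test_issue_fixed": True, "test_existing_behavior": False}
print(is_resolved(outcomes, ["test_issue_fixed"], ["test_existing_behavior"]))  # False
```

There is no partial credit here: a patch that fixes the issue but breaks one existing test scores the same as an empty patch.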
The dataset:
- 2,294 total instances in SWE-bench
- 500 instances in SWE-bench Verified (human-validated subset)
- Drawn from 12 popular Python repos (django, flask, sympy, scikit-learn, etc.)
- Each instance: a GitHub issue + the repo state before the fix
This sounds great. Real issues. Real code. Binary pass/fail. No subjective ratings.
But when I dug into the actual test cases, the picture got complicated.
The difficulty distribution problem
I categorized 200 randomly selected SWE-bench instances by complexity:

| Complexity level | Count | Percentage | Description |
|------------------|-------|------------|-------------|
| Trivial (1-line fix) | 47 | 23.5% | Typo, missing import, wrong default value |
| Simple (2-5 lines) | 58 | 29.0% | Add a parameter, fix a conditional, small logic bug |
| Moderate (6-20 lines) | 52 | 26.0% | Refactor a function, add error handling, new feature |
| Complex (20-100 lines) | 31 | 15.5% | Multi-file change, architectural fix, new API |
| Very complex (100+ lines) | 12 | 6.0% | Major feature, redesign, cross-cutting concern |

Source: My manual categorization of 200 SWE-bench instances, August 2024.
23.5% of the issues I examined are trivial one-line fixes. A typo in a variable name. A missing import statement. A default parameter value that should be None instead of [].
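My categorization was manual, but the size buckets can be approximated mechanically. The sketch below counts changed lines in a unified diff and maps the result to a bucket. Two things here are assumptions rather than part of my method: patch size is taken as `max(added, removed)`, and the overlapping boundaries (20 and 100) are assigned to the lower bucket.

```python
def changed_lines(diff_text):
    """Count added and removed lines in a unified diff, skipping file headers.

    Simple heuristic: a removed line whose content itself starts with "--"
    would be mistaken for a header and skipped.
    """
    added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file header lines, not content changes
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


def complexity_bucket(diff_text):
    """Map a patch to a size bucket. Size = max(added, removed) is an assumption."""
    added, removed = changed_lines(diff_text)
    size = max(added, removed)
    if size <= 1:
        return "Trivial"
    if size <= 5:
        return "Simple"
    if size <= 20:
        return "Moderate"
    if size <= 100:
        return "Complex"
    return "Very complex"


# Hypothetical one-line patch of the "wrong default value" kind.
example_patch = """\
--- a/utils.py
+++ b/utils.py
@@ -10,1 +10,1 @@
-def collect(items=[]):
+def collect(items=None):
"""
print(complexity_bucket(example_patch))  # Trivial
```

Line counts are a crude proxy: a one-line change can still require deep understanding of the codebase, which is why a manual pass is worth the effort.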
I wrote a simple regex-based patcher (no AI at all, just pattern matching for common Python bugs) and tested it:

| Method | SWE-bench Lite resolve rate | Time per instance |
|--------|-----------------------------|-------------------|
| My regex patcher (no AI) | 4.8% | <1 second |
| GPT-4o | 33.2% | ~45 seconds |
| Claude 3.5 Sonnet | 33.4% | ~50 seconds |
| Cognition Labs Devin (agent) | 13.8% (original claim) | ~10 minutes |
| SWE-Agent (open source) | 12.5% | ~8 minutes |
| Aider + Claude 3.5 Sonnet | 26.3% | ~2 minutes |

Sources: SWE-bench leaderboard, Anthropic evaluation data, OpenAI SWE-bench Verified results, my regex patcher results.
A dumb regex patcher solves 4.8% of SWE-bench Lite. That means roughly 1 in 20 issues can be fixed without any understanding of the code whatsoever.
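For a sense of what "no understanding of the code" looks like, here is a minimal sketch of a rule-based patcher. It is not my actual patcher; the three rules are hypothetical examples of single-line rewrites for common Python mistakes, and they will also fire inside strings and comments.

```python
import re

# Hypothetical single-line rewrite rules: (compiled pattern, replacement).
RULES = [
    (re.compile(r"==\s*None\b"), "is None"),
    (re.compile(r"!=\s*None\b"), "is not None"),
    (re.compile(r"\bexcept\s*:"), "except Exception:"),
]


def regex_patch(source):
    """Apply every rule to every line; return (patched_source, changed_line_numbers)."""
    patched, changed = [], []
    for number, line in enumerate(source.splitlines(), start=1):
        new_line = line
        for pattern, replacement in RULES:
            new_line = pattern.sub(replacement, new_line)
        if new_line != line:
            changed.append(number)
        patched.append(new_line)
    return "\n".join(patched), changed


buggy = "def f(x):\n    if x == None:\n        return 0\n    return x"
fixed, lines = regex_patch(buggy)
print(lines)  # [2]
print(fixed)
```

Nothing in this loop reads the issue text or knows what the function is for. When a patcher like this resolves an instance, it is because the required fix happened to match a pattern.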
What "33% resolve rate" actually means
When a model scores 33% on SWE-bench, what does that 33% consist of?
I analyzed which instances GPT-4o and Claude 3.5 Sonnet solve vs which they don't:

| Instance complexity | GPT-4o resolve rate | Claude 3.5 Sonnet resolve rate |
|---------------------|---------------------|--------------------------------|
| Trivial (1-line) | 72% | 74% |
| Simple (2-5 lines) | 48% | 52% |
| Moderate (6-20 lines) | 24% | 26% |
| Complex (20-100 lines) | 8% | 9% |
| Very complex (100+) | 2% | 2% |

Source: My analysis of model outputs on my 200-instance sample, August 2024.
Models resolve the easy stuff reliably and the hard stuff almost never. A 33% overall score means: "good at trivial and simple bugs, mediocre at moderate bugs, bad at complex bugs."
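The composition of a headline score can be worked out from the two tables above. The sketch below combines the bucket counts from my 200-instance sample with the GPT-4o per-bucket resolve rates. The overall rate it produces describes that sample only and need not match a leaderboard figure, since the sample and SWE-bench Lite are different sets.

```python
# Bucket sizes from the 200-instance sample and the reported GPT-4o resolve rates.
buckets = {
    "Trivial":      (47, 0.72),
    "Simple":       (58, 0.48),
    "Moderate":     (52, 0.24),
    "Complex":      (31, 0.08),
    "Very complex": (12, 0.02),
}

total_instances = sum(count for count, _ in buckets.values())
# Expected number of resolved instances per bucket (count x resolve rate).
expected_resolved = {name: count * rate for name, (count, rate) in buckets.items()}
total_resolved = sum(expected_resolved.values())

overall_rate = total_resolved / total_instances
easy_share = (expected_resolved["Trivial"] + expected_resolved["Simple"]) / total_resolved

print(f"Sample-implied overall resolve rate: {overall_rate:.1%}")   # 38.4%
print(f"Resolved instances that are trivial or simple: {easy_share:.1%}")  # 80.2%
```

On these numbers, roughly four out of five resolved instances come from the two easiest buckets, even though those buckets are only about half of the sample.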
This is useful information! But it's not the same as "the model can solve 33% of real software engineering tasks." Real software engineering tasks are weighted toward the moderate-to-complex end. The easy bugs usually get caught by linters and CI pipelines before they become GitHub issues.
The SWE-bench Verified subset
OpenAI, working with the SWE-bench authors, released a "Verified" subset of 500 instances in August 2024. Human software engineers validated each instance to confirm that the issue is well specified and the test cases are fair. This helps with quality but doesn't change the difficulty distribution issue.

| Metric | SWE-bench (full) | SWE-bench Verified |
|--------|------------------|--------------------|
| Total instances | 2,294 | 500 |
| Human-verified | No | Yes |
| Trivial instances (est.) | ~23% | ~15% (better filtered) |
| Top model score | ~33% | ~43% |
| Median model score | ~12% | ~18% |

Sources: SWE-bench leaderboard, arXiv SWE-bench paper.
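Verified has fewer trivial instances, which on its own would pull scores down, since trivial fixes are what models resolve most often. The higher scores more likely come from the validation step, which removed underspecified issues and instances whose tests rejected valid fixes. Either way, the core issue remains: models are much better at simple fixes than complex ones.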
What would a better coding benchmark look like?
I'm not saying SWE-bench is bad. It's the best coding benchmark we have. But here's what I'd improve:

| SWE-bench current | What I'd add |
|-------------------|--------------|
| Resolves issue (binary) | Partial credit for correct diagnosis + wrong fix |
| All difficulties equally weighted | Difficulty-weighted scoring (complex issues worth more) |
| Single-attempt | Multi-attempt scoring (can model improve with feedback?) |
| Python repos only | Multi-language evaluation |
| Patch generation only | Architecture decisions, code review, debugging separately |

The dream benchmark would separate "can the model find the bug?" from "can the model write the fix?" and "can the model handle a complex multi-file change?" These are different skills, and SWE-bench collapses them all into one binary score.
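Difficulty-weighted scoring is the easiest of these to prototype. Here is a minimal sketch, assuming each instance already carries a complexity label. The weights are hypothetical (doubling per level) and only there to show the mechanics; a real benchmark would need to justify them.

```python
# Hypothetical weights: harder buckets count for more. Values are illustrative only.
WEIGHTS = {"Trivial": 1, "Simple": 2, "Moderate": 4, "Complex": 8, "Very complex": 16}


def weighted_score(results):
    """results: list of (bucket, resolved) pairs. Returns a score in [0, 1]."""
    earned = sum(WEIGHTS[bucket] for bucket, resolved in results if resolved)
    possible = sum(WEIGHTS[bucket] for bucket, _ in results)
    return earned / possible if possible else 0.0


# A model that resolves three easy instances and misses one complex one.
results = [("Trivial", True), ("Trivial", True), ("Simple", True), ("Complex", False)]

unweighted = sum(resolved for _, resolved in results) / len(results)
print(f"Unweighted: {unweighted:.2f}")               # 0.75
print(f"Weighted:   {weighted_score(results):.2f}")  # 0.33
```

The same run drops from 0.75 to 0.33 once the missed complex instance counts for its weight, which is the gap the flat score hides.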
My honest assessment of AI coding capabilities (August 2024)
Based on my SWE-bench analysis and my own coding evaluations:

| Task | AI capability (Aug 2024) | Confidence |
|------|--------------------------|------------|
| Fix typos and simple bugs | Excellent (70%+ success) | High |
| Write boilerplate code | Excellent (80%+ match expectations) | High |
| Implement well-specified functions | Good (60-70% correct first try) | High |
| Debug complex logic errors | Fair (25-35% correct) | Medium |
| Multi-file refactoring | Poor (10-15% correct) | Medium |
| Architecture decisions | Poor (hard to measure, but not there) | Low |
| Understanding unfamiliar codebases | Fair (can summarize, struggles to modify) | Medium |

Source: My evaluations, SWE-bench analysis, coding assistant testing.
AI coding assistants are genuinely useful for the easy 50% of programming tasks. For the hard 50%, they're unreliable. SWE-bench scores make it look like we're further along than we are because the benchmark is weighted toward the easy half: 52.5% of my sample was trivial or simple.
This doesn't mean SWE-bench is useless. It means you should read "33% on SWE-bench" as "very good at easy bugs, occasionally good at moderate bugs, rarely solves hard bugs." Not as "solves a third of all software engineering."
The benchmark tells us something real. Just not the thing you might assume from the headline number.
-- dataku