Model Comparisons · June 10, 2024 · 7 min read

I benchmarked 12 coding assistants. Cursor is not what I expected.

GitHub Copilot, Cursor, Cody, Continue, Tabnine, and 7 others. I used each one for a full week and tracked acceptance rates, bug rates, and time saved. Cursor surprised me. Copilot disappointed me.

I spent 12 weeks using a different AI coding assistant each week. Same codebase (a medium-sized Python web app, ~15K lines). Same tasks each week (bug fixes, new features, refactors, tests). I tracked everything.

This was the most time-consuming evaluation I've ever run. Here's every number.

The contenders

| Tool | Version tested | Model(s) used | Price/month | IDE support |
|------|---------------|---------------|-------------|-------------|
| GitHub Copilot | Business | GPT-4 + custom | $19 | VS Code, JetBrains, Neovim |
| Cursor | Pro | GPT-4, Claude 3.5 Sonnet | $20 | Cursor (VS Code fork) |
| Sourcegraph Cody | Pro | Claude 3, StarCoder | $9 | VS Code, JetBrains |
| Continue | Free | Any (I used Claude 3 Sonnet) | Free + API costs | VS Code, JetBrains |
| Tabnine | Pro | Tabnine models | $12 | VS Code, JetBrains, others |
| Replit AI | Pro | Custom models | $25 | Replit IDE only |
| Amazon CodeWhisperer | Professional | CodeWhisperer model | $19 | VS Code, JetBrains |
| Codeium | Pro | Custom models | $12 | VS Code, JetBrains, others |
| Supermaven | Pro | Custom (Supermaven) | $10 | VS Code |
| Aider | Free | Claude 3.5 Sonnet (my API) | Free + API costs | Terminal |
| JetBrains AI | Included | JetBrains AI models | $8.90 | JetBrains only |
| Qodo (formerly Codium) | Free tier | Various | Free | VS Code, JetBrains |

Sources: Official pricing pages, June 2024.

The headline numbers

| Tool | Suggestion acceptance rate | Bug introduction rate | Time saved (est.) | Codebase awareness | Overall score (1-10) |
|------|--------------------------|----------------------|-------------------|-------------------|---------------------|
| Cursor | 52% | 8% | 38% | Excellent | 8.7 |
| Aider | 41% | 6% | 42% | Excellent | 8.2 |
| Continue | 38% | 9% | 29% | Good | 7.4 |
| GitHub Copilot | 44% | 14% | 24% | Poor | 6.8 |
| Sourcegraph Cody | 36% | 7% | 26% | Excellent | 7.1 |
| Supermaven | 48% | 11% | 22% | Fair | 6.5 |
| Codeium | 40% | 12% | 20% | Fair | 6.2 |
| Amazon CodeWhisperer | 32% | 15% | 16% | Poor | 5.4 |
| Tabnine | 34% | 10% | 18% | Fair | 5.8 |
| Replit AI | 38% | 13% | 22% | Good (Replit only) | 6.0 |
| JetBrains AI | 30% | 12% | 15% | Fair | 5.2 |
| Qodo | 28% | 8% | 14% | Good | 5.5 |

Source: My measurements over 12 weeks, one week per tool, same tasks and codebase. "Bug introduction rate" = percentage of accepted suggestions that required fixes. "Time saved" = estimated reduction in task completion time vs no AI assistance.
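To make those definitions concrete, here is a minimal sketch of how the three metrics could be computed from a per-suggestion log. The `Suggestion` record, the function names, and the sample log are all hypothetical illustrations, not the tracking data behind the table.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One offered suggestion (hypothetical record layout)."""
    accepted: bool             # was the suggestion kept?
    needed_fix: bool = False   # did an accepted suggestion later require a fix?


def acceptance_rate(log):
    """Share of offered suggestions that were accepted."""
    if not log:
        return 0.0
    return sum(s.accepted for s in log) / len(log)


def bug_introduction_rate(log):
    """Share of ACCEPTED suggestions that later required a fix."""
    accepted = [s for s in log if s.accepted]
    if not accepted:
        return 0.0
    return sum(s.needed_fix for s in accepted) / len(accepted)


def time_saved(baseline_minutes, assisted_minutes):
    """Fractional reduction in task time versus working with no AI assistance."""
    return (baseline_minutes - assisted_minutes) / baseline_minutes


# Made-up log: 10 suggestions offered, 4 accepted, 1 of those needed a fix.
sample_log = (
    [Suggestion(True), Suggestion(True, needed_fix=True),
     Suggestion(True), Suggestion(True)]
    + [Suggestion(False)] * 6
)

print(f"acceptance rate:       {acceptance_rate(sample_log):.0%}")
print(f"bug introduction rate: {bug_introduction_rate(sample_log):.0%}")
print(f"time saved:            {time_saved(60, 45):.0%}")
```

Note that the bug introduction rate is divided by accepted suggestions, not by all offered suggestions, which is why a tool with a high acceptance rate can still look bad on this metric.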

The Cursor surprise

I went in expecting GitHub Copilot to win. It's the most popular tool, it's backed by OpenAI and GitHub, and it has the largest user base. I expected Cursor to be a niche VS Code fork with marginal improvements.

I was wrong.

Cursor scored 8.7/10 overall, the highest of any tool. The four things that separated it:

| Feature | Cursor | Copilot | Why it matters |
|---------|--------|---------|---------------|
| Codebase indexing | Indexes entire project, references relevant files | Limited context window, often misses context | Suggestions reference your actual code, not generic patterns |
| Multi-file edits | Can edit multiple files in one operation | Single-file inline suggestions only | Refactors are the hardest task; Cursor handles them |
| Model selection | GPT-4, Claude 3.5 Sonnet, mix and match | GPT-4 (fixed) | Claude 3.5 Sonnet was better for code in my experience |
| Chat + edit integration | Chat proposes, you accept diffs | Chat and autocomplete feel like separate features | Workflow is smoother, fewer context switches |

The codebase awareness is the killer feature. When I asked Cursor to "add error handling to the payment endpoint," it found the payment endpoint across 3 files, understood the existing error patterns in my codebase, and generated changes consistent with my project's style. Copilot gave me generic error handling that didn't match my patterns.

The Copilot disappointment

GitHub Copilot has a 44% acceptance rate, which sounds good until you see the 14% bug introduction rate. That means 14% of accepted suggestions had bugs that required me to go back and fix them.

| Copilot strength | Copilot weakness |
|-----------------|-----------------|
| Fast inline suggestions (50 ms) | Poor codebase awareness |
| Works in any file type | High bug rate (14%) |
| Good for boilerplate code | Bad at multi-file changes |
| Reliable availability | Chat feature feels bolted on |
| Huge plugin and extension support | Suggestions ignore project context |

The 14% bug rate is concerning. If I accept 44 suggestions per 100 offered and about 6 of those 44 (14%) introduce bugs, I'm spending time fixing AI-generated bugs. At a certain point, the time spent fixing AI bugs eats into the time saved by AI suggestions.
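The trade-off can be sketched as simple arithmetic. The acceptance and bug rates below come from the headline table; the minutes saved per accepted suggestion (2) and minutes spent per fix (10) are hypothetical placeholders I did not measure, so treat the output as an illustration of the shape of the problem, not a result.

```python
def net_minutes_saved(offered, acceptance_rate, bug_rate,
                      minutes_saved_per_accept, minutes_per_fix):
    """Minutes saved by accepted suggestions minus minutes spent fixing buggy ones."""
    accepted = offered * acceptance_rate
    buggy = accepted * bug_rate
    return accepted * minutes_saved_per_accept - buggy * minutes_per_fix


def break_even_fix_minutes(bug_rate, minutes_saved_per_accept):
    """Fix time per bug at which the savings are fully cancelled out."""
    if bug_rate == 0:
        return float("inf")  # no bugs, so no fix time can cancel the savings
    return minutes_saved_per_accept / bug_rate


# Rates from the headline table; the two time values are assumptions.
ASSUMED_MINUTES_SAVED_PER_ACCEPT = 2
ASSUMED_MINUTES_PER_FIX = 10

copilot_net = net_minutes_saved(100, 0.44, 0.14,
                                ASSUMED_MINUTES_SAVED_PER_ACCEPT,
                                ASSUMED_MINUTES_PER_FIX)
cursor_net = net_minutes_saved(100, 0.52, 0.08,
                               ASSUMED_MINUTES_SAVED_PER_ACCEPT,
                               ASSUMED_MINUTES_PER_FIX)

print(f"Copilot net minutes per 100 suggestions: {copilot_net:.1f}")
print(f"Cursor net minutes per 100 suggestions:  {cursor_net:.1f}")
print(f"Copilot break-even fix time: "
      f"{break_even_fix_minutes(0.14, ASSUMED_MINUTES_SAVED_PER_ACCEPT):.1f} min")
```

Under these assumed times, Copilot still nets positive, but the margin shrinks quickly as fixes get slower: with a 14% bug rate, the savings disappear once an average fix takes about seven times as long as an average accepted suggestion saves.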

For simple, single-file autocomplete (writing a for loop, completing a function signature), Copilot is fine. For anything that requires understanding your project as a whole, it falls short.

The underrated tools

Aider scored 8.2/10 and it's free (you bring your own API key). It's a terminal-based tool that reads your entire git repo, understands your codebase, and makes changes via diffs. It has the lowest bug introduction rate (6%) and the highest estimated time savings (42%). The catch: it's terminal-only, no IDE integration, and the learning curve is steep.

Sourcegraph Cody at 7.1/10 has excellent codebase awareness thanks to Sourcegraph's code graph indexing. It knows about every function, class, and import in your project, so the suggestions are contextually accurate. It trails the leaders because of a lower acceptance rate (36%) and slower suggestion speed.

Cost efficiency

| Tool | Monthly cost | My overall score | Score per dollar |
|------|-------------|-----------------|-----------------|
| Continue | ~$8 (API costs) | 7.4 | 0.925 |
| Aider | ~$12 (API costs) | 8.2 | 0.683 |
| Sourcegraph Cody | $9 | 7.1 | 0.789 |
| Cursor | $20 | 8.7 | 0.435 |
| Supermaven | $10 | 6.5 | 0.650 |
| GitHub Copilot | $19 | 6.8 | 0.358 |
| Tabnine | $12 | 5.8 | 0.483 |
| Amazon CodeWhisperer | $19 | 5.4 | 0.284 |

On a score-per-dollar basis, Continue (free tool + API costs) wins. But the absolute best experience is Cursor at $20/month.

Copilot at $19/month has the worst score-per-dollar of the popular tools (0.358). You're paying almost the same as Cursor and getting significantly less.
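For anyone who wants to redo the math with their own prices, score per dollar is just the overall score divided by the monthly cost. A small sketch using the figures from the cost table; the API costs for Continue and Aider are the approximate values listed there, and the variable and function names are my own.

```python
# (monthly cost in USD, overall score) taken from the cost-efficiency table.
# Continue and Aider costs are approximate API spend, not list prices.
TOOLS = {
    "Continue": (8.0, 7.4),
    "Aider": (12.0, 8.2),
    "Sourcegraph Cody": (9.0, 7.1),
    "Cursor": (20.0, 8.7),
    "Supermaven": (10.0, 6.5),
    "GitHub Copilot": (19.0, 6.8),
    "Tabnine": (12.0, 5.8),
    "Amazon CodeWhisperer": (19.0, 5.4),
}


def score_per_dollar(monthly_cost, score):
    """Overall score divided by monthly cost in dollars."""
    if monthly_cost <= 0:
        raise ValueError("monthly cost must be positive to compute a ratio")
    return score / monthly_cost


# Rank tools from best to worst value.
ranked = sorted(
    ((name, score_per_dollar(cost, score)) for name, (cost, score) in TOOLS.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, value in ranked:
    print(f"{name:<22}{value:.3f}")
```

Sorted this way, Sourcegraph Cody (0.789) ranks ahead of Aider (0.683), and Amazon CodeWhisperer (0.284) lands below Copilot. The ratio also breaks down for genuinely free tools, which is why the function rejects a cost of zero.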

My setup going forward

After 12 weeks, I'm switching my daily driver from Copilot to Cursor. The codebase awareness and multi-file editing alone are worth the $20/month. For quick experiments and throwaway scripts, I'll use Aider in the terminal.

I expected this evaluation to confirm that Copilot was the default choice. Instead, it showed me that the default choice is a year behind the best alternative. The coding assistant market is moving fast, and the incumbent isn't keeping up.

The data is clear. I've changed my IDE. And my spreadsheet has 12 new rows.


-- dataku
