Claude 3.5 Sonnet (new) and computer use: my first benchmark data
Anthropic updated Claude 3.5 Sonnet and added computer use. I tested both the model improvements and the computer use capability. Model quality jumped noticeably. Computer use completed 58% of my test tasks (29 of 50).
Anthropic just released two things at once: an updated Claude 3.5 Sonnet (they're calling it the "new" version, same name) and a beta feature called "computer use" that lets Claude control a computer via screenshots and mouse/keyboard commands.
The model update is excellent. The computer use is... promising but rough. Let me show you the data on both.
The model quality update
I ran my standard 300-prompt evaluation on the new Claude 3.5 Sonnet and compared it to the June version:
| Category | Claude 3.5 Sonnet (Oct) | Claude 3.5 Sonnet (Jun) | Change | GPT-4o |
|----------|------------------------|------------------------|--------|--------|
| Factual Q&A (50) | 4.34 | 4.22 | +0.12 | 4.18 |
| Code generation (50) | 4.62 | 4.48 | +0.14 | 4.28 |
| Creative writing (50) | 4.36 | 4.28 | +0.08 | 4.02 |
| Summarization (50) | 4.42 | 4.32 | +0.10 | 4.14 |
| Reasoning (50) | 4.48 | 4.34 | +0.14 | 4.22 |
| Instruction following (50) | 4.46 | 4.38 | +0.08 | 4.24 |
| Overall | 4.45 | 4.34 | +0.11 | 4.18 |
Source: My evaluation, 300 prompts, blind rating, October 2024.
The new Claude 3.5 Sonnet scores 4.45, up from 4.34 in June. Every category improved. Code generation and reasoning tied for the biggest jump (+0.14 each), and the 4.62 code generation score is the highest I've ever measured from any model.
The gap to GPT-4o widened: 4.45 vs 4.18, a 0.27 point lead. In June it was 0.16. Anthropic isn't just maintaining the lead. They're extending it.
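If you want to check the overall scores and the lead yourself, here is a minimal sketch. It assumes the "Overall" row is an unweighted mean of the six category scores (each category has 50 prompts, so that should match a per-prompt average). The variable and function names are mine, not part of any evaluation tooling.

```python
# Category scores copied from the table above, in row order:
# factual Q&A, code generation, creative writing, summarization,
# reasoning, instruction following.
scores = {
    "Claude 3.5 Sonnet (Oct)": [4.34, 4.62, 4.36, 4.42, 4.48, 4.46],
    "Claude 3.5 Sonnet (Jun)": [4.22, 4.48, 4.28, 4.32, 4.34, 4.38],
    "GPT-4o": [4.18, 4.28, 4.02, 4.14, 4.22, 4.24],
}


def overall(category_scores):
    """Unweighted mean of the category scores, rounded to two decimals.

    Assumption: this is how the 'Overall' row was computed.
    """
    return round(sum(category_scores) / len(category_scores), 2)


overall_scores = {model: overall(vals) for model, vals in scores.items()}

# Lead over GPT-4o for each Claude version.
lead_oct = round(overall_scores["Claude 3.5 Sonnet (Oct)"] - overall_scores["GPT-4o"], 2)
lead_jun = round(overall_scores["Claude 3.5 Sonnet (Jun)"] - overall_scores["GPT-4o"], 2)

for model, score in overall_scores.items():
    print(f"{model}: {score:.2f}")
print(f"Lead over GPT-4o: Oct {lead_oct:.2f}, Jun {lead_jun:.2f}")
```

Under that assumption the numbers reproduce the table: 4.45, 4.34, and 4.18 overall, with leads of 0.27 (October) and 0.16 (June).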
The benchmark comparison
| Benchmark | Claude 3.5 Sonnet (Oct) | Claude 3.5 Sonnet (Jun) | GPT-4o | o1-preview |
|-----------|------------------------|------------------------|--------|-----------|
| MMLU | 88.7% | 88.7% | 88.7% | 90.8% |
| HumanEval | 93.7% | 92.0% | 90.2% | 92.4% |
| MATH | 78.3% | 71.1% | 60.3% | 83.3% |
| SWE-bench Verified | 49.0% | 33.4% | 33.2% | 41.3% |
| TAU-bench (agentic) | 62.4% | N/A | 50.1% | N/A |
Sources: Anthropic Claude 3.5 Sonnet announcement, OpenAI documentation, LMSYS Chatbot Arena, SWE-bench leaderboard.
The SWE-bench Verified jump is massive: 33.4% to 49.0%. That's a 15.6 percentage point improvement. The October version solves nearly half of the verified real-world coding issues, up from a third.
And on TAU-bench (a new agentic benchmark measuring tool use and multi-step task completion), Claude 3.5 Sonnet scores 62.4% vs GPT-4o's 50.1%.
Computer use: the raw data
Now for the new capability. Computer use lets Claude look at screenshots, move the mouse, click, type, and interact with any computer interface.
I tested it on 50 tasks across 5 categories:
| Task category | Tasks | Completed | Completion rate | Avg time | Avg attempts |
|-------------|-------|-----------|----------------|----------|-------------|
| Web browsing (find info) | 10 | 8 | 80% | 45 sec | 1.3 |
| Form filling | 10 | 7 | 70% | 62 sec | 1.8 |
| File management | 10 | 6 | 60% | 38 sec | 1.5 |
| Application interaction | 10 | 5 | 50% | 78 sec | 2.4 |
| Multi-step workflows | 10 | 3 | 30% | 120+ sec | 3.2 |
Source: My testing using Anthropic computer use API, October 2024. Tasks performed in a Docker container with Ubuntu desktop.
Overall: 29 of 50 tasks completed successfully. 58% success rate.
The success rate varies dramatically by task complexity. Simple web browsing (80%) is reasonably reliable. Multi-step workflows (30%) are not.
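The per-category and overall rates are simple to recompute from the task counts. A quick sketch, with names of my own choosing:

```python
# (tasks attempted, tasks completed) per category, from the table above.
results = {
    "Web browsing (find info)": (10, 8),
    "Form filling": (10, 7),
    "File management": (10, 6),
    "Application interaction": (10, 5),
    "Multi-step workflows": (10, 3),
}

# Completion rate per category as a fraction between 0 and 1.
rates = {name: done / attempted for name, (attempted, done) in results.items()}

total_attempted = sum(attempted for attempted, _ in results.values())
total_completed = sum(done for _, done in results.values())
overall_rate = total_completed / total_attempted

for name, rate in rates.items():
    print(f"{name:<28} {rate:.0%}")
print(f"Overall: {total_completed} of {total_attempted} ({overall_rate:.0%})")
```

Keep in mind that each category has only 10 tasks, so one task flipping either way moves that category's rate by 10 points.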
Where computer use fails
I categorized the failure modes:
| Failure type | Occurrences (out of 21 failures) | Example |
|-------------|----------------------------------|---------|
| Misidentified UI element | 8 | Clicked "Cancel" instead of "OK" (buttons close together) |
| Lost context mid-task | 5 | Forgot what it was doing after a page reload |
| Screenshot misinterpretation | 4 | Couldn't read small text or low-contrast elements |
| Timing issue | 2 | Clicked before page finished loading |
| Unexpected popup/dialog | 2 | Got stuck on a cookie consent banner |
Source: My analysis of 21 failed tasks in computer use testing.
The most common failure (8 of 21) is misidentifying UI elements. The model processes screenshots at a fixed resolution and sometimes can't distinguish between adjacent buttons or clickable elements. Small text and low-contrast UIs are particularly problematic.
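To see each failure mode as a share of all failures, here is a short sketch using the counts from the table. The dictionary and variable names are just for illustration.

```python
# Failure counts from the table above (21 failed tasks in total).
failures = {
    "Misidentified UI element": 8,
    "Lost context mid-task": 5,
    "Screenshot misinterpretation": 4,
    "Timing issue": 2,
    "Unexpected popup/dialog": 2,
}

total_failures = sum(failures.values())

# Share of all failures attributable to each failure type.
shares = {name: count / total_failures for name, count in failures.items()}

# Print from most to least common.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<30} {failures[name]:>2}  {share:.0%}")
```

Misidentified UI elements alone account for roughly 38% of the failures.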
Computer use cost
Computer use is expensive because every action requires a screenshot (image tokens) plus the model processing:
| Task type | Avg screenshots | Avg total tokens | Avg cost per task |
|-----------|----------------|-----------------|-------------------|
| Simple (web browse) | 4 | 8,000 | $0.14 |
| Medium (form fill) | 7 | 14,000 | $0.25 |
| Complex (multi-step) | 12+ | 28,000+ | $0.50+ |
Source: My measurements, October 2024.
A simple task costs ~$0.14. A complex multi-step workflow costs $0.50+. And at a 30% success rate for complex tasks, the cost per successful complex task is $1.67+.
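The cost-per-success figure is just the cost per task divided by the success rate. A rough sketch, assuming failed runs cost about the same as successful ones and pairing each cost tier with the completion rate of its matching test category (80% web browsing, 70% form filling, 30% multi-step). The complex-task cost is a lower bound, since I measured it as $0.50+.

```python
# (average cost per task in USD, completion rate) per task type.
# Costs come from the cost table; rates come from the completion table.
task_types = {
    "Simple (web browse)": (0.14, 0.80),
    "Medium (form fill)": (0.25, 0.70),
    "Complex (multi-step)": (0.50, 0.30),  # cost is a lower bound ($0.50+)
}


def cost_per_success(cost_per_task, success_rate):
    """Expected spend per successfully completed task.

    Assumption: every run costs the same whether it succeeds or fails.
    """
    return cost_per_task / success_rate


expected = {
    name: cost_per_success(cost, rate) for name, (cost, rate) in task_types.items()
}

for name, value in expected.items():
    print(f"{name:<22} ${value:.3f} per successful task")
```

The gap between sticker price and effective price grows as reliability drops: the simple tier barely moves, while the complex tier more than triples.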
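For comparison, a human virtual assistant on Fiverr costs roughly $3-10 per task. Even at $1.67+ per successful complex task, computer use is cheaper on paper, but it fails far more often than a human does. Its other advantage is that it's available instantly (no hiring, no scheduling, no communication overhead).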
The comparison that matters: computer use vs API integration
| Approach | Setup time | Reliability | Cost per task | Flexibility |
|----------|-----------|-------------|---------------|-------------|
| API integration (if available) | Hours to days | 99%+ | $0.001-0.01 | Low (API-specific) |
| Computer use (Claude) | Minutes | ~58% | $0.14-0.50 | High (any UI) |
| Human (virtual assistant) | Hours (hiring) | 95%+ | $3-10 | Very high |
| RPA (traditional automation) | Days to weeks | 90%+ | $0.01-0.05 | Medium (brittle) |
Computer use sits in a specific niche: tasks where no API exists but automation is valuable. If there's an API, use it. If there isn't, computer use is faster to set up than RPA and more reliable than you'd expect for simple tasks.
My honest assessment
The updated Claude 3.5 Sonnet is the best general-purpose model I've tested. 4.45 on my evaluation, with a standout 4.62 on code generation. That's not debatable. The data is clear.
Computer use is a beta feature that works 58% of the time. That's not good enough for production use. But it's a 1.0 of a capability that didn't exist last week. If the reliability curve follows the pattern of other AI capabilities (rapid improvement over 6-12 months), computer use could be genuinely useful by mid-2025.
I'm adding a "computer use success rate" row to my model tracking spreadsheet. First entry: 58%. Let's see where it goes.
-- dataku