Claude Opus 4.6 review: the 1M context model
Anthropic shipped a 1 million token context window on their flagship model. I tested retrieval at sizes from 50K to 1M tokens. Accuracy stays at or above 90% up to 500K. At 1M it drops to 78%, but that's still usable. The long-context game has a new leader.
Anthropic shipped Claude Opus 4.6 with two headline claims: a 1 million token context window and, according to the announcement, a model that "sets new records on coding and reasoning while maintaining the quality users expect from the Opus tier."
The 1M context window is the story. Let me test it.
Context window comparison
| Model | Context window (tokens) | Provider |
|-------|-------------------------|----------|
| Claude Opus 4.6 | 1,000,000 | Anthropic |
| Gemini 2.5 Ultra | 2,000,000 | Google |
| Claude Opus 4.5 | 200,000 | Anthropic |
| GPT-4o | 128,000 | OpenAI |
| DeepSeek R2 | 128,000 | DeepSeek |
Anthropic went from 200K to 1M. A 5x increase. Google still has the largest window at 2M, but Anthropic is now in the same league.
Needle-in-a-haystack retrieval test
I placed a specific fact ("The secret code is: PURPLE UMBRELLA 7492") at random positions within documents of increasing size. Then asked the model to find and repeat the code.
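A test like this can be sketched in a few lines. This is a minimal illustration, not the exact harness behind the numbers here: `ask_model` is a hypothetical callable standing in for whatever API client you use, the filler text is whatever you supply, and document size is counted in sentences rather than tokens because no tokenizer is assumed.

```python
import random

NEEDLE = "The secret code is: PURPLE UMBRELLA 7492"
CODE = "PURPLE UMBRELLA 7492"


def build_haystack(filler_sentences, n_sentences, rng):
    """Return (document, position): filler text with the needle at a random index."""
    body = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = rng.randrange(n_sentences + 1)  # needle may go first, last, or anywhere between
    body.insert(position, NEEDLE)
    return " ".join(body), position


def run_trials(ask_model, filler_sentences, n_sentences, runs=50, seed=0):
    """Fraction of runs in which the model's answer contains the code.

    ask_model is a hypothetical function: prompt string in, answer string out.
    """
    rng = random.Random(seed)  # seeded so needle positions are reproducible
    hits = 0
    for _ in range(runs):
        document, _ = build_haystack(filler_sentences, n_sentences, rng)
        prompt = document + "\n\nWhat is the secret code? Repeat it exactly."
        answer = ask_model(prompt)
        hits += CODE in answer  # exact substring match counts as a hit
    return hits / runs
```

Scoring by exact substring match is strict on purpose: a paraphrased or partially recalled code counts as a miss.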
| Document size | Claude Opus 4.6 | Claude Opus 4.5 | Gemini 2.5 Ultra |
|---------------|-----------------|-----------------|------------------|
| 50K tokens | 98% | 98% | 98% |
| 100K tokens | 96% | 95% | 97% |
| 200K tokens | 94% | 91% | 95% |
| 250K tokens | 93% | N/A (limit) | 94% |
| 500K tokens | 90% | N/A | 92% |
| 750K tokens | 84% | N/A | 88% |
| 1M tokens | 78% | N/A | 86% |
| 1.5M tokens | N/A | N/A | 82% |
| 2M tokens | N/A | N/A | 76% |
Sources: My needle-in-a-haystack tests, 50 runs per size, March 2026.
At 500K tokens, Claude Opus 4.6 maintains 90% retrieval accuracy. That's impressive. The model can find a hidden fact in a context window equivalent to about 750 pages with 90% reliability.
At 1M tokens, accuracy drops to 78%. Usable, but you'll miss the needle about 1 in 5 times.
Gemini 2.5 Ultra is better at equivalent lengths: a point or two ahead up to 500K, 4 points ahead at 750K, and 86% vs Claude's 78% at 1M. Google's long-context engineering is still ahead.
But Claude at its 1M limit is roughly level with Gemini at its 2M limit (78% vs 76%). So measured at the edge of their respective windows, Claude's accuracy is competitive.
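With 50 runs per size, every percentage in the table carries sampling noise, which matters when comparing 78% against 76%. Here is a sketch of a Wilson score interval for a hit rate, assuming the two figures correspond to 39 and 38 hits out of 50 (that mapping is my assumption for illustration, not something recorded in the test notes).

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = hits / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return centre - half, centre + half


# Assumed hit counts: 39/50 = 78% and 38/50 = 76%
claude_low, claude_high = wilson_interval(39, 50)
gemini_low, gemini_high = wilson_interval(38, 50)
print(f"Claude at 1M: {claude_low:.1%} to {claude_high:.1%}")
print(f"Gemini at 2M: {gemini_low:.1%} to {gemini_high:.1%}")
```

Under those assumptions the interval for 39 out of 50 runs spans roughly 65% to 87%, and the two intervals overlap almost entirely, so a 2-point gap at this sample size does not separate the models.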
General benchmark comparison
| Benchmark | Opus 4.6 | Opus 4.5 | Delta |
|-----------|----------|----------|-------|
| HumanEval | 98.4% | 98.2% | +0.2 |
| SWE-bench V | 65.1% | 64.2% | +0.9 |
| GPQA Diamond | 80.4% | 79.8% | +0.6 |
| MATH | 98.6% | 98.4% | +0.2 |
| Chatbot Arena | 1301 | 1298 | +3 |
| LiveCodeBench | 79.2% | 78.6% | +0.6 |
Sources: Anthropic Opus 4.6 announcement, LMSYS Chatbot Arena, early benchmark data.
Small improvements across the board. The main story of 4.6 is the context window, not raw quality. Opus 4.5 was already the best general model. 4.6 adds long context to that lead.
SWE-bench at 65.1% is a new record. But only 0.9 points above 4.5. We're in diminishing returns territory on coding benchmarks.
Practical long-context testing
Beyond needle-in-a-haystack, I tested real-world long-context tasks:
| Task | Context size | Opus 4.6 quality | Gemini Ultra quality |
|------|--------------|------------------|----------------------|
| Summarize a 200-page technical document | ~300K tokens | Excellent | Excellent |
| Q&A over an entire codebase (100 files) | ~500K tokens | Very good | Good |
| Cross-reference 5 legal contracts | ~400K tokens | Very good | Very good |
| Analyze a full year of email (10K emails) | ~800K tokens | Good | Good |
| Process an entire book + answer questions | ~250K tokens | Excellent | Excellent |
Both models handle 200-500K token tasks well. Above 500K, performance degrades but remains useful.
The codebase Q&A task is where I noticed the biggest practical difference. At 500K tokens, Opus 4.6 correctly identified cross-file dependencies that Gemini missed 15% of the time.
Pricing
| Model | Input (per 1M tokens) | Output (per 1M tokens) | 500K context cost |
|-------|-----------------------|------------------------|-------------------|
| Claude Opus 4.6 | $15.00 | $75.00 | $7.50 input |
| Claude Opus 4.5 | $15.00 | $75.00 | Limited to 200K |
| Gemini 2.5 Ultra | $5.00 | $20.00 | $2.50 input |
Processing a 500K token document costs $7.50 on Opus 4.6 vs $2.50 on Gemini Ultra. Gemini is 3x cheaper for equivalent context sizes.
At 1M tokens, Opus 4.6 costs $15 for input alone. That's meaningful. Long-context processing on the Opus tier is a premium experience in every sense.
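The arithmetic behind those figures is just tokens times the per-million input price. A small sketch, using the input prices from the table above; the function and dictionary names are mine, and output tokens are ignored.

```python
# Input prices in USD per 1M tokens, taken from the pricing table
INPUT_PRICE_PER_MILLION = {
    "Claude Opus 4.6": 15.00,
    "Gemini 2.5 Ultra": 5.00,
}


def input_cost_usd(tokens, price_per_million_usd):
    """Cost of sending `tokens` input tokens at a given per-million price."""
    return tokens / 1_000_000 * price_per_million_usd


for model, price in INPUT_PRICE_PER_MILLION.items():
    for tokens in (500_000, 1_000_000):
        print(f"{model}: {tokens:,} input tokens = ${input_cost_usd(tokens, price):.2f}")
```

Keep in mind this is the cost per request: every follow-up question that resends the same 500K-token document pays the input price again unless the provider offers some form of caching.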
My assessment
| Feature | Verdict |
|---------|---------|
| 1M context window quality | Good (90% at 500K, 78% at 1M) |
| Improvement over 4.5 | Mainly the context window, small quality gains |
| vs Gemini Ultra | Lower context ceiling (1M vs 2M), better coding |
| Cost for long context | Expensive ($15/M input), 3x more than Gemini |
| General model quality | Best overall (Arena 1301, SWE-bench 65.1%) |
Claude Opus 4.6 is the best general model with a genuinely useful 1M context window. It's not the cheapest long-context option (Gemini is 3x cheaper) and not the largest window (Gemini has 2M). But for tasks that require both high quality and long context, it's the first model that delivers both from the same provider.
The long-context game now has a real leader at the premium tier. My spreadsheet for model selection just got a new dimension.
-- dataku