Grok 3 and the xAI compute cluster: throwing brute force at AI
xAI built a 100K GPU cluster in Memphis. Grok 3 is the first model trained on it. The benchmarks are competitive with Claude 3.5 Sonnet and GPT-4o. I ran my standard evaluation. It's good, but the interesting story is the infrastructure bet.
Elon Musk's xAI spent a reported $3-4 billion building a 100,000 GPU cluster in Memphis, Tennessee. Called "Colossus," it's the largest known AI training cluster in the world.
Grok 3 is the first model to come out of it. And it's... actually good?
I ran my standard evaluation. Let me show you the numbers.
Benchmark comparison
| Benchmark | Grok 3 | Claude 3.5 Sonnet | GPT-4o | DeepSeek V3 |
|-----------|--------|-------------------|--------|-------------|
| MMLU | 89.2% | 88.7% | 88.7% | 87.1% |
| GPQA Diamond | 61.3% | 59.4% | 53.6% | 59.1% |
| MATH | 82.4% | 78.3% | 76.6% | 61.6% |
| HumanEval | 89.7% | 93.7% | 90.2% | 82.6% |
| AIME 2024 | 60.2% | N/A | N/A | 39.2% |
| Chatbot Arena Elo | 1260 | 1269 | 1261 | 1249 |
| SWE-bench Verified | 39.8% | 49.0% | 33.2% | 42.0% |
Sources: xAI blog post, LMSYS Chatbot Arena, prior model benchmarks, SWE-bench.
Grok 3 is competitive. On MMLU (89.2%) it slightly edges both Claude and GPT-4o. On MATH (82.4%) it beats both. On GPQA Diamond (61.3%) it leads the pack.
But on HumanEval (89.7%) it trails Claude 3.5 Sonnet (93.7%). And on SWE-bench Verified (39.8%) it trails both Claude (49.0%) and DeepSeek V3 (42.0%).
It's a frontier-class model. But it's not the clear #1 at anything except maybe pure math (and DeepSeek R1 crushes it there with reasoning enabled).
My 300-prompt evaluation
| Category (50 prompts each) | Grok 3 | Claude 3.5 Sonnet | GPT-4o |
|----------------------------|--------|-------------------|--------|
| Coding (Python) | 78% | 86% | 82% |
| Coding (general) | 74% | 82% | 80% |
| Analysis/reasoning | 82% | 80% | 78% |
| Creative writing | 70% | 84% | 76% |
| Factual Q&A | 86% | 84% | 88% |
| Instruction following | 76% | 84% | 82% |
Grok 3 wins only analysis/reasoning in my tests (82%). Claude wins both coding categories, creative writing, and instruction following. GPT-4o wins factual Q&A (88%), with Grok 3 two points behind at 86%.
The creative writing score (70%) is notably low. Grok 3 has a very distinct voice that some raters liked but most found too opinionated for professional use. It tends to editorialize.
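For anyone who wants to re-check the category calls, here is a minimal sketch that tallies the table above. The scores are copied straight from the table; the function names are mine, and the simple average only works as an overall score because every category has the same 50 prompts.

```python
# Scores (percent) copied from the 300-prompt table above.
SCORES = {
    "Coding (Python)":       {"Grok 3": 78, "Claude 3.5 Sonnet": 86, "GPT-4o": 82},
    "Coding (general)":      {"Grok 3": 74, "Claude 3.5 Sonnet": 82, "GPT-4o": 80},
    "Analysis/reasoning":    {"Grok 3": 82, "Claude 3.5 Sonnet": 80, "GPT-4o": 78},
    "Creative writing":      {"Grok 3": 70, "Claude 3.5 Sonnet": 84, "GPT-4o": 76},
    "Factual Q&A":           {"Grok 3": 86, "Claude 3.5 Sonnet": 84, "GPT-4o": 88},
    "Instruction following": {"Grok 3": 76, "Claude 3.5 Sonnet": 84, "GPT-4o": 82},
}


def category_winners(scores):
    """Return the top-scoring model for each category."""
    return {category: max(row, key=row.get) for category, row in scores.items()}


def mean_score(scores, model):
    """Unweighted mean across categories (valid because each has 50 prompts)."""
    return sum(row[model] for row in scores.values()) / len(scores)


for category, winner in category_winners(SCORES).items():
    print(f"{category}: {winner}")

for model in ("Grok 3", "Claude 3.5 Sonnet", "GPT-4o"):
    print(f"{model}: {mean_score(SCORES, model):.1f}% average")
```

The average is a blunt summary. The per-category winners are the more useful output if you care about one workload in particular.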
The infrastructure story
The benchmarks are fine. Good, even. But the real story is the infrastructure bet:
| Infrastructure metric | xAI Colossus | Context |
|----------------------|-------------|----------------------|
| GPUs | 100,000 H100 | Largest known cluster |
| Construction time | ~4 months | Typical: 12-18 months |
| Estimated cost | $3-4B | Similar to Microsoft's Azure AI infra spend |
| Power consumption | ~150 MW | Enough for 100,000+ homes |
| Location | Memphis, TN | Tennessee Valley Authority (cheap power) |
Sources: xAI announcements, Reuters reporting, industry estimates.
100,000 H100 GPUs in one cluster. At retail, H100s cost roughly $30,000 each. That's $3 billion in GPUs alone, before networking, cooling, power infrastructure, and the building itself.
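Here is that back-of-envelope math as a small sketch. Every input is a rough figure from this post (GPU count, the approximate $30,000 retail price, the reported $3-4B total, the ~150 MW power draw). None of it is audited, and the function names are just for illustration.

```python
# Rough inputs taken from the post; all are estimates, not audited figures.
GPU_COUNT = 100_000
H100_RETAIL_USD = 30_000           # approximate retail price per GPU
REPORTED_TOTAL_USD = (3e9, 4e9)    # reported low and high cluster cost
CLUSTER_POWER_WATTS = 150e6        # ~150 MW for the whole site


def gpu_spend(count, unit_price_usd):
    """Total GPU hardware cost at a flat per-unit price."""
    return count * unit_price_usd


def watts_per_gpu(total_watts, count):
    """Site power divided by GPU count (includes cooling, networking, etc.)."""
    return total_watts / count


gpus_usd = gpu_spend(GPU_COUNT, H100_RETAIL_USD)
print(f"GPUs alone: ${gpus_usd / 1e9:.1f}B")

for total in REPORTED_TOTAL_USD:
    left_over = total - gpus_usd
    print(f"At ${total / 1e9:.0f}B total, ${left_over / 1e9:.1f}B is left for everything else")

print(f"Power budget: {watts_per_gpu(CLUSTER_POWER_WATTS, GPU_COUNT):.0f} W per GPU, all-in")
```

At full retail pricing the GPUs consume the entire low end of the reported range, so either the real total sits near the high end or xAI paid less than retail per unit. The post's figures can't distinguish between the two.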
The speed of construction is the most remarkable data point. Four months from breaking ground to operational. Normal data center construction takes 12-18 months. xAI essentially speed-ran it by buying prefab containers and running everything at maximum construction crew density.
Cost-efficiency vs DeepSeek
Here's the comparison that keeps nagging me:
| Metric | xAI (Grok 3) | DeepSeek (V3/R1) |
|--------|-------------|-----------------|
| Training infrastructure cost | $3-4B (estimated) | Unknown (much less) |
| GPU count | 100,000 H100s | 2,048 H800s |
| Training cost (single model) | Hundreds of millions (estimated) | $5.6M (V3) |
| Benchmark range | Frontier-competitive | Frontier-competitive |
Two completely opposite approaches to the same goal. xAI threw maximum hardware at the problem. DeepSeek threw maximum algorithmic efficiency.
Both produced frontier-competitive models.
This is the most interesting tension in AI right now. Is the future "more compute" or "smarter compute"? The honest answer is probably both, but the DeepSeek R1 result makes the brute-force approach look less compelling per dollar spent.
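To put a number on the gap, here is a rough sketch of the ratios implied by the comparison table. The GPU counts and the $5.6M figure come from the table. The xAI training cost passed in at the bottom is a hypothetical placeholder, since the only estimate available is "hundreds of millions". Note also that H100s and H800s are not identical parts, so the raw count ratio is only a loose proxy for compute.

```python
# Figures from the comparison table above.
XAI_GPU_COUNT = 100_000              # H100s
DEEPSEEK_GPU_COUNT = 2_048           # H800s
DEEPSEEK_V3_TRAINING_USD = 5.6e6     # reported single-model training cost


def gpu_count_ratio(xai_gpus=XAI_GPU_COUNT, deepseek_gpus=DEEPSEEK_GPU_COUNT):
    """How many times larger the xAI cluster is by raw GPU count."""
    return xai_gpus / deepseek_gpus


def training_cost_ratio(xai_training_usd, deepseek_training_usd=DEEPSEEK_V3_TRAINING_USD):
    """Ratio of single-model training costs for a given xAI estimate."""
    return xai_training_usd / deepseek_training_usd


print(f"GPU count ratio: {gpu_count_ratio():.1f}x")

# Hypothetical placeholder only: no specific Grok 3 training cost is reported.
ASSUMED_XAI_TRAINING_USD = 200e6
print(f"Cost ratio at an assumed $200M: {training_cost_ratio(ASSUMED_XAI_TRAINING_USD):.1f}x")
```

Swap in whatever training-cost estimate you trust. The ratio stays well above 10x for any figure in the "hundreds of millions" range.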
Pricing
| Metric | Grok 3 (API) | Claude 3.5 Sonnet | GPT-4o |
|--------|-------------|-------------------|--------|
| Input per M tokens | $3.00 | $3.00 | $2.50 |
| Output per M tokens | $15.00 | $15.00 | $10.00 |
Sources: xAI pricing, Anthropic, OpenAI.
xAI priced Grok 3 identically to Claude 3.5 Sonnet. Neither cheaper nor more expensive. For a model that's roughly in the same performance tier, that's a reasonable pricing strategy.
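If you want to see what those rates mean for a real workload, here is a minimal cost sketch. The prices are the ones in the table above. The token counts in the example are made up for illustration, and `request_cost` is my own helper, not part of any vendor SDK.

```python
# (input, output) price in USD per million tokens, from the pricing table above.
PRICES_PER_M_TOKENS = {
    "Grok 3": (3.00, 15.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD for a given number of input and output tokens."""
    input_price, output_price = PRICES_PER_M_TOKENS[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload: 1M input tokens and 1M output tokens.
for model in PRICES_PER_M_TOKENS:
    print(f"{model}: ${request_cost(model, 1_000_000, 1_000_000):.2f}")
```

Output-heavy workloads widen the gap with GPT-4o, since the output rate is where the prices differ most.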
My take
Grok 3 is the first xAI model I'd recommend for production use. Grok 1 was a curiosity. Grok 2 was mediocre. Grok 3 is genuinely competitive with the best models available.
But "competitive with" and "better than" are different statements. Based on my data, I'd still choose Claude for coding, GPT-4o for multimodal, and Gemini for long context. Grok 3 doesn't have a clear "best at" category yet.
What it does have is a 100,000 GPU training cluster. If Grok 4 uses that full cluster with improved training recipes, the potential is significant.
The compute is there. The question is whether xAI's research team can use it as efficiently as DeepSeek uses their much smaller cluster. Right now, the data says DeepSeek extracts more benchmark points per GPU-hour.
My spreadsheet has a new column: "benchmark score per billion dollars of infrastructure." I'll let you guess who leads.
-- dataku