Industry Trends · March 25, 2024 · 8 min read

The AI chip market in 2024: not just NVIDIA anymore

I compiled specs and benchmarks for every AI accelerator that's shipping or announced for 2024: NVIDIA H100, AMD MI300X, Google TPU v5e, Groq LPU, Intel Gaudi 3, and 9 others. The competition is finally real.

For three years, "AI chip" basically meant "NVIDIA GPU." The H100 was the only chip that mattered. Supply was constrained. Every AI company was fighting over allocation. Jensen Huang wore the leather jacket of a monopolist.

That era is ending. I spent the last month compiling every AI accelerator that's shipping or announced for 2024. The list is longer than I expected, and the competitive picture is changing.

The full accelerator comparison

| Chip | Company | Process node | Memory | Memory bandwidth | FP16 TFLOPS | TDP | Availability | Est. price |
|------|---------|-------------|--------|-----------------|-------------|-----|-------------|-----------|
| H100 SXM | NVIDIA | 4nm | 80GB HBM3 | 3.35 TB/s | 989 | 700W | Shipping | ~$30,000 |
| H200 SXM | NVIDIA | 4nm | 141GB HBM3e | 4.8 TB/s | 989 | 700W | Q2 2024 | ~$35,000 |
| B200 SXM | NVIDIA | 4nm | 192GB HBM3e | 8 TB/s | 2,250 | 1000W | H2 2024 | ~$40,000+ |
| MI300X | AMD | 5nm | 192GB HBM3 | 5.3 TB/s | 1,307 | 750W | Shipping | ~$15,000 |
| TPU v5e | Google Cloud | 7nm | 16GB HBM2e | 1.6 TB/s | ~197 | ~200W | Cloud only | ~$1.20/hr |
| TPU v5p | Google Cloud | Unknown | 95GB HBM | 4.8 TB/s | ~459 | ~400W | Cloud only | ~$4.20/hr |
| LPU (GroqChip) | Groq | 14nm | 230MB SRAM | 80 TB/s | ~188 | 300W | Cloud API only | Pay-per-token |
| Gaudi 3 | Intel/Habana | 5nm | 128GB HBM2e | 3.7 TB/s | ~1,835 | 900W | H2 2024 | ~$15,000 est. |
| Trainium2 | AWS | Unknown | Unknown | Unknown | Unknown | Unknown | Late 2024 | Cloud only |
| CS-3 | Cerebras | TSMC 5nm | 44GB SRAM | 21 PB/s | ~125 | 23,000W | Cloud API | Pay-per-token |
| SN40L | SambaNova | TSMC 7nm | 1.5TB (full system) | Unknown | Unknown | Unknown | System only | ~$250K system |
| Bow | Graphcore | 7nm | 0.9GB SRAM + 256GB host | 65 TB/s | 350 | 300W | Shipping | ~$6,000 |
| MTIA v2 | Meta | 5nm | Unknown | Unknown | Unknown | Unknown | Internal only | Not for sale |
| TPU Trillium (v6) | Google Cloud | Unknown | Unknown | Unknown | Unknown | Unknown | 2025 | Cloud only |

Sources: Official spec sheets, company announcements, SemiAnalysis, press reporting, cloud pricing pages. Some specs are estimates based on available information.

That's 14 chips from 10 different companies. A year ago, this table would have had three entries.

The performance picture

Raw TFLOPS don't tell the full story. For LLM inference, what matters most is memory bandwidth: generating each token means reading the model weights from memory, so the speed of that read sets the pace. For training, it's a combination of compute and interconnect speed.
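To make the bandwidth point concrete, here is a rough back-of-envelope sketch. It assumes single-stream decoding where every new token requires reading all the weights once, and it ignores KV-cache traffic, batching, quantization, and whether the model even fits on one chip. The function name and the 70B-parameter, 2-bytes-per-weight example are my illustrative assumptions, not measurements.

```python
def decode_ceiling_tokens_per_sec(bandwidth_tb_s: float,
                                  params_billion: float,
                                  bytes_per_param: float = 2.0) -> float:
    """Upper bound on single-stream tokens/sec if decoding is purely bandwidth-bound.

    Assumes one full pass over the weights per generated token.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_sec = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_sec / weight_bytes


# Memory bandwidth figures (TB/s) taken from the comparison table.
for name, bandwidth in [("H100 SXM", 3.35), ("H200 SXM", 4.8), ("MI300X", 5.3)]:
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, params_billion=70)
    print(f"{name}: ~{ceiling:.1f} tokens/sec per stream (ceiling)")
```

These are per-stream ceilings, so they are not comparable to batched serving throughput, where one read of the weights is shared across many sequences. The useful part is the shape: double the bandwidth and the ceiling doubles, regardless of TFLOPS.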

Let me break this into what actually matters for each workload:

For LLM inference

| Chip | Memory bandwidth | Best use case | Tokens/sec estimate (Llama 2 70B) |
|------|-----------------|--------------|-----------------------------------|
| H100 SXM | 3.35 TB/s | General inference | ~90 |
| MI300X | 5.3 TB/s | High-throughput inference | ~130 |
| H200 SXM | 4.8 TB/s | Large model inference (141GB) | ~120 |
| Groq LPU | 80 TB/s (SRAM) | Ultra-low latency inference | ~800 |
| Cerebras CS-3 | 21 PB/s | Wafer-scale inference | ~1,800 |
| TPU v5p | 4.8 TB/s | Batch inference at scale | ~110 |

Sources: My estimates based on memory bandwidth ratios, validated against published benchmarks where available. Groq numbers from my February measurements. Cerebras numbers from their published demos.

AMD's MI300X has 58% more memory bandwidth than the H100. In my estimates, that translates to roughly 44% faster inference throughput. And it costs half as much. On paper, the MI300X is the best price/performance chip for LLM inference. The challenge is software maturity: CUDA launched in 2007, nearly a decade before ROCm, and that head start still shows.
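The percentages and the price/performance claim are easy to reproduce. The sketch below uses the bandwidth and price figures from the tables; the helper names are mine, and scaling the H100 estimate by the bandwidth ratio is a simplification I'm assuming for illustration, not necessarily the exact method behind the table.

```python
H100_BANDWIDTH_TB_S = 3.35
H100_TOKENS_PER_SEC = 90  # the table's Llama 2 70B estimate for the H100

# Bandwidth (TB/s) and estimated list price (USD) from the tables.
CHIPS = {
    "H100 SXM": {"bandwidth": 3.35, "price": 30_000},
    "H200 SXM": {"bandwidth": 4.8, "price": 35_000},
    "MI300X": {"bandwidth": 5.3, "price": 15_000},
}


def percent_more(value: float, baseline: float) -> float:
    """How much larger `value` is than `baseline`, in percent."""
    return (value / baseline - 1) * 100


def scaled_tokens_per_sec(bandwidth_tb_s: float) -> float:
    """Scale the H100 estimate by the bandwidth ratio (optimistic: assumes perfect scaling)."""
    return H100_TOKENS_PER_SEC * bandwidth_tb_s / H100_BANDWIDTH_TB_S


def dollars_per_tb_s(price: float, bandwidth_tb_s: float) -> float:
    """Estimated purchase price per TB/s of memory bandwidth."""
    return price / bandwidth_tb_s


for name, spec in CHIPS.items():
    print(
        f"{name}: {percent_more(spec['bandwidth'], H100_BANDWIDTH_TB_S):+.0f}% bandwidth vs H100, "
        f"~{scaled_tokens_per_sec(spec['bandwidth']):.0f} tokens/sec if scaling were perfect, "
        f"${dollars_per_tb_s(spec['price'], spec['bandwidth']):,.0f} per TB/s"
    )
```

Pure ratio scaling puts the MI300X at about 142 tokens/sec, a little above the ~130 in the table, so treat the ratio as an optimistic ceiling. On dollars per TB/s, the MI300X comes out at under a third of the H100's cost.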

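Groq and Cerebras are in a different category entirely. Their architectures trade memory capacity for bandwidth by keeping weights in on-chip SRAM. The speed is extraordinary, but capacity is tight: a model has to fit in that constrained memory, and at 230MB per Groq chip, a large model has to be spread across many chips.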

For LLM training

| Chip | FP16 TFLOPS | Memory | Interconnect | Best for training? |
|------|-------------|--------|-------------|-------------------|
| H100 SXM | 989 | 80GB | NVLink 900 GB/s | Yes (standard) |
| B200 SXM | 2,250 | 192GB | NVLink 1.8 TB/s | Yes (next gen) |
| MI300X | 1,307 | 192GB | Infinity Fabric | Yes (cost effective) |
| Gaudi 3 | ~1,835 | 128GB | RoCE | Maybe (software?) |
| TPU v5p | ~459 | 95GB | ICI 4.8 TB/s | Yes (at Google's scale) |

Sources: Official specs, SemiAnalysis analysis.

For training, NVIDIA still wins because of CUDA, the NVLink interconnect, and the sheer amount of training infrastructure already built around their chips. The B200 (shipping late 2024) will have roughly 2.3x the compute of an H100 and 2.4x the memory. That's enough to maintain NVIDIA's lead.

But Intel Gaudi 3's raw TFLOPS number (~1,835) is higher than anything else on this list except the B200 (2,250). If the software matures, it could be a real contender for training workloads. Big "if" though.
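For a rough sense of how the training chips compare on paper, the sketch below computes the B200-to-H100 ratios and a simple dollars-per-TFLOP figure from the table values. The helper names are mine, the prices are the estimates from the comparison table (the B200's "~$40,000+" is treated as a floor), and TPU v5p is left out because it is priced per hour rather than per chip.

```python
# (name, FP16 TFLOPS, memory in GB, estimated price in USD) from the tables.
TRAINING_CHIPS = [
    ("H100 SXM", 989, 80, 30_000),
    ("B200 SXM", 2_250, 192, 40_000),  # listed as "~$40,000+", so this is a floor
    ("MI300X", 1_307, 192, 15_000),
    ("Gaudi 3", 1_835, 128, 15_000),   # price is listed as an estimate
]


def ratio_to_baseline(value: float, baseline: float) -> float:
    """Express `value` as a multiple of `baseline`."""
    return value / baseline


def dollars_per_tflop(price: float, tflops: float) -> float:
    """Estimated purchase price per peak FP16 TFLOP."""
    return price / tflops


print(f"B200 vs H100 compute: {ratio_to_baseline(2_250, 989):.1f}x")
print(f"B200 vs H100 memory:  {ratio_to_baseline(192, 80):.1f}x")

# Rank chips from cheapest to most expensive per peak TFLOP.
ranked = sorted(
    ((name, dollars_per_tflop(price, tflops)) for name, tflops, _memory, price in TRAINING_CHIPS),
    key=lambda item: item[1],
)
for name, cost in ranked:
    print(f"{name}: ${cost:.2f} per peak FP16 TFLOP")
```

Peak TFLOPS per dollar says nothing about interconnect or software, which is exactly where the training story gets decided, so read this as a spec-sheet ranking only.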

The market share reality check

Impressive specs don't mean market share. Here's the current picture:

| Company | Est. AI accelerator revenue (2023) | Market share | Trajectory |
|---------|-----------------------------------|-------------|------------|
| NVIDIA | ~$40B | ~80% | Growing |
| AMD | ~$2.3B | ~5% | Growing fast |
| Google (TPU, internal) | ~$3B | ~6% | Stable |
| Intel/Habana | ~$0.5B | ~1% | Uncertain |
| Groq | <$0.1B | <1% | Early stage |
| Cerebras | <$0.1B | <1% | Early stage |
| Others | ~$3B | ~7% | Mixed |

Sources: Company earnings reports, analyst estimates, SemiAnalysis, press reporting. All numbers are approximate.

NVIDIA at 80% market share. That's a monopoly by any practical measure. But the $40B in revenue is a signal: the market is large enough to support multiple viable competitors. AMD going from ~$0 to $2.3B in one year shows the demand for alternatives.
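The share column follows directly from the revenue column. Here is a minimal sketch of that arithmetic, assuming the approximate revenue figures from the table and treating the "<$0.1B" entries as $0.1B (an upper bound I'm picking for illustration).

```python
# Estimated 2023 AI accelerator revenue in billions of USD, from the table.
REVENUE_BILLIONS = {
    "NVIDIA": 40.0,
    "AMD": 2.3,
    "Google (TPU, internal)": 3.0,
    "Intel/Habana": 0.5,
    "Groq": 0.1,      # table says "<$0.1B"; 0.1 is an assumed upper bound
    "Cerebras": 0.1,  # table says "<$0.1B"; 0.1 is an assumed upper bound
    "Others": 3.0,
}


def market_shares(revenue: dict[str, float]) -> dict[str, float]:
    """Return each company's share of total revenue, in percent."""
    total = sum(revenue.values())
    return {company: amount / total * 100 for company, amount in revenue.items()}


shares = market_shares(REVENUE_BILLIONS)
for company, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{company}: {share:.1f}%")
```

With these inputs NVIDIA lands at about 82% of a roughly $49B total, which matches the rounded ~80% in the table. Everyone else combined is under a fifth of the market.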

My five observations

1. Memory is the new bottleneck. Chip announcements in 2024 lead with memory capacity and bandwidth, not compute TFLOPS. The H200 increased memory capacity by 76% and bandwidth by 43% over the H100 while keeping compute flat. The industry figured out that for inference, you're limited by how fast you can read weights, not how fast you can multiply matrices.

2. AMD MI300X is the real competitive threat. Not Groq. Not Cerebras. AMD. Because AMD has a traditional chip business, a maturing software stack (ROCm), and chips that slot into existing data center infrastructure. The MI300X at $15K vs H100 at $30K with better specs is a simple purchasing decision if the software works.

3. The startup chips are niche plays (for now). Groq and Cerebras have genuinely impressive technology, but they'll serve specific inference workloads, not general-purpose AI compute. That's fine. Niche plays can be billion-dollar businesses. But they won't dethrone NVIDIA.

4. Google's TPU strategy is underrated. Google builds its own chips AND uses NVIDIA GPUs, so it can switch between them for different workloads. Few other companies have this optionality at that scale. And TPU v5p is competitive on both training and inference.

5. NVIDIA's real moat is software. CUDA, cuDNN, TensorRT, Triton Inference Server. More than fifteen years of developer tools that nearly every major AI framework is built on. The chip specs are one thing. The software that makes chips usable is another. Every NVIDIA competitor's hardest problem isn't building a faster chip. It's building a software stack that developers will actually use.

The AI chip market went from a monopoly to a competition in about 18 months. I'm tracking 14 chips now. By this time next year, it'll probably be 20. My spreadsheet grows. My data analyst heart is happy.


-- dataku
