NVIDIA B200 benchmarks are out. The inference economics just changed again.
The B200 delivers roughly 2.5x the inference throughput of the H100 at about 1.4x the power draw (1,000W vs 700W TDP). I compared the per-token cost on B200 vs H100 vs H200. If you're running inference at scale, the upgrade pays for itself in under 4 months.
NVIDIA's B200 GPU is shipping and the first real inference benchmarks are out. The numbers are better than the marketing slides suggested.
B200 vs H100 vs H200: inference performance
| Metric | B200 | H200 | H100 | B200 vs H100 |
|--------|------|------|------|--------------|
| FP8 inference (TFLOPS) | 4,500 | 1,979 | 1,513 | 2.97x |
| Tokens/sec (Llama 3.1 70B) | 142 | 82 | 55 | 2.58x |
| Tokens/sec (Llama 4 405B MoE) | 68 | 38 | 22 | 3.09x |
| Memory (HBM) | 192 GB | 141 GB | 80 GB | 2.4x |
| Memory bandwidth | 8 TB/s | 4.8 TB/s | 3.35 TB/s | 2.39x |
| TDP (W) | 1,000 | 700 | 700 | 1.43x |
Sources: NVIDIA, SemiAnalysis, early benchmark data from CoreWeave and cloud providers.
That's 2.58x the H100's throughput on Llama 3.1 70B and 3.09x on Llama 4 405B MoE. The MoE advantage is notable: the B200's larger memory (192GB) fits more expert weights, reducing the need for expert offloading.
Power consumption is higher (1,000W vs 700W), so the throughput-per-watt improvement is closer to 1.8x, not 2.5x. But at data center scale, throughput-per-dollar matters more than throughput-per-watt for most operators.
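The throughput-per-watt figure is just the ratio of the two ratios from the table. A minimal sketch of that arithmetic, assuming TDP is a fair stand-in for actual power draw under load (real draw will vary by workload):

```python
# Figures copied from the comparison table (Llama 3.1 70B row and TDP row).
TOKENS_PER_SEC = {"H100": 55, "B200": 142}
TDP_WATTS = {"H100": 700, "B200": 1000}


def ratio(values, new="B200", old="H100"):
    """Return how many times larger the new chip's value is than the old one's."""
    return values[new] / values[old]


throughput_gain = ratio(TOKENS_PER_SEC)   # about 2.58x
power_increase = ratio(TDP_WATTS)         # about 1.43x

# Assumption: TDP approximates sustained power draw during inference.
perf_per_watt_gain = throughput_gain / power_increase

print(f"throughput: {throughput_gain:.2f}x")
print(f"power:      {power_increase:.2f}x")
print(f"per watt:   {perf_per_watt_gain:.2f}x")
```

With the table's numbers this lands at about 1.8x per watt.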
Cost comparison per million tokens
| Hardware | Hourly rental (est.) | Tokens/sec (70B) | Cost per M tokens | Relative cost |
|----------|----------------------|------------------|-------------------|---------------|
| H100 SXM | $2.00/hr | 55 | $10.10 | 1.0x |
| H200 SXM | $2.80/hr | 82 | $9.49 | 0.94x |
| B200 SXM | $3.50/hr (est.) | 142 | $6.85 | 0.68x |
Sources: Lambda Labs, CoreWeave, my calculations. B200 rental pricing estimated based on NVIDIA pricing and cloud provider margins.
The B200 at an estimated $3.50/hr delivers tokens at $6.85 per million, vs $10.10 for H100. A 32% cost reduction per token.
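For anyone who wants to plug in their own rental rates, here is a rough sketch of how the cost-per-million-tokens column can be reproduced. It assumes a single GPU running flat out at the listed tokens/sec for the whole hour (no idle time, no batching effects); the function name `cost_per_million_tokens` is my own, not from any library.

```python
def cost_per_million_tokens(hourly_rate_usd, tokens_per_sec):
    """Dollars per one million generated tokens at 100% utilization (assumed)."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (hourly rental in USD, tokens/sec on Llama 3.1 70B), taken from the table.
HARDWARE = {
    "H100 SXM": (2.00, 55),
    "H200 SXM": (2.80, 82),
    "B200 SXM": (3.50, 142),  # rental price is an estimate
}

costs = {
    name: cost_per_million_tokens(rate, tps)
    for name, (rate, tps) in HARDWARE.items()
}
baseline = costs["H100 SXM"]

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per M tokens ({cost / baseline:.2f}x)")
```

Swapping in a different B200 rate shows how sensitive the 32% figure is to the rental estimate.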
For Llama 4 405B (MoE), the numbers are even more favorable:
| Hardware | Tokens/sec (405B MoE) | Cost per M tokens |
|----------|-----------------------|-------------------|
| H100 SXM (8x) | 22 | $327 |
| H200 SXM (8x) | 38 | $265 |
| B200 SXM (4x) | 68 | $185 |
The B200 can run 405B MoE inference on 4 GPUs (768GB total) where the H100 needed 8 GPUs. Half the hardware for 3x the throughput.
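To make the "half the hardware" point concrete, here is a small sketch that works out pooled memory and per-GPU throughput for each configuration. It assumes the tokens/sec figures in the table are aggregate for the whole multi-GPU setup, which the benchmark sources don't spell out; the `Node` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    gpu: str
    gpu_count: int
    hbm_gb_per_gpu: int
    tokens_per_sec: float  # assumed to be aggregate for the whole configuration

    @property
    def total_hbm_gb(self):
        """Total HBM pooled across all GPUs in the configuration."""
        return self.gpu_count * self.hbm_gb_per_gpu

    @property
    def tokens_per_sec_per_gpu(self):
        """Throughput divided evenly across the GPUs."""
        return self.tokens_per_sec / self.gpu_count


# GPU counts, memory sizes, and throughput from the tables in this post.
NODES = [
    Node("H100", 8, 80, 22),
    Node("H200", 8, 141, 38),
    Node("B200", 4, 192, 68),
]

for node in NODES:
    print(
        f"{node.gpu_count}x {node.gpu}: {node.total_hbm_gb} GB total, "
        f"{node.tokens_per_sec_per_gpu:.2f} tokens/sec per GPU"
    )
```

On a per-GPU basis the gap is wider than the headline 3x, since the B200 setup uses half as many chips.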
Payback period for upgrading
| Scenario | Current | Upgrade | Monthly savings | Payback period |
|----------|---------|---------|-----------------|----------------|
| 100M tokens/month on H100s | $1,010 | Buy B200 (~$35K) | $325 | 3.6 months |
| 1B tokens/month on H100s | $10,100 | Buy B200s (~$140K for 4) | $3,250 | 1.4 months |
| 10B tokens/month on H100s | $101,000 | Buy B200s (~$560K for 16) | $32,500 | 0.6 months |
At scale, the B200 pays for itself in under 2 months. Even at modest volumes (100M tokens/month), the payback is under 4 months.
Impact on the inference market
| Implication | Detail |
|-------------|--------|
| Cloud inference prices will drop 20-30% | As providers upgrade to B200, costs decrease |
| H100 resale prices fall further | Already down 40%, B200 availability accelerates depreciation |
| Self-hosting break-even lowers | Cheaper hardware = lower barrier to self-host |
| MoE models get a boost | B200's large memory makes MoE architectures more practical |
Sources: Market analysis, cloud provider pricing trends.
I expect to see API price cuts from inference providers within 60-90 days of B200 deployment reaching scale. Together AI, Fireworks AI, and Groq will likely move first, followed by the major providers.
My take
The B200 is a straightforward generational upgrade: faster, more memory, more power-hungry (about 43% higher TDP). No architectural revolution. Just a bigger, better chip.
But the compounding effect matters. H100 (2023) to H200 (2024) to B200 (2025): each generation delivering 1.5-2.5x more inference throughput. Over three generations, that's roughly a 5-7x improvement in inference cost.
Combined with algorithmic improvements (MoE, quantization, speculative decoding), the total inference cost reduction from 2023 to 2025 is closer to 20-30x. Hardware and software improvements are multiplying.
My GPU tracking spreadsheet just got a new row. The B200 numbers make everything above it look expensive.
-- dataku