Industry Trends · October 6, 2025 · 4 min read

NVIDIA B200 benchmarks are out. The inference economics just changed again.

The B200 delivers about 2.5x the inference throughput of the H100 while drawing about 1.4x the power (1,000W vs 700W). I compared the per-token cost on B200 vs H100 vs H200. At estimated rental rates, the B200 comes out roughly 32% cheaper per token on a 70B model.

NVIDIA's B200 GPU is shipping and the first real inference benchmarks are out. The numbers are better than the marketing slides suggested.

B200 vs H100 vs H200: inference performance

| Metric | B200 | H200 | H100 | B200 vs H100 |
|--------|------|------|------|--------------|
| FP8 inference (TFLOPS) | 4,500 | 1,979 | 1,513 | 2.97x |
| Tokens/sec (Llama 3.1 70B) | 142 | 82 | 55 | 2.58x |
| Tokens/sec (Llama 4 405B MoE) | 68 | 38 | 22 | 3.09x |
| Memory capacity | 192 GB | 141 GB | 80 GB | 2.4x |
| Memory bandwidth | 8 TB/s | 4.8 TB/s | 3.35 TB/s | 2.39x |
| TDP | 1,000W | 700W | 700W | 1.43x |

Sources: NVIDIA, SemiAnalysis, early benchmark data from CoreWeave and cloud providers.

2.58x faster on Llama 3.1 70B. 3.09x faster on Llama 4 405B MoE. The MoE advantage is notable: the B200's larger memory (192GB) fits more expert weights, reducing the need for expert offloading.

Power consumption is higher (1,000W vs 700W), so the throughput-per-watt improvement is closer to 1.8x, not 2.5x. But at data center scale, throughput-per-dollar matters more than throughput-per-watt for most operators.
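The 1.8x figure is just throughput divided by TDP, compared across chips. A minimal sketch of that arithmetic, using the Llama 3.1 70B numbers from the table above and assuming TDP is a fair proxy for actual power draw under load (real draw will vary):

```python
# Throughput and TDP figures copied from the benchmark table above (Llama 3.1 70B).
# Assumption: TDP stands in for real power draw during inference.
GPUS = {
    "H100": {"tokens_per_sec": 55, "tdp_watts": 700},
    "H200": {"tokens_per_sec": 82, "tdp_watts": 700},
    "B200": {"tokens_per_sec": 142, "tdp_watts": 1000},
}


def tokens_per_sec_per_watt(name: str) -> float:
    """Throughput per watt for one GPU, using TDP as the power figure."""
    gpu = GPUS[name]
    return gpu["tokens_per_sec"] / gpu["tdp_watts"]


def efficiency_gain(new: str, old: str) -> float:
    """How many times more tokens/sec per watt `new` delivers than `old`."""
    return tokens_per_sec_per_watt(new) / tokens_per_sec_per_watt(old)


if __name__ == "__main__":
    for name in GPUS:
        gain = efficiency_gain(name, "H100")
        print(f"{name}: {gain:.2f}x tokens/sec per watt vs H100")
```

The H200 row is a useful sanity check: same 700W TDP as the H100, so its per-watt gain equals its raw throughput gain.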

Cost comparison per million tokens

| Hardware | Hourly rental (est.) | Tokens/sec (70B) | Cost per M tokens | Relative cost |
|----------|----------------------|------------------|-------------------|---------------|
| H100 SXM | $2.00/hr | 55 | $10.10 | 1.0x |
| H200 SXM | $2.80/hr | 82 | $9.49 | 0.94x |
| B200 SXM | $3.50/hr (est.) | 142 | $6.85 | 0.68x |

Sources: Lambda Labs, CoreWeave, my calculations. B200 rental pricing estimated based on NVIDIA pricing and cloud provider margins.

The B200 at an estimated $3.50/hr delivers tokens at $6.85 per million, vs $10.10 for H100. A 32% cost reduction per token.
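For anyone who wants to plug in their own rental quotes, the per-token figures in the 70B table come from dividing the hourly rate by tokens generated per hour. A rough sketch, assuming a single fully utilized GPU running at the table's tokens/sec rate (no batching gains, no idle time); the function and dictionary names are mine:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """Rental cost in USD to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000


# (hourly rate in USD, tokens/sec on Llama 3.1 70B), copied from the table above.
# The B200 rate is an estimate, as noted in the sources line.
RENTALS = {
    "H100 SXM": (2.00, 55),
    "H200 SXM": (2.80, 82),
    "B200 SXM": (3.50, 142),
}

if __name__ == "__main__":
    baseline = cost_per_million_tokens(*RENTALS["H100 SXM"])
    for name, (rate, tps) in RENTALS.items():
        cost = cost_per_million_tokens(rate, tps)
        print(f"{name}: ${cost:.2f} per M tokens ({cost / baseline:.2f}x H100)")
```

Swapping in a different B200 hourly rate shows how sensitive the 32% figure is to the rental estimate.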

For Llama 4 405B (MoE), the numbers are even more favorable:

| Hardware | Tokens/sec (405B MoE) | Cost per M tokens |
|----------|-----------------------|-------------------|
| H100 SXM (8x) | 22 | $327 |
| H200 SXM (8x) | 38 | $265 |
| B200 SXM (4x) | 68 | $185 |

The B200 can run 405B MoE inference on 4 GPUs (768GB total) where the H100 needed 8 GPUs. Half the hardware for 3x the throughput.

Payback period for upgrading

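| Scenario | Current monthly cost | Upgrade | Monthly savings | Payback period |
|----------|----------------------|---------|-----------------|----------------|
| 100M tokens/month on H100s | $1,010 | Buy B200 (~$35K) | $325 | ~108 months |
| 1B tokens/month on H100s | $10,100 | Buy B200s (~$140K for 4) | $3,250 | ~43 months |
| 10B tokens/month on H100s | $101,000 | Buy B200s (~$560K for 16) | $32,500 | ~17 months |

Payback here is the purchase price divided by the monthly savings, where savings are the 32% per-token reduction applied to current H100 spend. On these figures, buying pays back in about 17 months at 10B tokens/month. At lower volumes it takes far longer: roughly 43 months at 1B tokens/month and about 9 years at 100M tokens/month, where renting is the better way to capture the savings.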

Impact on the inference market

| Implication | Detail |
|-------------|--------|
| Cloud inference prices will drop 20-30% | As providers upgrade to B200, costs decrease |
| H100 resale prices fall further | Already down 40%, B200 availability accelerates depreciation |
| Self-hosting break-even lowers | Cheaper hardware = lower barrier to self-host |
| MoE models get a boost | B200's large memory makes MoE architectures more practical |

Sources: Market analysis, cloud provider pricing trends.

I expect to see API price cuts from inference providers within 60-90 days of B200 deployment reaching scale. Together AI, Fireworks AI, and Groq will likely move first, followed by the major providers.

My take

The B200 is a straightforward generational upgrade: faster, more memory, and more power-hungry (about 43% higher TDP). No architectural revolution. Just a bigger, better chip.

But the compounding effect matters. H100 (2023) to H200 (2024) to B200 (2025): on the 70B benchmark above, each step delivered roughly 1.5-1.7x more inference throughput. Across the two steps, that compounds to about 2.6x on the 70B model and about 3.1x on the MoE workload.

Combined with algorithmic improvements (MoE, quantization, speculative decoding), the total inference cost reduction from 2023 to 2025 is closer to 20-30x. Hardware and software improvements are multiplying.

My GPU tracking spreadsheet just got a new row. The B200 numbers make everything above it look expensive.


-- dataku
