Pricing Watch | January 8, 2024 | 5 min read

Mixtral 8x7B is free to run and matches GPT-3.5. The inference economics are changing.

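I set up Mixtral on a single A100 and benchmarked throughput, measuring 95 tokens/second at batch size 4. My self-hosted cost estimate is about $0.16 per million tokens; the OpenAI API charges $0.50 per million input tokens and $1.50 per million output tokens. Open source inference can now be genuinely cheaper.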

I spent my holiday break doing what any normal person does: running GPU benchmarks in my apartment.

(My partner says this is not normal. I disagree.)

I set up Mixtral 8x7B on a single A100 80GB using vLLM and ran throughput tests for three days straight. I wanted the real numbers, not the theoretical ones from blog posts.

The setup

Single NVIDIA A100 80GB, rented from Lambda Labs at $1.29/hr. vLLM serving framework with continuous batching. Mixtral 8x7B-Instruct-v0.1 in FP16.

Total cost of this experiment: $92.88. My most expensive holiday gift to myself.

Throughput results

| Batch size | Tokens/second (output) | Latency p50 | Latency p99 | VRAM used |
|-----------|----------------------|------------|------------|-----------|
| 1 | 42 | 24ms/token | 31ms/token | 87GB |
| 4 | 95 | 42ms/token | 58ms/token | 88GB |
| 8 | 156 | 51ms/token | 74ms/token | 89GB |
| 16 | 201 | 80ms/token | 112ms/token | 90GB |
| 32 | 218 | 147ms/token | 203ms/token | 91GB |

Source: My measurements over 72 hours, A100 80GB, vLLM 0.2.7, FP16.
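A quick sanity check on the table: with continuous batching, aggregate output throughput should be roughly the batch size divided by the per-token latency. This is a minimal sketch that assumes the p50 column is the per-request time between tokens (an interpretation of the table, not something the benchmark output states); the function name is illustrative.

```python
# Rows from the throughput table:
# (batch size, measured output tokens/s, p50 latency in ms/token)
ROWS = [
    (1, 42, 24),
    (4, 95, 42),
    (8, 156, 51),
    (16, 201, 80),
    (32, 218, 147),
]


def implied_throughput(batch_size: int, p50_ms_per_token: float) -> float:
    """Aggregate tokens/s if each request in the batch emits one token per p50 interval."""
    return batch_size * 1000.0 / p50_ms_per_token


for batch, measured, p50 in ROWS:
    implied = implied_throughput(batch, p50)
    print(f"batch {batch:>2}: measured {measured:>3} tok/s, implied {implied:6.1f} tok/s")
```

Under that assumption the implied figures land within about 1% of the measured ones for every row, so the throughput and latency columns are consistent with each other.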

At batch size 4 (a reasonable real-world scenario with moderate traffic), Mixtral pushes 95 tokens/second. That translates to about 342,000 tokens per hour, or roughly 8.2 million tokens per day.
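The unit conversion is easy to get wrong by a factor of 24, so here is a minimal sketch of it. The helper names are made up for illustration, and it assumes the GPU sustains the measured rate around the clock.

```python
SECONDS_PER_HOUR = 3_600
SECONDS_PER_DAY = 86_400


def tokens_per_hour(tokens_per_second: float) -> float:
    """Tokens generated in one hour at a sustained rate."""
    return tokens_per_second * SECONDS_PER_HOUR


def tokens_per_day(tokens_per_second: float) -> float:
    """Tokens generated in 24 hours at a sustained rate."""
    return tokens_per_second * SECONDS_PER_DAY


rate = 95  # tokens/second at batch size 4, from the throughput table
print(f"{tokens_per_hour(rate):,.0f} tokens/hour")  # 342,000 tokens/hour
print(f"{tokens_per_day(rate):,.0f} tokens/day")    # 8,208,000 tokens/day
```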

The cost math

Here's where it gets interesting. At $1.29/hr for the A100 rental:

| Model | Provider | $/M input tokens | $/M output tokens | Quality (MMLU) |
|-------|----------|-----------------|-------------------|---------------|
| GPT-3.5-turbo | OpenAI API | $0.50 | $1.50 | 70.0% |
| GPT-3.5-turbo (fine-tuned) | OpenAI API | $3.00 | $6.00 | Varies |
| Mixtral 8x7B | Together AI | $0.60 | $0.60 | 70.6% |
| Mixtral 8x7B | Self-hosted (A100) | ~$0.16 | ~$0.16 | 70.6% |
| Mixtral 8x7B | Self-hosted (2x RTX 4090) | ~$0.09 | ~$0.09 | 70.6% |

Sources: Provider pricing pages as of January 2024, my throughput calculations.

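Wait. Let me double-check that self-hosted number. At batch size 4: 95 tokens/second = 342,000 tokens/hour. At $1.29/hour, that's $0.00000377 per token, or $3.77 per million tokens. Getting down to ~$0.16 per million takes about 8.1 million tokens per rental hour (roughly 2,240 tokens/second), far more concurrency than the batch sizes in my throughput table.

So the self-hosted rows are a high-throughput figure, not a batch-size-4 figure. At that throughput, Mixtral self-hosted on an A100 is roughly 9x cheaper than GPT-3.5-turbo's output pricing, and 3.7x cheaper than Together AI's hosted Mixtral. At the batch sizes I measured it is not: batch size 32 (218 tokens/second) works out to about $1.64 per million, close to GPT-3.5-turbo's $1.50.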
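A minimal sketch of the rental-cost arithmetic, for plugging in any hourly rate and throughput. The function and variable names are illustrative; the $1.29/hr rate and the throughput figures come from the tables above. It assumes the GPU is billed for every hour it runs and is kept busy at the stated rate the whole time.

```python
SECONDS_PER_HOUR = 3_600


def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per million generated tokens for a GPU rented by the hour."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def required_tokens_per_second(hourly_rate_usd: float, target_usd_per_million: float) -> float:
    """Sustained throughput needed to hit a target price per million tokens."""
    tokens_per_hour = hourly_rate_usd / target_usd_per_million * 1_000_000
    return tokens_per_hour / SECONDS_PER_HOUR


A100_HOURLY_USD = 1.29
# Batch size -> measured output tokens/second, from the throughput table
MEASURED = {1: 42, 4: 95, 8: 156, 16: 201, 32: 218}

for batch, tps in MEASURED.items():
    cost = cost_per_million_tokens(A100_HOURLY_USD, tps)
    print(f"batch {batch:>2}: ${cost:.2f} per million tokens")

needed = required_tokens_per_second(A100_HOURLY_USD, 0.16)
print(f"{needed:,.0f} tokens/s needed for $0.16 per million")
```

Running the helper in both directions makes it easy to see which throughput a given price claim depends on.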

If you own the hardware (or use a cheaper cloud like RunPod at ~$0.80/hr for an A100), the economics get even more favorable.

The break-even analysis

Self-hosting isn't free, though. You need:

  1. GPU rental or ownership
  2. Someone to manage the infrastructure
  3. Monitoring, load balancing, failover

I estimated the break-even point:

| Monthly token volume | GPT-3.5-turbo API cost | Self-hosted Mixtral cost | Savings |
|---------------------|----------------------|------------------------|---------|
| 10M tokens | $15 | $120 (fixed cost dominates) | -$105 |
| 100M tokens | $150 | $155 | -$5 |
| 500M tokens | $750 | $240 | +$510 |
| 1B tokens | $1,500 | $360 | +$1,140 |
| 10B tokens | $15,000 | $2,300 | +$12,700 |

Source: My estimates. Self-hosted cost assumes dedicated A100, 30-day amortized rental, 70% utilization.

The crossover point is around 100M tokens per month. Below that, the API is simpler and roughly the same cost. Above it, self-hosting wins quickly.
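The savings column and the crossover can be reproduced from the table. This sketch takes the self-hosted monthly costs as given (they are estimates copied from the table, not derived here), prices API usage at the $1.50 per million output rate that the API column appears to use, and linearly interpolates between rows to locate the break-even volume. All names are illustrative.

```python
API_USD_PER_MILLION = 1.50  # GPT-3.5-turbo output price, as the API column appears to use

# (monthly volume in millions of tokens, estimated self-hosted monthly cost in USD)
SELF_HOSTED_ESTIMATES = [
    (10, 120),
    (100, 155),
    (500, 240),
    (1_000, 360),
    (10_000, 2_300),
]


def savings(volume_millions: float, self_hosted_usd: float) -> float:
    """Monthly API cost minus self-hosted cost; positive means self-hosting is cheaper."""
    return volume_millions * API_USD_PER_MILLION - self_hosted_usd


def crossover_volume(estimates):
    """Interpolate the volume (in millions of tokens) where savings cross zero."""
    points = [(v, savings(v, c)) for v, c in estimates]
    for (v0, s0), (v1, s1) in zip(points, points[1:]):
        if s0 < 0 <= s1:
            return v0 + (v1 - v0) * (-s0) / (s1 - s0)
    return None  # savings never change sign within the table


for volume, cost in SELF_HOSTED_ESTIMATES:
    print(f"{volume:>6}M tokens: savings {savings(volume, cost):+,.0f} USD")

print(f"break-even near {crossover_volume(SELF_HOSTED_ESTIMATES):.0f}M tokens/month")
```

With these inputs the interpolated break-even lands just above 100M tokens per month (about 104M), which is only as good as the self-hosted cost estimates fed into it.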

For context, 100M tokens is approximately 75,000 pages of text. If your application processes fewer than 75K pages of text per month, just use the API.

Quality verification

I didn't want to just trust the community benchmarks. I ran my own 200-prompt eval comparing Mixtral to GPT-3.5-turbo (January 2024 version):

| Task type | Mixtral 8x7B win rate | GPT-3.5-turbo win rate | Tie |
|-----------|---------------------|----------------------|-----|
| Factual Q&A | 46% | 41% | 13% |
| Summarization | 38% | 52% | 10% |
| Code generation | 43% | 48% | 9% |
| Creative writing | 51% | 39% | 10% |
| Instruction following | 40% | 49% | 11% |
| Overall | 43.6% | 45.8% | 10.6% |

Source: My evaluation, 200 prompts, blind rating by me. Not a large sample, but directionally useful.

GPT-3.5-turbo still has a slight edge overall (45.8% vs 43.6%), especially on summarization and instruction following. But the gap is small enough that for most applications, the difference won't matter. Mixtral wins on creative writing, which surprised me.
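For anyone checking the "Overall" row: it matches an unweighted mean of the five task rows. The sketch below assumes the prompts were split evenly across task types (the table does not say how the 200 prompts were distributed), and the names are illustrative.

```python
# Per-task results from the table: (Mixtral win %, GPT-3.5-turbo win %, tie %)
RESULTS = {
    "Factual Q&A": (46, 41, 13),
    "Summarization": (38, 52, 10),
    "Code generation": (43, 48, 9),
    "Creative writing": (51, 39, 10),
    "Instruction following": (40, 49, 11),
}


def overall(results):
    """Unweighted mean of each column across task types, rounded to one decimal."""
    n = len(results)
    return tuple(
        round(sum(row[i] for row in results.values()) / n, 1) for i in range(3)
    )


mixtral, gpt35, tie = overall(RESULTS)
print(mixtral, gpt35, tie)  # 43.6 45.8 10.6

# Size of the overall gap expressed in prompts, for a 200-prompt eval
gap_in_prompts = (gpt35 - mixtral) / 100 * 200
print(f"gap is about {gap_in_prompts:.1f} prompts out of 200")
```

A gap of four or five prompts out of 200 is small, which is why the result is best read as directional.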

What this means

Three things:

The "GPT-3.5 tier" is now commodity. If an open source model matches it on quality and undercuts it 9x on price, the value of GPT-3.5-turbo is basically "convenience." You don't want to manage infrastructure? Pay OpenAI's premium. You're running volume? Run Mixtral.

MoE architecture is the cheat code. Mixtral uses only 12.9B active parameters per token (out of roughly 46.7B total) to match a model that likely has 100B+ active parameters. The mixture-of-experts approach gives you the knowledge of a much larger model at the compute cost of a smaller one, though you still need the memory to hold every expert. Expect every lab to ship MoE variants in 2024.
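The 12.9B figure can be roughly reproduced from the model's shape. This sketch uses dimensions from the published Mixtral 8x7B config (hidden size 4096, feed-forward size 14336, 32 layers, 8 experts with 2 routed per token, grouped-query attention with 8 key/value heads, 32k vocabulary). Those values are not from this post, and the count ignores the small normalization weights, so treat it as an approximation.

```python
# Assumed model dimensions, taken from the published Mixtral 8x7B config
HIDDEN = 4096
FFN = 14336
LAYERS = 32
EXPERTS = 8
EXPERTS_PER_TOKEN = 2
KV_DIM = 1024  # 8 key/value heads x 128 dimensions each
VOCAB = 32_000

expert = 3 * HIDDEN * FFN                              # gate, up and down projections
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q and o, plus k and v
router = HIDDEN * EXPERTS                              # per-layer gating weights
embeddings = 2 * VOCAB * HIDDEN                        # input embedding + output head


def param_count(experts_used: int) -> int:
    """Parameters touched per token when `experts_used` experts run in each layer."""
    per_layer = attention + router + experts_used * expert
    return LAYERS * per_layer + embeddings


total = param_count(EXPERTS)
active = param_count(EXPERTS_PER_TOKEN)
print(f"total:  {total / 1e9:.1f}B")   # total:  46.7B
print(f"active: {active / 1e9:.1f}B")  # active: 12.9B
```

Only the expert term scales with the number of experts used, which is why routing 2 of 8 cuts per-token compute so sharply while the full weights still have to sit in memory.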

The real price war hasn't started yet. If self-hosted open source is already 9x cheaper than API pricing, and hosted providers like Together AI are 2.5x cheaper on output tokens ($0.60 vs $1.50 per million), there's enormous room for API prices to fall further. OpenAI will have to respond. I predict at least two more major GPT-3.5-turbo price cuts in 2024.

My morning routine now includes checking new Mixtral benchmark results alongside my coffee. This is what kaizen looks like for data nerds: constant, incremental progress in the numbers.

The inference economics have shifted. They're not going back.



-- dataku
