The cost of self-hosting vs API: a real comparison for Llama 2
Can you actually save money running Llama 2 yourself instead of using the OpenAI API? I calculated it. The answer depends on your volume, but the break-even point is lower than I expected.
Everyone's talking about running Llama 2 on their own hardware. But nobody's showing the full cost math. So I did the math.
The question is simple: at what usage volume does self-hosting Llama 2 become cheaper than using the OpenAI API?
The options
I'm comparing four setups, from easiest to hardest:
| Option | What it is | Setup effort | Min cost/month |
|--------|-----------|-------------|---------------|
| OpenAI GPT-3.5-turbo API | Managed API, pay per token | None | $0 (pay per use) |
| Together AI Llama 2 70B | Third-party hosted Llama 2 | API key signup | $0 (pay per use) |
| RunPod cloud GPU | Rent a GPU, host yourself | Medium (deploy model) | ~$800 |
| Lambda Labs dedicated | Dedicated GPU server | High (full setup) | ~$1,400 |
Let me break down the per-token economics for each.
Per-token cost comparison
For Llama 2 70B-chat at roughly 1,000-2,000 tokens/second throughput on an A100-80GB:
| Option | Cost per 1M input tokens | Cost per 1M output tokens | Notes |
|--------|------------------------|-------------------------|-------|
| GPT-3.5-turbo | $2.00 | $2.00 | Same price in/out |
| Together AI Llama 2 70B | $0.90 | $0.90 | Cheapest managed option |
| RunPod A100-80GB | ~$0.40 | ~$0.40 | Based on $1.64/hr, ~1,100 tok/s |
| Lambda A100-80GB | ~$0.35 | ~$0.35 | Based on $1.10/hr reserved |
| Self-owned A100 | ~$0.08 | ~$0.08 | Amortized over 3 years |
Sources: OpenAI pricing, Together AI pricing, RunPod pricing, Lambda Labs pricing, vLLM throughput benchmarks.
The self-owned A100 number is based on buying a used A100-80GB for roughly $12,000, amortizing over 3 years, plus ~$50/month in electricity. It's the cheapest per token, but the upfront cost and maintenance are significant.
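The rented-GPU figures come down to one division: hourly rate over tokens served per hour. Here is a minimal sketch of that arithmetic, assuming the GPU runs at a sustained throughput with no idle time. The function names are mine, and the throughput values are illustrative points inside the 1,000-2,000 tok/s range quoted above, not measurements.

```python
SECONDS_PER_HOUR = 3600


def rental_cost_per_million(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """USD per 1M tokens for a rented GPU that is busy 100% of the time."""
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate_usd / tokens_per_hour * 1_000_000


def owned_monthly_cost(purchase_usd: float, months: int, electricity_usd: float) -> float:
    """Straight-line amortization of the purchase plus a flat electricity bill."""
    return purchase_usd / months + electricity_usd


if __name__ == "__main__":
    # Assumed throughputs; the article only gives a 1,000-2,000 tok/s range.
    for tok_s in (1_000, 1_500, 2_000):
        cost = rental_cost_per_million(1.64, tok_s)
        print(f"$1.64/hr at {tok_s} tok/s: ${cost:.2f} per 1M tokens")
    # Used A100-80GB at $12,000 over 36 months plus $50/month electricity.
    print(f"Owned A100: ${owned_monthly_cost(12_000, 36, 50):.2f}/month")
```

At $1.64/hr, the ~$0.40 figure corresponds to a sustained rate of about 1,140 tok/s. Any idle time pushes the effective per-token cost up, because the GPU bills by the hour whether or not it is serving requests.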
The break-even analysis
Now the interesting part. At what monthly volume does each option become cheaper than GPT-3.5-turbo?
| Option | Fixed monthly cost | Variable $/1M tokens | Break-even vs GPT-3.5-turbo |
|--------|-------------------|---------------------|----------------------------|
| GPT-3.5-turbo | $0 | $2.00 | Baseline |
| Together AI | $0 | $0.90 | Immediately (always cheaper) |
| RunPod A100 | ~$800* | $0.40 | ~500M tokens/month |
| Lambda A100 | ~$1,400* | $0.35 | ~848M tokens/month |
*Monthly cost for a single dedicated GPU. RunPod spot pricing can be lower; Lambda requires monthly commitment.
Let me show the arithmetic. If RunPod costs $800/month fixed and saves $1.60 per million tokens vs GPT-3.5 ($2.00 - $0.40):
Break-even = $800 / $1.60 = 500M tokens/month
500 million tokens per month. That's about 16.7 million tokens per day, or roughly 8,300 full-length API calls (2,000 tokens each) per day.
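The same calculation works for any pair of options. This is a small sketch using the fixed and variable figures from the break-even table; the function name and the 30-day month are my assumptions.

```python
def break_even_millions(fixed_monthly_usd: float,
                        api_price_per_million: float,
                        self_price_per_million: float) -> float:
    """Monthly volume (in millions of tokens) where self-hosting matches the API bill."""
    saving = api_price_per_million - self_price_per_million
    if saving <= 0:
        raise ValueError("no break-even: self-hosting is not cheaper per token")
    return fixed_monthly_usd / saving


if __name__ == "__main__":
    runpod = break_even_millions(800, 2.00, 0.40)
    lambda_labs = break_even_millions(1_400, 2.00, 0.35)
    print(f"RunPod break-even: {runpod:.0f}M tokens/month")
    print(f"Lambda break-even: {lambda_labs:.0f}M tokens/month")

    # Convert the RunPod figure to daily load, assuming a 30-day month
    # and 2,000 tokens per call.
    tokens_per_day = runpod * 1_000_000 / 30
    print(f"{tokens_per_day / 1e6:.1f}M tokens/day, "
          f"{tokens_per_day / 2_000:.0f} calls/day")
```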
For a startup running a chatbot? You'd need to be serving thousands of active daily users to hit that volume. For a solo developer or small team, the break-even is too high to justify dedicated hardware.
But for a mid-size company already spending $1,000+ per month on OpenAI? The math flips fast.
The realistic scenarios
Let me model three real-world usage patterns:
Solo developer (personal projects, prototyping)
| Volume | GPT-3.5 cost | Together AI | RunPod | Best option |
|--------|-------------|-------------|--------|-------------|
| 10M tokens/month | $20 | $9 | $800+ | Together AI |
| 50M tokens/month | $100 | $45 | $800+ | Together AI |
| 100M tokens/month | $200 | $90 | $800+ | Together AI |
For solo use, a hosted Llama 2 API is the clear winner. You save 55% vs OpenAI ($0.90 vs $2.00 per 1M tokens) with zero infrastructure work.
Startup (product in production)
| Volume | GPT-3.5 cost | Together AI | RunPod | Best option |
|--------|-------------|-------------|--------|-------------|
| 500M tokens/month | $1,000 | $450 | $800 | Together AI |
| 1B tokens/month | $2,000 | $900 | $800 | RunPod |
| 5B tokens/month | $10,000 | $4,500 | ~$1,600-$2,400 (2-3 GPUs) | RunPod (2-3 GPUs) |

At 500M tokens/month, a cloud GPU already beats GPT-3.5-turbo, but Together AI is still cheaper. Self-hosting overtakes Together AI at around 900M tokens/month ($800 / $0.90 per 1M). Beyond that it keeps winning, but you need someone on the team who can manage GPU infrastructure, handle model updates, and deal with downtime.
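To check any volume against all three options, here is a rough sketch that follows the scenario tables: the APIs are billed per token and RunPod is counted as a flat ~$800 per GPU per month. The names and the flat-fee simplification are my assumptions, and the GPU count is an input rather than something the code derives.

```python
# Prices per 1M tokens, taken from the per-token table.
API_PRICE_PER_MILLION = {"GPT-3.5-turbo": 2.00, "Together AI": 0.90}
# Flat monthly rental per GPU, as used in the scenario tables.
RUNPOD_MONTHLY_PER_GPU = 800.0


def monthly_costs(tokens_millions: float, gpus: int = 1) -> dict:
    """Monthly bill in USD for each option at the given volume."""
    costs = {name: price * tokens_millions
             for name, price in API_PRICE_PER_MILLION.items()}
    costs["RunPod"] = RUNPOD_MONTHLY_PER_GPU * gpus
    return costs


def cheapest(tokens_millions: float, gpus: int = 1) -> str:
    """Name of the lowest-cost option at the given volume."""
    costs = monthly_costs(tokens_millions, gpus)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # (volume in millions of tokens, GPUs assumed for RunPod)
    for volume, gpus in ((100, 1), (500, 1), (1_000, 1), (5_000, 3)):
        print(f"{volume}M tokens, {gpus} GPU(s): {cheapest(volume, gpus)}")
```

With these inputs, the flat $800 undercuts Together AI once volume passes about 889M tokens/month ($800 / $0.90).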
Enterprise (high volume)
| Volume | GPT-3.5 cost | Self-hosted (owned) | Savings |
|--------|-------------|-------------------|---------|
| 10B tokens/month | $20,000 | ~$1,200 (4 GPUs) | 94% |
| 50B tokens/month | $100,000 | ~$5,000 (16 GPUs) | 95% |
At enterprise scale, self-hosting is overwhelmingly cheaper. The savings fund an entire infrastructure team and still leave money on the table.
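The GPU counts depend on how many tokens one card can serve in a month. This is a back-of-the-envelope sizing sketch that assumes a sustained 1,250 tok/s per GPU (my pick from inside the 1,000-2,000 range quoted earlier) and a 30-day month. Real deployments need headroom for traffic peaks, which the `utilization` parameter approximates.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def gpus_needed(tokens_per_month: float,
                tokens_per_second_per_gpu: float,
                utilization: float = 1.0) -> int:
    """Smallest GPU count whose combined monthly capacity covers the volume."""
    capacity = tokens_per_second_per_gpu * SECONDS_PER_MONTH * utilization
    return math.ceil(tokens_per_month / capacity)


def savings_fraction(api_cost_usd: float, self_cost_usd: float) -> float:
    """Share of the API bill saved by self-hosting."""
    return 1 - self_cost_usd / api_cost_usd


if __name__ == "__main__":
    for volume in (10e9, 50e9):
        print(f"{volume / 1e9:.0f}B tokens/month: "
              f"{gpus_needed(volume, 1_250)} GPUs at 1,250 tok/s")
    print(f"Savings at 10B: {savings_fraction(20_000, 1_200):.0%}")
    print(f"Savings at 50B: {savings_fraction(100_000, 5_000):.0%}")
```

Under that throughput assumption, the sketch gives 4 and 16 GPUs, the same counts as the table. At 1,000 tok/s the 50B case would need 20.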
The hidden costs I almost forgot
Running your own LLM isn't just GPU rental. There's overhead:
| Hidden cost | Impact | Who pays for this |
|------------|--------|-------------------|
| Setup time | 4-20 hours to get a model serving reliably | Your engineering time |
| Monitoring | Need to track latency, errors, GPU utilization | DevOps tooling |
| Scaling | Can't auto-scale like an API; need to provision GPUs | Over-provisioning cost |
| Updates | New model versions require manual deployment | Engineering time |
| Downtime | No SLA; if your GPU crashes at 2am, it's your problem | On-call engineers |
| vLLM/TGI setup | Optimized serving requires specific inference engines | Engineering time |
For a team of 3 engineers, I'd estimate the hidden cost at 10-20 hours/month of engineering time. At $100/hour loaded cost, that's $1,000-$2,000/month in labor. Added to the ~$800 GPU bill, that pushes the RunPod break-even against GPT-3.5-turbo from ~500M to roughly 1.1B-1.75B tokens/month.
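Labor enters the break-even formula as extra fixed cost. Here is a minimal sketch using the RunPod numbers from the break-even table ($800/month, $1.60 saved per 1M tokens vs GPT-3.5-turbo) and the 10-20 hours at $100/hour estimated above. The function name is hypothetical.

```python
def break_even_with_labor(gpu_monthly_usd: float,
                          hours_per_month: float,
                          hourly_rate_usd: float,
                          saving_per_million_usd: float) -> float:
    """Break-even volume (millions of tokens/month) once engineering time is counted."""
    fixed = gpu_monthly_usd + hours_per_month * hourly_rate_usd
    return fixed / saving_per_million_usd


if __name__ == "__main__":
    for hours in (0, 10, 20):
        volume = break_even_with_labor(800, hours, 100, 1.60)
        print(f"{hours:>2} h/month of engineering: break-even at {volume:,.0f}M tokens/month")
```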
My recommendation
| Your situation | Best option | Why |
|----------------|------------|-----|
| Under 100M tokens/month | Together AI Llama 2 API | Cheapest, zero maintenance |
| 100M-500M tokens/month | Together AI or consider RunPod | Depends on team capability |
| 500M-5B tokens/month | Cloud GPU (RunPod/Lambda) | Clear cost advantage |
| Over 5B tokens/month | Self-hosted hardware | Massive savings at scale |
The break-even point is lower than I expected. I thought you'd need billions of tokens per month before self-hosting made sense. For a cloud GPU setup, the crossover against GPT-3.5-turbo happens around 500M tokens/month before labor, which is within reach for a startup with a production LLM product and thousands of daily users.
The real barrier isn't cost. It's capability. If your team can manage GPU infrastructure, self-hosting Llama 2 saves real money. If they can't, the API premium is well worth it.
If you found this interesting, you might also like:
- Stable Diffusion is free. The pricing math of open source image generation.
- GPT-4 is 10x more expensive than GPT-3.5. Is it 10x better?
- Wait, GPT-3 costs HOW much per token?
- Codex and the cost of code generation: my first pricing analysis
- The cost of running an AI startup in 2022: a data breakdown
-- dataku