The cost of self-hosting vs API: a real comparison for Llama 2

Everyone's talking about running Llama 2 on their own hardware. But nobody's showing the full cost math. So I did the math.

The question is simple: at what usage volume does self-hosting Llama 2 become cheaper than using the OpenAI API?

The options

I'm comparing four setups, from easiest to hardest:

| Option | What it is | Setup effort | Min cost/month |
|--------|-----------|-------------|---------------|
| OpenAI GPT-3.5-turbo API | Managed API, pay per token | None | $0 (pay per use) |
| Together AI Llama 2 70B | Third-party hosted Llama 2 | API key signup | $0 (pay per use) |
| RunPod cloud GPU | Rent a GPU, host yourself | Medium (deploy model) | ~$800 |
| Lambda Labs dedicated | Dedicated GPU server | High (full setup) | ~$1,400 |

Let me break down the per-token economics for each.

Per-token cost comparison

For Llama 2 70B-chat at roughly 1,000-2,000 tokens/second throughput on an A100-80GB:

| Option | Cost per 1M input tokens | Cost per 1M output tokens | Notes |
|--------|------------------------|-------------------------|-------|
| GPT-3.5-turbo | $2.00 | $2.00 | Same price in/out |
| Together AI Llama 2 70B | $0.90 | $0.90 | Cheapest managed option |
| RunPod A100-80GB | ~$0.40 | ~$0.40 | Based on $1.64/hr, ~2,500 tok/s |
| Lambda A100-80GB | ~$0.35 | ~$0.35 | Based on $1.10/hr reserved |
| Self-owned A100 | ~$0.08 | ~$0.08 | Amortized over 3 years |

Sources: OpenAI pricing, Together AI pricing, RunPod pricing, Lambda Labs pricing, vLLM throughput benchmarks.
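The rented-GPU rows come down to two inputs: the hourly price and the throughput you actually sustain. Here is a minimal sketch of that conversion, assuming the GPU is busy for every billed hour. The function name `cost_per_million_tokens` is mine, not part of any provider's SDK.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Dollars per 1M tokens for a rented GPU that is busy every billed hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


# RunPod's $1.64/hr at a few sustained throughputs
for tps in (1_000, 2_000, 2_500):
    print(f"{tps:>5} tok/s -> ${cost_per_million_tokens(1.64, tps):.2f} per 1M tokens")
```

At $1.64/hr this works out to about $0.46 per 1M tokens at 1,000 tok/s and about $0.18 at 2,500 tok/s, so the per-token figure is very sensitive to real-world throughput. Idle hours push it higher still.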

The self-owned A100 number is based on buying a used A100-80GB for roughly $12,000, amortizing over 3 years, plus ~$50/month in electricity. It's the cheapest per token, but the upfront cost and maintenance are significant.
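To make the amortization concrete, here is a rough sketch using the figures from the paragraph above. It assumes straight-line amortization and ignores resale value, hosting, and repairs; the function name is illustrative.

```python
def owned_gpu_monthly_cost(purchase_usd: float, years: float,
                           electricity_usd_per_month: float) -> float:
    """Straight-line amortization of the purchase price plus monthly electricity."""
    return purchase_usd / (years * 12) + electricity_usd_per_month


monthly = owned_gpu_monthly_cost(12_000, 3, 50)  # about $383/month

# Monthly volume at which that cost works out to $0.08 per 1M tokens
tokens_needed = monthly / 0.08 * 1_000_000
print(f"${monthly:.0f}/month, {tokens_needed / 1e9:.1f}B tokens/month for $0.08 per 1M")
```

Put differently, the ~$0.08 figure only holds if the card is processing roughly 4.8B tokens a month, which means running near the top of the 1,000-2,000 tok/s range around the clock.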

The break-even analysis

Now the interesting part. At what monthly volume does each option become cheaper than GPT-3.5-turbo?

| Option | Fixed monthly cost | Variable $/1M tokens | Break-even vs GPT-3.5-turbo |
|--------|-------------------|---------------------|----------------------------|
| GPT-3.5-turbo | $0 | $2.00 | Baseline |
| Together AI | $0 | $0.90 | Immediately (always cheaper) |
| RunPod A100 | ~$800* | $0.40 | ~500M tokens/month |
| Lambda A100 | ~$1,400* | $0.35 | ~848M tokens/month |

*Monthly cost for a single dedicated GPU. RunPod spot pricing can be lower; Lambda requires monthly commitment.

Here's the arithmetic for RunPod. It costs $800/month fixed and saves $1.60 per million tokens vs GPT-3.5 ($2.00 - $0.40):

Break-even = $800 / $1.60 = 500M tokens/month

500 million tokens per month. That's about 16.7 million tokens per day, or roughly 8,000 full-length API calls (2,000 tokens each) per day.
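The same calculation as a small function, so you can plug in your own numbers. This is a sketch of the break-even model used in the table (fixed monthly cost divided by the per-million saving); the function and variable names are my own.

```python
def break_even_tokens_millions(fixed_monthly_usd: float,
                               api_usd_per_m: float,
                               self_usd_per_m: float) -> float:
    """Monthly volume (millions of tokens) where self-hosting cost equals API cost."""
    saving_per_m = api_usd_per_m - self_usd_per_m
    if saving_per_m <= 0:
        raise ValueError("no break-even: self-hosted rate is not below the API rate")
    return fixed_monthly_usd / saving_per_m


runpod = break_even_tokens_millions(800, 2.00, 0.40)         # about 500
lambda_labs = break_even_tokens_millions(1_400, 2.00, 0.35)  # about 848

# Convert the RunPod break-even to calls per day (30-day month, 2,000 tokens per call)
calls_per_day = runpod * 1_000_000 / 30 / 2_000              # about 8,333
print(f"RunPod: {runpod:.0f}M, Lambda: {lambda_labs:.0f}M, {calls_per_day:.0f} calls/day")
```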

For a startup running a chatbot? You'd need to be serving thousands of active daily users to hit that volume. For a solo developer or small team, the break-even is too high to justify dedicated hardware.

But for a mid-size company already spending $1,000+ per month on OpenAI? The math flips fast.

The realistic scenarios

Let me model three real-world usage patterns:

Solo developer (personal projects, prototyping)

| Volume | GPT-3.5 cost | Together AI | RunPod | Best option |
|--------|-------------|-------------|--------|-------------|
| 10M tokens/month | $20 | $9 | $800+ | Together AI |
| 50M tokens/month | $100 | $45 | $800+ | Together AI |
| 100M tokens/month | $200 | $90 | $800+ | Together AI |

For solo use, a hosted Llama 2 API is the clear winner. You save 55% vs OpenAI ($0.90 vs $2.00 per 1M tokens) with zero infrastructure work.
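Each cell in the solo-developer table is just fixed cost plus volume times rate. A minimal sketch, assuming the prices listed in this post and treating RunPod as a flat $800/month rental for one GPU (the `OPTIONS` dictionary and function names are illustrative):

```python
# (fixed $/month, $ per 1M tokens), taken from the tables in this post
OPTIONS = {
    "GPT-3.5-turbo": (0, 2.00),
    "Together AI": (0, 0.90),
    "RunPod": (800, 0.00),  # flat rental; assumes one GPU covers the volume
}


def monthly_costs(tokens_millions: float) -> dict:
    """Monthly bill per option at the given volume (in millions of tokens)."""
    return {name: fixed + rate * tokens_millions
            for name, (fixed, rate) in OPTIONS.items()}


def cheapest(tokens_millions: float) -> str:
    """Name of the option with the lowest monthly bill."""
    costs = monthly_costs(tokens_millions)
    return min(costs, key=costs.get)


for volume in (10, 50, 100):
    print(f"{volume}M tokens/month -> {cheapest(volume)}")
```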

Startup (product in production)

| Volume | GPT-3.5 cost | Together AI | RunPod | Best option |
|--------|-------------|-------------|--------|-------------|
| 500M tokens/month | $1,000 | $450 | $800 | Together AI |
| 1B tokens/month | $2,000 | $900 | $800 | RunPod |
| 5B tokens/month | $10,000 | $4,500 | ~$1,600-$2,400 (2-3 GPUs) | RunPod (2-3 GPUs) |

At 500M tokens/month, a single cloud GPU already undercuts GPT-3.5-turbo, and by around 1B tokens/month it undercuts Together AI as well. But you need someone on the team who can manage GPU infrastructure, handle model updates, and deal with downtime.
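The "2-3 GPUs" note in the 5B row depends on how much each GPU can actually serve in a month. Here is a rough capacity sketch, assuming a 30-day month; the `utilization` parameter is my own addition to account for off-peak idle time.

```python
import math

SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month


def gpus_needed(tokens_per_month: float,
                tokens_per_second_per_gpu: float,
                utilization: float = 1.0) -> int:
    """Number of GPUs required to serve a monthly token volume."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    capacity = tokens_per_second_per_gpu * SECONDS_PER_MONTH * utilization
    return math.ceil(tokens_per_month / capacity)


print(gpus_needed(5e9, 1_000))                   # 2 GPUs at full utilization
print(gpus_needed(5e9, 1_000, utilization=0.7))  # 3 GPUs at 70% utilization
```

At 1,000 tok/s per GPU, 5B tokens/month needs 2 GPUs running flat out or 3 at 70% utilization, which matches the 2-3 GPU range.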

Enterprise (high volume)

| Volume | GPT-3.5 cost | Self-hosted (owned) | Savings |
|--------|-------------|-------------------|---------|
| 10B tokens/month | $20,000 | ~$1,200 (4 GPUs) | 94% |
| 50B tokens/month | $100,000 | ~$5,000 (16 GPUs) | 95% |

At enterprise scale, self-hosting is overwhelmingly cheaper. The savings fund an entire infrastructure team and still leave money on the table.

The hidden costs I almost forgot

Running your own LLM isn't just GPU rental. There's overhead:

| Hidden cost | Impact | Who pays for this |
|------------|--------|-------------------|
| Setup time | 4-20 hours to get a model serving reliably | Your engineering time |
| Monitoring | Need to track latency, errors, GPU utilization | DevOps tooling |
| Scaling | Can't auto-scale like an API; need to provision GPUs | Over-provisioning cost |
| Updates | New model versions require manual deployment | Engineering time |
| Downtime | No SLA; if your GPU crashes at 2am, it's your problem | On-call engineers |
| vLLM/TGI setup | Optimized serving requires specific inference engines | Engineering time |

For a team of 3 engineers, I'd estimate the hidden cost at 10-20 hours/month of engineering time. At $100/hour loaded cost, that's $1,000-$2,000/month in labor. Added to RunPod's ~$800/month, that moves the break-even vs GPT-3.5-turbo from 500M to roughly 1.1B-1.75B tokens/month.
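A rough way to fold labor into the break-even is to treat engineering hours as extra fixed monthly cost. This sketch uses the RunPod figures from earlier in the post ($800/month, $1.60 saved per 1M tokens); the function name is illustrative.

```python
def effective_break_even_millions(gpu_monthly_usd: float,
                                  labor_hours: float,
                                  hourly_rate_usd: float,
                                  saving_per_m_usd: float) -> float:
    """Break-even volume (millions of tokens/month) with labor added to fixed cost."""
    if saving_per_m_usd <= 0:
        raise ValueError("saving_per_m_usd must be positive")
    labor_usd = labor_hours * hourly_rate_usd
    return (gpu_monthly_usd + labor_usd) / saving_per_m_usd


low = effective_break_even_millions(800, 10, 100, 1.60)   # about 1,125
high = effective_break_even_millions(800, 20, 100, 1.60)  # about 1,750
print(f"{low:.0f}M to {high:.0f}M tokens/month")
```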

My recommendation

| Your situation | Best option | Why |
|----------------|------------|-----|
| Under 100M tokens/month | Together AI Llama 2 API | Cheapest, zero maintenance |
| 100M-500M tokens/month | Together AI or consider RunPod | Depends on team capability |
| 500M-5B tokens/month | Cloud GPU (RunPod/Lambda) | Clear cost advantage |
| Over 5B tokens/month | Self-hosted hardware | Massive savings at scale |

The break-even point is lower than I expected. I thought you'd need billions of tokens per month before self-hosting made sense. For a cloud GPU setup, the crossover happens around 500M tokens/month, which is achievable for any startup with a production LLM product.

The real barrier isn't cost. It's capability. If your team can manage GPU infrastructure, self-hosting Llama 2 saves real money. If they can't, the API premium is well worth it.


-- dataku
