The cost of running Llama 3.1 405B: cloud vs self-hosted, the full math

Llama 3.1 405B is the first open-source model that matches GPT-4 on the benchmarks I wrote about last week. This week, let's talk about the part nobody finds sexy but everybody needs: what it actually costs to run.

I priced out four configurations, ran throughput tests on two of them, and built a cost calculator. Here's everything.

Hardware requirements

First, the basics. Llama 3.1 405B has 405 billion parameters. Each parameter in FP16 takes 2 bytes. So the model weights alone need 810GB of memory.

| Precision | Memory needed (weights only) | Memory with KV cache (128K context) | Min GPUs (80GB each) |
|-----------|------------------------------|-------------------------------------|----------------------|
| FP16 | 810GB | ~880GB | 11x A100 80GB |
| BF16 | 810GB | ~880GB | 11x A100 80GB |
| INT8 | 405GB | ~475GB | 6x A100 80GB |
| 4-bit (GPTQ/AWQ) | ~203GB | ~273GB | 4x A100 80GB |

Source: Parameter count from Meta AI Llama 3.1 paper, memory calculations based on standard formulas, KV cache estimates for 128K context at batch size 1.

At FP16, you need at least 11x 80GB GPUs to hold everything in GPU memory. In practice, 8x A100 or 8x H100 (640GB total) can handle it with tensor parallelism and offloading tricks, but you lose some throughput. With INT8 quantization, you need 6-8 GPUs. With 4-bit, you can technically fit it on 4 GPUs.
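Here is a minimal sketch of the arithmetic behind that table. It assumes a flat ~70GB of KV cache at 128K context and batch size 1 (the gap between the two memory columns) and decimal gigabytes; the function names are mine, not from any library.

```python
import math

PARAMS = 405e9        # parameter count stated above
KV_CACHE_GB = 70      # assumption: ~70GB at 128K context, batch 1 (880GB - 810GB in the table)
GPU_MEMORY_GB = 80    # A100 / H100 80GB

# Bytes per parameter for each precision in the table.
BYTES_PER_PARAM = {"FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "4-bit": 0.5}


def weights_gb(precision: str) -> float:
    """Weight memory in GB (decimal, 1 GB = 1e9 bytes)."""
    return PARAMS * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(precision: str) -> int:
    """Smallest GPU count whose combined memory holds weights plus KV cache."""
    total_gb = weights_gb(precision) + KV_CACHE_GB
    return math.ceil(total_gb / GPU_MEMORY_GB)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weights_gb(precision):.1f} GB weights, "
              f"{min_gpus(precision)} GPUs minimum")
```

This ignores activation memory and framework overhead, so treat the GPU counts as a floor rather than a sizing guide.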

The four configurations I tested/priced

Config 1: AWS (8x A100 80GB via p4d.24xlarge)

| Item | Cost |
|------|------|
| Instance type | p4d.24xlarge |
| GPUs | 8x NVIDIA A100 80GB |
| On-demand price | $32.77/hour |
| Monthly (24/7) | ~$23,594 |
| Monthly (50% utilization) | ~$11,797 |
| Throughput (INT8, vLLM) | ~18 tokens/sec (batch 1), ~52 tokens/sec (batch 8) |

Source: AWS pricing page, community throughput reports for 405B on p4d instances.

Config 2: Lambda Labs (8x A100 80GB)

| Item | Cost |
|------|------|
| GPU type | 8x A100 80GB |
| Hourly price | $10.32/hour |
| Monthly (24/7) | ~$7,430 |
| Monthly (50% utilization) | ~$3,715 |
| Throughput (INT8, vLLM) | ~18 tokens/sec (batch 1), ~52 tokens/sec (batch 8) |

Source: Lambda Labs pricing, same throughput as AWS (same GPU).

Config 3: RunPod (8x H100 80GB)

| Item | Cost |
|------|------|
| GPU type | 8x H100 80GB |
| Hourly price | $23.60/hour |
| Monthly (24/7) | ~$16,992 |
| Monthly (50% utilization) | ~$8,496 |
| Throughput (FP16, vLLM) | ~30 tokens/sec (batch 1), ~95 tokens/sec (batch 8) |

Source: RunPod pricing, H100 throughput estimates.

Config 4: RunPod (4x A100 80GB, 4-bit quantized)

| Item | Cost |
|------|------|
| GPU type | 4x A100 80GB |
| Hourly price | $5.80/hour |
| Monthly (24/7) | ~$4,176 |
| Monthly (50% utilization) | ~$2,088 |
| Throughput (4-bit AWQ, vLLM) | ~10 tokens/sec (batch 1), ~28 tokens/sec (batch 8) |
| Quality penalty | ~2-4% on most benchmarks |

Source: RunPod pricing, community 4-bit throughput reports.
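The monthly figures in all four tables come from the same conversion, sketched below. It assumes a 30-day month (720 hours) and that "50% utilization" means the instance is shut down, and not billed, for the other half of the month. The dictionary keys are just labels I chose.

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours, matching the monthly figures in the tables

# Hourly prices from the four configurations above.
HOURLY_USD = {
    "aws_8xa100": 32.77,
    "lambda_8xa100": 10.32,
    "runpod_8xh100": 23.60,
    "runpod_4xa100_4bit": 5.80,
}


def monthly_cost(hourly_usd: float, utilization: float = 1.0) -> float:
    """Monthly rental cost, billed only for the fraction of hours the instance is up."""
    return hourly_usd * HOURS_PER_MONTH * utilization


if __name__ == "__main__":
    for name, hourly in HOURLY_USD.items():
        print(f"{name}: ${monthly_cost(hourly):,.0f} at 24/7, "
              f"${monthly_cost(hourly, 0.5):,.0f} at 50% utilization")
```

If your provider bills for reserved capacity whether or not the instance is running, the 50% column does not apply and you pay the 24/7 figure.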

The cost per million tokens

| Configuration | Monthly cost (24/7) | Tokens/sec (batch 8) | Cost/M output tokens | vs GPT-4o ($15/M) |
|---------------|---------------------|----------------------|----------------------|-------------------|
| AWS 8xA100 | $23,594 | 52 | $175.05 | 11.7x more expensive |
| Lambda Labs 8xA100 | $7,430 | 52 | $55.13 | 3.7x more expensive |
| RunPod 8xH100 | $16,992 | 95 | $69.01 | 4.6x more expensive |
| RunPod 4xA100 (4-bit) | $4,176 | 28 | $57.54 | 3.8x more expensive |
| API: Together AI | Pay per use | N/A | $3.50 | 4.3x cheaper |
| API: Fireworks AI | Pay per use | N/A | $3.00 | 5x cheaper |

Source: My calculations. Cost/M tokens = (hourly cost / tokens per second / 3600) * 1,000,000.
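The formula in the source line is short enough to write out directly. This sketch plugs in the hourly prices and batch 8 throughput figures from the configuration tables; it assumes the GPUs produce tokens at that rate for every billed hour, which is the best case.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_sec: float) -> float:
    """Cost/M tokens = (hourly cost / tokens per second / 3600) * 1,000,000."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# (hourly price in USD, batch 8 tokens/sec) from the configuration tables.
CONFIGS = {
    "AWS 8xA100": (32.77, 52),
    "Lambda Labs 8xA100": (10.32, 52),
    "RunPod 8xH100": (23.60, 95),
    "RunPod 4xA100 (4-bit)": (5.80, 28),
}

if __name__ == "__main__":
    for name, (hourly, tps) in CONFIGS.items():
        print(f"{name}: ${cost_per_million_tokens(hourly, tps):.2f} per million output tokens")
```

Because throughput sits in the denominator, the result is very sensitive to batch size: doubling sustained tokens/sec halves the cost per token.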

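Self-hosted on Lambda Labs at batch 8: ($10.32 / 52 / 3600) * 1,000,000 ≈ $55 per million output tokens vs $15 for GPT-4o. That's about 3.7x more expensive per token, for roughly 95% of the quality (based on my evaluation from last week). To match $15/M, the same node would have to sustain about 190 tokens/sec, well above the batch 8 figure.

And look at the monthly fixed costs. Even the cheapest option (4x A100 on RunPod, 4-bit quantized) costs $4,176/month whether or not it is serving traffic. You need to be generating a lot of tokens to justify that.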

The break-even analysis

At what volume does self-hosting beat the API?

| Comparison | Self-hosted cost (Lambda 8xA100) | API cost | Break-even volume |
|------------|----------------------------------|----------|-------------------|
| vs GPT-4o ($15/M out) | $7,430/month fixed | $15/M tokens | ~495M output tokens/month |
| vs Together AI 405B ($3.50/M) | $7,430/month fixed | $3.50/M tokens | ~2.1B output tokens/month |
| vs Fireworks AI 405B ($3.00/M) | $7,430/month fixed | $3.00/M tokens | ~2.5B output tokens/month |
| vs GPT-4o mini ($0.60/M out) | $7,430/month fixed | $0.60/M tokens | ~12.4B output tokens/month (effectively never) |

Source: My calculations.

If you're comparing against GPT-4o's $15/M output pricing, self-hosting breaks even at $7,430 / $15 per million ≈ 495 million output tokens per month. That's a high bar. A node producing 52 tokens/sec around the clock generates about 135 million tokens in a 30-day month, so at the batch 8 throughput above, a single Lambda Labs node cannot reach break-even at all. It would need to sustain roughly 190 tokens/sec all month.

Against hosted Llama 3.1 405B APIs ($3.00-3.50/M), the break-even is about 2.1-2.5 billion tokens/month, or roughly 820-960 tokens/sec sustained. Most teams won't cross this threshold.

Against GPT-4o mini ($0.60/M): don't bother self-hosting for cost savings. Break-even is around 12.4 billion tokens/month, so the self-hosted 405B model only makes sense there if your volume is enormous and you need the quality of a 405B model specifically.
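The break-even arithmetic is the fixed monthly cost divided by the API price per million tokens. The sketch below also converts that volume into the sustained throughput one node would need, assuming a 30-day month and a single Lambda Labs node at $7,430/month; the helper names are my own.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def break_even_tokens(fixed_monthly_usd: float, api_usd_per_million: float) -> float:
    """Monthly output tokens at which the API bill equals the fixed self-hosting cost."""
    return fixed_monthly_usd / api_usd_per_million * 1_000_000


def required_tokens_per_sec(tokens_per_month: float) -> float:
    """Sustained throughput one node needs to produce that volume in a month."""
    return tokens_per_month / SECONDS_PER_MONTH


# API output prices per million tokens, from the table above.
API_PRICES = {
    "GPT-4o": 15.00,
    "Together AI 405B": 3.50,
    "Fireworks AI 405B": 3.00,
    "GPT-4o mini": 0.60,
}

if __name__ == "__main__":
    for name, price in API_PRICES.items():
        volume = break_even_tokens(7430, price)
        print(f"vs {name}: {volume / 1e6:,.0f}M tokens/month, "
              f"{required_tokens_per_sec(volume):,.0f} tokens/sec sustained")
```

Comparing the required tokens/sec with what your node actually sustains is a quick check on whether break-even is reachable on one instance at all.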

The hidden costs

The numbers above assume you just rent GPUs and run inference. In reality:

| Hidden cost | Monthly estimate | Notes |
|-------------|------------------|-------|
| DevOps/SRE time | $2,000-8,000 | Someone needs to manage the infrastructure |
| Monitoring & alerting | $100-500 | Prometheus, Grafana, PagerDuty |
| Load balancing | $200-500 | For production multi-instance setup |
| Backup & redundancy | $3,000-15,000 | Second instance for failover |
| Model updates | 4-8 hours per update | New Llama versions, vLLM updates |

When you add DevOps time and redundancy, the true cost of self-hosting roughly doubles at the low end and more than quadruples at the high end. A team that thought they were spending $7,430/month is actually spending about $12,700-31,400/month when you factor in human time and reliability requirements.
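To see how the hidden costs stack up, here is a rough sketch that sums the dollar-denominated rows of the table on top of the GPU rental. It leaves out model updates, since that row is given in hours rather than dollars, and the dictionary keys are just labels I made up.

```python
# (low, high) monthly estimates in USD, from the hidden-cost table above.
HIDDEN_COSTS_USD = {
    "devops_sre": (2_000, 8_000),
    "monitoring_alerting": (100, 500),
    "load_balancing": (200, 500),
    "backup_redundancy": (3_000, 15_000),
}


def total_monthly_range(gpu_monthly_usd: int) -> tuple[int, int]:
    """Low and high all-in monthly cost: GPU rental plus every hidden-cost row."""
    low = gpu_monthly_usd + sum(lo for lo, _ in HIDDEN_COSTS_USD.values())
    high = gpu_monthly_usd + sum(hi for _, hi in HIDDEN_COSTS_USD.values())
    return low, high


if __name__ == "__main__":
    low, high = total_monthly_range(7_430)  # Lambda Labs 8x A100, 24/7
    print(f"All-in monthly cost: ${low:,} to ${high:,}")
```

Swap in your own salary and redundancy numbers; the spread between low and high is dominated by the failover instance and the DevOps line.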

My recommendation

| Your situation | Best option | Estimated monthly cost |
|----------------|-------------|------------------------|
| Under 1M tokens/month | GPT-4o or Claude 3.5 Sonnet API | Under $50 |
| 1-10M tokens/month | Hosted Llama 3.1 API (Fireworks/Together) | $3-35 |
| 10-100M tokens/month | Hosted Llama 3.1 API (Fireworks/Together) | $30-350 |
| 100M+ tokens/month | Hosted Llama 3.1 API until roughly 2B tokens/month; self-hosted (8x A100 or 8x H100, multiple instances) beyond that, if each node can sustain around 1,000 tokens/sec | $300+ (API) or $7,430-17,000+ per instance + ops |
| Data must stay on-prem | Self-hosted (any config) | $4,176+ |
| Budget matters most, quality can be lower | GPT-4o mini API | Under $10 |

The sweet spot for most teams is the hosted API tier. $3.00/M tokens from Fireworks AI gives you GPT-4-class quality with zero infrastructure burden. Self-hosting only makes sense at high volume or when you have strict data residency requirements.

The open source revolution isn't just about quality parity. It's about choice. You can pay $15/M for GPT-4o, $3/M for a hosted open model, or rent the GPUs and run it yourself, where the cost per token depends on how much throughput you can sustain. Similar quality, three different cost structures, three different trade-offs.

My spreadsheet now has a "deployment recommendation" column. It's the most useful column I've added all year.

-- dataku