Pricing Watch | August 28, 2023 | 5 min read

The real cost of training Llama 2: Meta's numbers vs my estimates

Meta says Llama 2 70B used 1.7M GPU hours of A100 time. At current AWS on-demand prices, that's roughly $5.3M. But Meta used its own hardware. I estimated the real cost, and it's probably about two-thirds less.

The Llama 2 paper gives us more training details than almost any other frontier model paper, including GPU hours. That means I can estimate the actual training cost and compare it to what you or I would pay on cloud.

Let me do the math.

Meta's published numbers

From the Llama 2 paper, Table 16:

| Model | Parameters | GPU type | GPU hours | Training tokens |
|-------|-----------|----------|-----------|----------------|
| Llama 2 7B | 6.7B | A100-80GB | 184,320 | 2.0T |
| Llama 2 13B | 13.0B | A100-80GB | 368,640 | 2.0T |
| Llama 2 34B | 33.7B | A100-80GB | 1,038,336 | 2.0T |
| Llama 2 70B | 65.2B | A100-80GB | 1,720,320 | 2.0T |

Source: Llama 2 paper, Section 2 and Appendix.

1.72 million GPU hours for the 70B model. That's an impressively large number, and an impressively precise one. Meta doesn't normally share this kind of detail.

Cloud cost estimate

If you rented A100-80GB GPUs from major cloud providers:

| Provider | $/hour (A100-80GB) | Llama 2 70B cost | Source |
|----------|-------------------|--------------------|--------|
| AWS p4d.24xlarge (8x A100) | $3.06/GPU* | $5.26M | AWS on-demand pricing |
| Google Cloud a2-ultragpu-8g | $3.54/GPU* | $6.09M | GCP on-demand pricing |
| Lambda Labs | $1.10/GPU | $1.89M | Lambda Labs pricing |
| CoreWeave | $2.06/GPU | $3.54M | CoreWeave pricing |

*AWS and GCP pricing is per-instance, divided by 8 GPUs. Prices as of August 2023.

At full AWS on-demand pricing, Llama 2 70B would cost $5.26 million to train. On Lambda Labs (the cheapest major provider), $1.89 million.
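The arithmetic behind the table is just GPU hours times the hourly rate. Here is a minimal Python sketch using the per-GPU rates quoted above; the function and variable names are mine, for illustration only.

```python
GPU_HOURS_70B = 1_720_320  # Llama 2 70B GPU hours, from Meta's published numbers

# Per-GPU hourly rates in USD, as quoted in the table above (August 2023)
RATES = {
    "AWS p4d.24xlarge": 3.06,
    "Google Cloud a2-ultragpu-8g": 3.54,
    "Lambda Labs": 1.10,
    "CoreWeave": 2.06,
}


def cloud_cost(gpu_hours: float, rate_per_gpu_hour: float) -> float:
    """Rental cost in USD, ignoring discounts, storage, and egress."""
    return gpu_hours * rate_per_gpu_hour


costs = {provider: cloud_cost(GPU_HOURS_70B, rate) for provider, rate in RATES.items()}
for provider, cost in costs.items():
    print(f"{provider}: ${cost / 1e6:.2f}M")
```

This ignores reserved-instance discounts, failed runs, and restarts, so treat it as a floor for on-demand list prices rather than a full bill.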

But Meta didn't use cloud GPUs.

What Meta actually paid

Meta owns their GPU infrastructure. Their capital expenditure reports and data center filings give us enough information to estimate the true cost:

The hardware cost:

  • Meta reportedly has over 21,000 A100 GPUs in their Research SuperCluster (RSC)
  • An A100-80GB costs roughly $15,000-$20,000 wholesale
  • Amortized over 3-4 years of useful life: ~$450-$570/month per GPU, or ~$0.62-$0.78/hour
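The monthly-to-hourly step in the last bullet can be sketched as follows. This assumes about 730 hours per month, meaning the GPU is busy around the clock, and the helper names are hypothetical. A plain straight-line amortization helper is included for comparison.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12, i.e. 100% utilization


def straight_line_monthly(price_usd: float, lifetime_years: float) -> float:
    """Monthly cost if the purchase price is spread evenly over the lifetime."""
    return price_usd / (lifetime_years * 12)


def monthly_to_hourly(monthly_usd: float) -> float:
    """Convert a monthly cost to an hourly one, assuming the GPU never idles."""
    return monthly_usd / HOURS_PER_MONTH


# The monthly range quoted above, converted to an hourly rate
low_hourly = monthly_to_hourly(450)
high_hourly = monthly_to_hourly(570)
print(f"${low_hourly:.2f}-${high_hourly:.2f} per GPU hour")

# Straight-line amortization of a $20,000 GPU over 3 years, for comparison
print(f"${straight_line_monthly(20_000, 3):.0f}/month")
```

The result is sensitive to both inputs: a shorter useful life or lower utilization pushes the hourly figure up, and a longer life or cheaper purchase price pulls it down.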

The electricity cost:

  • A100-80GB TDP: 400W
  • With cooling overhead (PUE ~1.1 for Meta data centers): ~440W
  • At $0.06/kWh (Meta's approximate rate): $0.026/hour per GPU
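The electricity figure follows directly from those three inputs. This is a small sketch that assumes the GPU draws its full TDP for the entire run, which is a simplification; the names are illustrative.

```python
TDP_WATTS = 400        # A100-80GB TDP, from the list above
PUE = 1.1              # power usage effectiveness (cooling overhead), as assumed above
PRICE_PER_KWH = 0.06   # approximate electricity rate in USD, as assumed above
GPU_HOURS_70B = 1_720_320


def electricity_cost_per_gpu_hour(tdp_watts: float, pue: float, price_per_kwh: float) -> float:
    """USD per GPU hour, assuming constant draw at TDP."""
    kilowatts = tdp_watts * pue / 1000
    return kilowatts * price_per_kwh


hourly = electricity_cost_per_gpu_hour(TDP_WATTS, PUE, PRICE_PER_KWH)
print(f"${hourly:.4f} per GPU hour")
print(f"${hourly * GPU_HOURS_70B:,.0f} for the full 70B run")
```

Under these assumptions, electricity for the whole 70B run comes to roughly $45K, which is why it barely moves the total.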

The networking/facility cost:

  • Data center overhead: roughly $0.10-$0.15/hour per GPU (staff, maintenance, networking)

Adding it up, with a similar per-GPU allowance for engineering time:

| Cost component | $/hour per GPU | % of total |
|---------------|---------------|------------|
| Hardware (amortized) | $0.62-$0.78 | ~71% |
| Electricity | $0.026 | ~3% |
| Facility/overhead | $0.10-$0.15 | ~13% |
| Engineering time | $0.10-$0.15 | ~13% |
| Total | $0.85-$1.10 | 100% |

Sources: NVIDIA GPU pricing, Meta 10-K filings (capex data), SemiAnalysis data center cost models, US industrial electricity rates.

So Meta's real cost is roughly $0.85-$1.10 per GPU hour. Compare that to $3.06 on AWS.
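To make the per-GPU-hour total reproducible, here is a short sketch that sums the low and high ends of each component and works out each component's share at the midpoint of its range. Using the midpoint is my choice for illustration; the dictionary layout is hypothetical.

```python
# (low, high) estimates in USD per GPU hour, taken from the component table
COMPONENTS = {
    "Hardware (amortized)": (0.62, 0.78),
    "Electricity": (0.026, 0.026),
    "Facility/overhead": (0.10, 0.15),
    "Engineering time": (0.10, 0.15),
}

total_low = sum(low for low, _ in COMPONENTS.values())
total_high = sum(high for _, high in COMPONENTS.values())
print(f"Total: ${total_low:.3f}-${total_high:.3f} per GPU hour")

# Share of each component, evaluated at the midpoint of its range
midpoints = {name: (low + high) / 2 for name, (low, high) in COMPONENTS.items()}
total_mid = sum(midpoints.values())
shares = {name: mid / total_mid for name, mid in midpoints.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Hardware amortization dominates whichever end of the range you pick, so the purchase price and useful-life assumptions matter far more than the electricity rate.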

The real training cost

| Model | GPU hours | Meta's cost (est.) | AWS cost | Savings |
|-------|-----------|-------------------|----------|---------|
| Llama 2 7B | 184,320 | $157K-$203K | $564K | 64-72% |
| Llama 2 13B | 368,640 | $313K-$406K | $1.13M | 64-72% |
| Llama 2 34B | 1,038,336 | $883K-$1.14M | $3.18M | 64-72% |
| Llama 2 70B | 1,720,320 | $1.46M-$1.89M | $5.26M | 64-72% |

My best estimate for Llama 2 70B: $1.5-$1.9 million. That's about 64-72% less than what it would cost on AWS.
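The comparison can be reproduced with a few lines of Python. This is a sketch under the estimates above ($0.85-$1.10 per GPU hour in-house, $3.06 on AWS); the names are illustrative.

```python
META_RATE = (0.85, 1.10)  # estimated in-house cost range, USD per GPU hour
AWS_RATE = 3.06           # AWS on-demand, USD per GPU hour

GPU_HOURS = {
    "Llama 2 7B": 184_320,
    "Llama 2 13B": 368_640,
    "Llama 2 34B": 1_038_336,
    "Llama 2 70B": 1_720_320,
}


def savings_vs_aws(in_house_rate: float, aws_rate: float = AWS_RATE) -> float:
    """Fraction saved relative to AWS at the given in-house rate."""
    return 1 - in_house_rate / aws_rate


# (in-house low, in-house high, AWS) cost in USD for each model
rows = {
    model: (hours * META_RATE[0], hours * META_RATE[1], hours * AWS_RATE)
    for model, hours in GPU_HOURS.items()
}
for model, (low, high, aws) in rows.items():
    print(f"{model}: ${low:,.0f}-${high:,.0f} in-house vs ${aws:,.0f} on AWS")

savings_low = savings_vs_aws(META_RATE[1])   # most expensive in-house case
savings_high = savings_vs_aws(META_RATE[0])  # cheapest in-house case
print(f"Savings vs AWS: {savings_low:.0%}-{savings_high:.0%}")
```

Because both costs scale linearly with GPU hours, the savings percentage depends only on the two hourly rates and is the same for every model size.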

Roughly $2 million to train one of the best open source models ever released. For a company that spent $32 billion on capex in 2022, this is basically a rounding error.

How this compares to other models

| Model | Estimated training cost | Source |
|-------|------------------------|--------|
| GPT-3 (175B) | $4.6M (cloud prices) | OpenAI paper, Lambda Labs estimate |
| PaLM (540B) | $10-15M (Google internal) | Community estimates, Google hardware |
| Chinchilla (70B) | $2-3M (DeepMind internal) | DeepMind paper, estimated |
| Llama 2 70B | $1.5-1.9M (Meta internal) | My estimate above |
| Llama 2 70B | $5.3M (cloud, hypothetical) | AWS pricing |
| BLOOM (176B) | $2-5M (donated compute) | BigScience project reports |

Sources: Model papers on arXiv, Epoch AI compute trends, SemiAnalysis reports, my estimates.

Llama 2 70B is one of the cheapest frontier-competitive models ever trained, relative to its performance: $1.5-1.9M for a model that, per Meta's own evaluations, comes close to GPT-3.5 on benchmarks such as MMLU and GSM8K, though it trails on coding. Open-sourcing it makes sense once you realize the training cost was negligible by Meta's standards.

Why this matters

For startups: Training your own Llama 2 competitor from scratch would cost $1.9M on Lambda Labs, $5.3M on AWS. That's achievable for a well-funded startup, but it's not cheap. Fine-tuning Llama 2 (which is what most people should do) costs a tiny fraction of that, often under $10,000.

For the market: Meta can afford to give away a $2M model because the strategic value (developer community, reduced dependence on OpenAI, positive press) exceeds the cost. Google, Amazon, and Microsoft can all afford the same math. Expect more "free" models from big tech.

For the industry: The $2M number will keep dropping. Better training techniques, more efficient architectures, cheaper hardware. By 2024, I expect a GPT-3.5-equivalent model to cost under $500K to train from scratch.

The cost curve is heading in one direction. And when training a good model costs less than a senior engineer's annual salary, the dynamics of the entire industry change.



-- dataku
