Pricing Watch · February 14, 2022 · 5 min read

The cost of running an AI startup in 2022: a data breakdown

I surveyed 23 AI startup founders about their cloud compute bills. The median monthly GPU spend is $14,000. One is paying $200,000/month. The variance is absurd.

I spent January emailing AI startup founders and asking them an impolite question: how much do you spend on GPUs?

23 of them answered. The numbers are all over the place, and that's exactly what makes them interesting.

The survey

I reached out to 61 founders of AI startups (seed to Series B) through a mix of Twitter DMs, Discord communities, and cold emails. 23 responded with specific numbers, a response rate of about 38%. All responses are anonymized, because nobody wanted their investors seeing these figures next to their company name.

The question was simple: "What is your total monthly spend on cloud compute (GPU/TPU instances) for model training and inference, as of January 2022?"

The distribution

| Metric | Amount |
|--------|--------|
| Minimum | $800/month |
| 25th percentile | $4,200/month |
| Median | $14,000/month |
| 75th percentile | $38,000/month |
| Maximum | $200,000/month |
| Mean | $29,400/month |

The mean is more than double the median (about 2.1x), which shows how skewed this distribution is. Two companies at the top ($200K and $120K/month) are pulling the average way up. The middle half of respondents spend between $4,200 and $38,000 a month on compute.

But that range hides an important detail: the type of AI work you're doing determines your bill more than your company size.
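To put a number on how much the two top spenders move the average, here is a minimal Python sketch. It uses only figures reported in the distribution table (the mean, the median, the respondent count, and the two largest bills). The function name `mean_without` is my own, and because the reported mean is rounded, the result is approximate.

```python
# Figures taken from the survey; everything else is derived from them.
N_RESPONDENTS = 23
MEAN_SPEND = 29_400            # reported mean, $/month (rounded)
MEDIAN_SPEND = 14_000          # reported median, $/month
TOP_TWO = [200_000, 120_000]   # the two largest reported bills, $/month


def mean_without(top, mean, n):
    """Mean of the remaining respondents after removing the `top` values."""
    total = mean * n
    return (total - sum(top)) / (n - len(top))


rest_mean = mean_without(TOP_TWO, MEAN_SPEND, N_RESPONDENTS)
print(f"mean / median ratio: {MEAN_SPEND / MEDIAN_SPEND:.2f}")
print(f"mean of the other {N_RESPONDENTS - len(TOP_TWO)}: ${rest_mean:,.0f}/month")
```

Without the two largest bills, the mean of the remaining 21 respondents falls to roughly $17,000/month, much closer to the median.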

Spend by startup type

| Startup type | Count | Median monthly spend | Typical GPU |
|-------------|-------|---------------------|-------------|
| LLM-based products (API) | 6 | $8,500 | N/A (use OpenAI/Cohere API) |
| Computer vision | 5 | $18,000 | A100 40GB |
| Custom model training | 4 | $52,000 | A100 80GB clusters |
| NLP (non-LLM) | 4 | $6,200 | V100, T4 |
| Speech/audio | 2 | $22,000 | A100 40GB |
| Multi-modal | 2 | $160,000 | A100 80GB clusters |

The startups building on top of existing APIs (like OpenAI or Cohere) spend the least on compute. Their cost is API calls, not hardware. The $200K/month outlier is training a custom multi-modal model from scratch.

Where the money goes

I asked respondents to break down their compute spend by category.

| Category | Median share of spend |
|----------|----------------------|
| Model training | 45% |
| Inference (production) | 30% |
| Experimentation/dev | 15% |
| Data processing/ETL | 10% |

Training dominates, but production inference is catching up for the more mature startups. The two companies spending over $100K/month both said inference is now their largest cost because they have real users generating real requests.

Cloud provider breakdown

| Provider | Respondents using provider | Median spend on that provider |
|----------|----------------------|------------------------------|
| AWS | 14 | $16,000 |
| Google Cloud | 8 | $12,000 |
| Azure | 5 | $21,000 |
| Lambda Labs | 4 | $5,600 |
| CoreWeave | 3 | $8,000 |

Many startups use more than one provider, which is why the counts above sum to 34 across 23 respondents. AWS is the most common, but Google Cloud users reported slightly lower costs for similar workloads, likely because of TPU pricing.

Lambda Labs and CoreWeave are the interesting ones. They're GPU-cloud specialists that offer A100 instances at 30-40% below AWS pricing. Four respondents have moved their training workloads there while keeping inference on AWS.
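As a rough illustration of what that 30-40% gap could mean in dollars, the sketch below applies it to the training portion of a bill only, since respondents kept inference on AWS. The inputs are assumptions on my part: it combines the survey's median monthly spend ($14,000) with the median training share (45%), which describes no single respondent, and it ignores egress and migration costs. `training_savings` is a hypothetical helper name.

```python
def training_savings(monthly_spend, training_share, discount_low, discount_high):
    """Return (low, high) monthly savings from moving only the training
    portion of a bill to a provider priced `discount_*` below the current one."""
    training = monthly_spend * training_share
    return training * discount_low, training * discount_high


# Illustrative inputs: survey median spend and median training share.
low, high = training_savings(14_000, 0.45, 0.30, 0.40)
print(f"${low:,.0f} to ${high:,.0f} saved per month")
```

Under those assumptions the saving is on the order of $1,900 to $2,500 a month, before any data transfer fees for moving datasets to the new provider.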

The $200K/month founder

I asked the highest-spending founder (call them Founder X) for more detail. Here's their rough breakdown:

| Line item | Monthly cost |
|-----------|-------------|
| 8x A100 80GB cluster (training) | $96,000 |
| 4x A100 40GB instances (inference) | $48,000 |
| Storage (multi-modal datasets) | $22,000 |
| Networking/egress | $18,000 |
| Dev/experimentation instances | $16,000 |
| Total | $200,000 |

Their burn rate on compute alone is $2.4M/year. They raised a $12M Series A. That means a single year of GPU rental and related cloud costs consumes 20% of their entire fundraise. Founder X's exact words were "the math is terrifying but the model is working."
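The arithmetic behind Founder X's numbers can be checked in a few lines of Python. The line items and the $12M raise come from the breakdown above; the `months_of_compute` figure is my own addition and assumes, unrealistically, a flat bill and a raise spent on nothing but compute.

```python
# Founder X's monthly line items, in dollars, as reported.
line_items = {
    "8x A100 80GB cluster (training)": 96_000,
    "4x A100 40GB instances (inference)": 48_000,
    "Storage (multi-modal datasets)": 22_000,
    "Networking/egress": 18_000,
    "Dev/experimentation instances": 16_000,
}
SERIES_A = 12_000_000

monthly_total = sum(line_items.values())
annual = monthly_total * 12
share_of_raise = annual / SERIES_A
# Upper bound only: assumes a flat bill and no other expenses.
months_of_compute = SERIES_A / monthly_total

# Each line item as a share of the monthly bill.
for name, cost in line_items.items():
    print(f"{name:<38} {cost / monthly_total:6.1%}")

print(f"monthly total: ${monthly_total:,}")
print(f"annual compute: ${annual:,} ({share_of_raise:.0%} of the raise)")
print(f"months of compute the raise could fund: {months_of_compute:.0f}")
```

The line items do sum to $200,000, training alone is 48% of the bill, and the raise would cover 60 months of compute if it paid for nothing else.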

What surprised me

Three things I didn't expect from the data.

First, networking and egress costs. Multiple respondents flagged data transfer fees as a hidden budget killer. Moving large training datasets between regions or providers costs more than people plan for. One founder said egress fees added 12% on top of their compute bill.

Second, the API-based startups are spending more than I expected on OpenAI's API. The median was $8,500/month, but that's pure inference cost with zero infrastructure management. For early-stage companies, that trade-off makes sense. You're paying a premium per token but saving on DevOps headcount.

Third, nobody is using spot instances for training as much as the cloud providers want them to. Only 3 of 23 respondents use spot/preemptible instances for training workloads. The reason: training runs get interrupted, and restoring from checkpoints wastes time and money. Most prefer paying full price for reliability.
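The trade-off respondents described can be framed as a simple break-even calculation. This is a sketch under stated assumptions, not a pricing model: the discount and overhead values are hypothetical placeholders, not survey data or real provider prices, and the function names are my own. It counts only GPU dollars and ignores the calendar time and engineering effort that respondents also cited.

```python
def spot_cost_ratio(discount, rework_overhead):
    """Spot cost as a fraction of on-demand cost for the same training run.

    discount: fractional price cut versus on-demand (0.5 means half price).
    rework_overhead: extra compute redone after interruptions, as a fraction
        of the uninterrupted run (0.25 means 25% more GPU hours).
    """
    return (1 - discount) * (1 + rework_overhead)


def break_even_overhead(discount):
    """Rework overhead at which spot and on-demand cost the same."""
    return discount / (1 - discount)


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
ratio = spot_cost_ratio(discount=0.5, rework_overhead=0.25)
limit = break_even_overhead(discount=0.5)
print(f"spot costs {ratio:.1%} of on-demand; break-even overhead is {limit:.0%}")
```

On compute dollars alone, a deep enough discount tolerates a lot of redone work, which suggests the reluctance is driven at least as much by lost time and operational hassle as by the bill itself.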

The uncomfortable bottom line

If you're starting an AI company in 2022 and training your own models, plan for $10K-50K/month in compute from day one. If you're building on top of APIs, plan for $3K-15K/month once you have real users.

These numbers will change. Compute costs drop over time, new GPUs launch, and competition between cloud providers increases. But right now, in February 2022, this is what the data says.

The GPU bill is the new rent.



-- dataku
