The inference provider market: latency, cost, and uptime for 20 providers
I expanded my monthly monitoring to 20 providers. The new additions: Cerebras, Fireworks, Baseten, Modal, and Replicate. Cerebras leads on latency. Fireworks leads on cost efficiency. Updated rankings inside.
I've been monitoring AI inference providers since late 2024. This month, I expanded from 15 to 20 providers. Sixty days of continuous monitoring data.
Let me show you the full picture.
The 20 providers ranked
| Rank | Provider | Avg latency (TTFT) | Throughput (tokens/s) | Uptime (60 days) | Cost tier |
|------|----------|--------------------|-----------------------|------------------|-----------|
| 1 | Cerebras | 82ms | 1,847 t/s | 98.2% | $$$ |
| 2 | Groq | 95ms | 812 t/s | 99.1% | $$ |
| 3 | Fireworks AI | 142ms | 420 t/s | 99.4% | $ |
| 4 | Together AI | 168ms | 380 t/s | 99.0% | $ |
| 5 | Anthropic | 245ms | 95 t/s | 99.7% | $$$ |
| 6 | OpenAI | 280ms | 88 t/s | 99.3% | $$$ |
| 7 | Google AI | 195ms | 210 t/s | 99.1% | $$ |
| 8 | Baseten | 210ms | 290 t/s | 98.8% | $$ |
| 9 | Modal | 225ms | 310 t/s | 98.6% | $$ |
| 10 | Replicate | 310ms | 180 t/s | 98.4% | $$ |
| 11 | Perplexity AI | 265ms | 125 t/s | 99.2% | $$ |
| 12 | Mistral AI | 230ms | 145 t/s | 98.9% | $$ |
| 13 | xAI | 290ms | 110 t/s | 98.1% | $$$ |
| 14 | DeepSeek | 340ms | 85 t/s | 97.8% | $ |
| 15 | Anyscale | 255ms | 195 t/s | 98.5% | $$ |
| 16 | Lepton AI | 280ms | 165 t/s | 97.9% | $ |
| 17 | OctoAI | 305ms | 155 t/s | 98.0% | $$ |
| 18 | AWS Bedrock | 350ms | 75 t/s | 99.5% | $$$ |
| 19 | Azure OpenAI | 320ms | 80 t/s | 99.4% | $$$ |
| 20 | Databricks | 380ms | 120 t/s | 98.7% | $$ |
Sources: My monitoring infrastructure, 60-day average (March 20 to May 19, 2025). Latency = time to first token (TTFT) from US East. Throughput = tokens per second for Llama 3.1 70B where available, or equivalent model. Artificial Analysis for cross-reference.
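For readers who want to reproduce the two headline metrics, here is a minimal sketch of how TTFT and throughput can be derived from a streamed response. It is not the monitoring code behind the table: `measure_stream`, the `clock` parameter, and the choice to compute throughput over the generation phase only (excluding the wait for the first token) are assumptions made for illustration. The `tokens` argument stands in for whatever streaming client a given provider offers.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_seconds, tokens_per_second) for one streamed response.

    `tokens` is any iterable that yields tokens as they arrive. It is a
    hypothetical stand-in for a provider's streaming client.
    """
    start = clock()  # taken just before the first token is requested
    first_at = None
    last_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now  # arrival time of the first token
        last_at = now
        count += 1

    if first_at is None:
        return None, None  # empty response: nothing to measure

    ttft = first_at - start
    # Assumption: throughput covers the generation phase only, so the
    # first token marks time zero and is not counted in the numerator.
    gen_time = last_at - first_at
    tps = (count - 1) / gen_time if count > 1 and gen_time > 0 else None
    return ttft, tps
```

Injecting `clock` keeps the function testable with fake timestamps. In a real probe you would average many such samples per provider rather than trust a single request.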
Speed leaders
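Cerebras is in a class of its own. 1,847 tokens per second. Their custom wafer-scale chip processes Llama 3.1 70B at about 2.3x the throughput of Groq (812 t/s) and roughly 19x what I measured on Anthropic's API (95 t/s). Anthropic serves its own models, so that second comparison is not like-for-like.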
| Speed tier | Providers | Tokens/sec range |
|------------|-----------|------------------|
| Ultra-fast (custom silicon) | Cerebras | 1,800+ |
| Fast (custom/optimized) | Groq | 800+ |
| Medium-fast | Fireworks, Together, Modal, Baseten | 290-420 |
| Standard | Most first-party APIs | 75-210 |
The throughput gap between Cerebras (1,847 t/s) and AWS Bedrock (75 t/s) is roughly 25x. For latency-sensitive applications, the choice of provider matters enormously.
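The ratios are easy to recompute from the ranking table. The snippet below is a small sketch that does only that. The `throughput` dictionary copies a handful of rows from the table, and `speedup` is a helper name introduced here for illustration.

```python
# Throughput figures (tokens/s) copied from the ranking table above.
throughput = {
    "Cerebras": 1847,
    "Groq": 812,
    "Fireworks AI": 420,
    "Anthropic": 95,
    "AWS Bedrock": 75,
}


def speedup(fast: str, slow: str) -> float:
    """How many times higher the throughput of `fast` is than `slow`."""
    return throughput[fast] / throughput[slow]


for other in ("Groq", "Fireworks AI", "Anthropic", "AWS Bedrock"):
    print(f"Cerebras vs {other}: {speedup('Cerebras', other):.1f}x")
```

With the numbers above this prints 2.3x for Groq, 4.4x for Fireworks AI, 19.4x for Anthropic, and 24.6x for AWS Bedrock. Keep in mind that not every provider serves the same model, so the larger ratios mix hardware differences with model differences.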
Reliability leaders
| Provider | Uptime (60 days) | Incidents | Avg incident duration |
|----------|------------------|-----------|-----------------------|
| Anthropic | 99.7% | 2 | 52 min |
| AWS Bedrock | 99.5% | 1 | 78 min |
| Azure OpenAI | 99.4% | 2 | 65 min |
| Fireworks AI | 99.4% | 3 | 43 min |
| OpenAI | 99.3% | 4 | 61 min |
Anthropic leads on uptime at 99.7%. Only 2 incidents in 60 days, averaging 52 minutes each.
The cloud giants (AWS, Azure) are close behind, which makes sense given their infrastructure. Fireworks is impressively reliable for a smaller provider.
Cost efficiency (for open source models)
For providers hosting open source models, here is the cost comparison on Llama 3.1 70B:
| Provider | Input ($/M tokens) | Output ($/M tokens) | Speed (t/s) | Verdict |
|----------|--------------------|---------------------|-------------|---------|
| Fireworks AI | $0.20 | $0.90 | 420 | Best value |
| Together AI | $0.20 | $0.88 | 380 | Close second |
| Groq | $0.27 | $0.27 | 812 | Best for speed |
| Cerebras | $0.60 | $0.60 | 1,847 | Premium speed |
| Replicate | $0.32 | $0.65 | 180 | Average |
Sources: Provider pricing pages, May 2025.
Fireworks and Together are nearly tied on cost. Groq offers a good speed-to-cost ratio (its output pricing at $0.27/M is actually cheaper than Fireworks' $0.90/M). Cerebras charges more than Groq but delivers about 2.3x the throughput.
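Which provider is cheapest depends on the mix of input and output tokens, because some providers price the two differently. The sketch below works that out from the pricing table. The two workload mixes (1M in / 1M out, and 10M in / 1M out) are hypothetical examples, and `workload_cost` and `cheapest` are names introduced here, not part of any provider's tooling.

```python
# (input $/M tokens, output $/M tokens) copied from the pricing table above.
prices = {
    "Fireworks AI": (0.20, 0.90),
    "Together AI": (0.20, 0.88),
    "Groq": (0.27, 0.27),
    "Cerebras": (0.60, 0.60),
    "Replicate": (0.32, 0.65),
}


def workload_cost(provider: str, input_m: float, output_m: float) -> float:
    """Dollar cost of a workload, with token counts given in millions."""
    in_price, out_price = prices[provider]
    return input_m * in_price + output_m * out_price


def cheapest(input_m: float, output_m: float) -> str:
    """Provider with the lowest cost for the given token mix."""
    return min(prices, key=lambda p: workload_cost(p, input_m, output_m))


# Two hypothetical mixes: balanced and input-heavy (e.g. long prompts, short answers).
for label, (inp, out) in {"balanced": (1, 1), "input-heavy": (10, 1)}.items():
    winner = cheapest(inp, out)
    print(f"{label}: {winner} at ${workload_cost(winner, inp, out):.2f}")
```

On these list prices, Groq is cheapest for the balanced mix ($0.54) and Together AI edges out Fireworks for the input-heavy one ($2.88 vs $2.90). The crossover with Groq sits at roughly nine input tokens per output token, so the answer for your workload depends on how prompt-heavy it is.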
My takeaway
Nobody wins on all three dimensions (speed, reliability, cost). The best choice depends on your priority:
| Priority | Best provider |
|----------|---------------|
| Raw speed | Cerebras |
| Reliability + quality | Anthropic (own models) |
| Cost for open source | Fireworks / Together |
| Speed + reasonable cost | Groq |
| Enterprise compliance | AWS Bedrock / Azure |
The inference provider market in 2025 looks like the cloud market did in 2015: fragmented, rapidly evolving, and heading toward consolidation. I expect 3-5 winners in each category by 2026.
Sixty days of monitoring, 20 providers, 1.2 million data points. My monitoring bill: $47/month. The data it produces is worth significantly more.
-- dataku