Data StoriesNovember 4, 20247 min read

The state of AI APIs: speed, cost, and reliability across 15 providers

I monitored 15 AI API providers for 30 days straight, logging latency, error rates, and uptime. The results are a mess. Anthropic has the best uptime. Groq has the best speed. Nobody has both.

For the past 30 days, I've had a cron job pinging 15 AI API providers every 5 minutes with a standardized prompt. That's 8,640 requests per provider. 129,600 requests total.

I measured three things: time to first token (TTFT), tokens per second, and error rate. Here's every number.

The methodology

Every 5 minutes, my script sends the same prompt to every provider: "Explain the concept of entropy in information theory in 200 words." I log:

  • Time to first token (TTFT): how long until the first byte of the response
  • Tokens per second: output generation speed
  • HTTP status code: 200 = success, anything else = error
  • Total response time: end-to-end latency

30 days. October 4 to November 3, 2024. From a server in Virginia (us-east-1).

The speed rankings

| Rank | Provider | Model tested | Median TTFT | Median tok/sec | p99 TTFT | |------|----------|-------------|------------|---------------|---------| | 1 | Groq | Llama 3.1 70B | 28ms | 340 | 89ms | | 2 | Fireworks AI | Llama 3.1 70B | 112ms | 142 | 342ms | | 3 | Cerebras | Llama 3.1 70B | 45ms | 580 | 128ms | | 4 | Together AI | Llama 3.1 70B | 156ms | 98 | 487ms | | 5 | Anthropic | Claude 3.5 Sonnet | 324ms | 82 | 1,240ms | | 6 | OpenAI | GPT-4o | 289ms | 78 | 890ms | | 7 | Google | Gemini 1.5 Flash | 198ms | 145 | 678ms | | 8 | Mistral AI | Mistral Large 2 | 412ms | 62 | 1,890ms | | 9 | Perplexity | pplx-70b | 234ms | 68 | 756ms | | 10 | Anyscale | Llama 3.1 70B | 267ms | 76 | 1,120ms | | 11 | OpenAI | GPT-4o mini | 178ms | 128 | 534ms | | 12 | Replicate | Llama 3.1 70B | 1,450ms | 48 | 4,200ms | | 13 | Baseten | Llama 3.1 70B | 890ms | 56 | 3,100ms | | 14 | Modal | Llama 3.1 70B | 2,100ms | 42 | 6,800ms | | 15 | Anthropic | Claude 3 Haiku | 142ms | 168 | 456ms |

Source: My 30-day monitoring, 8,640 requests per provider, October-November 2024.

Groq and Cerebras are in a different league on speed. Groq at 340 tok/sec and Cerebras at 580 tok/sec are 4-14x faster than the big three (OpenAI, Anthropic, Google). The custom silicon inference companies are proving their thesis.

Replicate and Modal have high cold-start latency (1,450ms and 2,100ms TTFT) because they spin up GPU instances on demand. Once warm, they're competitive. But that initial latency is painful for interactive use.

The reliability rankings

| Rank | Provider | Error rate (30 days) | Estimated uptime | Worst single day | Errors caused by | |------|----------|---------------------|-----------------|-----------------|-----------------| | 1 | Anthropic | 0.12% | 99.88% | 0.8% errors | Rate limits | | 2 | OpenAI (GPT-4o) | 0.34% | 99.66% | 2.1% errors | 503s, rate limits | | 3 | Google | 0.41% | 99.59% | 1.8% errors | Quota exceeded | | 4 | Fireworks AI | 0.52% | 99.48% | 3.2% errors | 429s, timeouts | | 5 | Together AI | 0.68% | 99.32% | 4.1% errors | 502s, model loading | | 6 | Claude 3 Haiku | 0.18% | 99.82% | 0.6% errors | Rate limits | | 7 | Groq | 1.24% | 98.76% | 8.7% errors | Capacity limits | | 8 | Mistral AI | 0.89% | 99.11% | 5.4% errors | 500s, timeouts | | 9 | Perplexity | 0.74% | 99.26% | 3.8% errors | Rate limits | | 10 | Anyscale | 0.92% | 99.08% | 6.2% errors | 502s | | 11 | Replicate | 1.56% | 98.44% | 9.1% errors | Cold starts, timeouts | | 12 | Baseten | 1.82% | 98.18% | 11.2% errors | Cold starts | | 13 | Modal | 2.34% | 97.66% | 14.5% errors | Cold starts, capacity | | 14 | Cerebras | 1.41% | 98.59% | 7.8% errors | Capacity limits | | 15 | GPT-4o mini | 0.28% | 99.72% | 1.4% errors | Rate limits |

Source: My 30-day monitoring data.

Anthropic has the best uptime at 99.88%. Their worst day was 0.8% error rate. That's remarkably consistent. OpenAI at 99.66% is also solid but had a couple of days with 2%+ errors.

Groq (1.24% error rate) trades reliability for speed. On one day, 8.7% of my requests failed due to capacity limits. The fastest provider is not the most reliable.

The "nobody has both" problem

Let me visualize the speed vs reliability trade-off:

| Category | Provider | Speed tier | Reliability tier | |----------|----------|-----------|-----------------| | Fast + reliable | (nobody) | - | - | | Fast + unreliable | Groq, Cerebras | Top tier | Below 99% | | Slow + reliable | Anthropic, OpenAI | Mid tier | 99.5%+ | | Balanced | Fireworks AI, Google | Good | 99.5% | | Slow + unreliable | Modal, Replicate, Baseten | Low tier | Below 99% |

No provider delivers both top-tier speed AND top-tier reliability. The closest is Fireworks AI: 142 tok/sec (good, not great) with 99.48% uptime (solid).

The cost picture (included for completeness)

| Provider | Model | $/M output tokens | Speed (tok/sec) | Uptime | Value rating | |----------|-------|-------------------|----------------|--------|-------------| | Groq | Llama 3.1 70B | $0.79 | 340 | 98.76% | Fast + cheap | | Fireworks AI | Llama 3.1 70B | $0.90 | 142 | 99.48% | Balanced | | Together AI | Llama 3.1 70B | $0.88 | 98 | 99.32% | Reliable + cheap | | OpenAI | GPT-4o | $15.00 | 78 | 99.66% | Premium quality | | Anthropic | Claude 3.5 Sonnet | $15.00 | 82 | 99.88% | Best reliability | | Anthropic | Claude 3 Haiku | $1.25 | 168 | 99.82% | Budget reliable | | Google | Gemini 1.5 Flash | $0.30 | 145 | 99.59% | Cheapest quality | | Cerebras | Llama 3.1 70B | Pay per token | 580 | 98.59% | Fastest raw speed |

Source: My 30-day data, provider pricing pages, November 2024.

What I'd recommend

| Your priority | Best provider | Why | |--------------|--------------|-----| | Maximum reliability | Anthropic (Claude 3.5 Sonnet or Haiku) | 99.82-99.88% uptime, consistent latency | | Maximum speed | Cerebras or Groq | 340-580 tok/sec, accept some downtime | | Best balance | Fireworks AI | 142 tok/sec, 99.48% uptime, $0.90/M | | Cheapest quality | Google Gemini 1.5 Flash | $0.30/M output, fast, 99.59% uptime | | Production critical | Multi-provider with fallback | Route to primary, fall back to secondary on error |

For production applications, I'd use a multi-provider setup: primary on Anthropic or OpenAI, fallback to Fireworks AI or Together AI with an open source model. That gives you 99.99%+ effective uptime.

129,600 requests later, my cron job has earned its keep. The data paints a messy picture: nobody has solved speed + reliability + cost simultaneously. But the options are getting better every month, and my spreadsheet is ready for the next 30-day monitoring window.


If you found this interesting, you might also like:

-- dataku

More from dataku