AI API uptime in H1 2025: the reliability report
Six months of continuous monitoring across 15 API providers. Anthropic: 99.7% uptime. OpenAI: 99.3%. Google: 99.1%. The outage patterns are interesting. Mondays and Thursdays are the worst days. I have theories about why.
I've been running automated uptime checks against 15 AI API providers every 5 minutes since January 1, 2025. At 288 checks per day over 181 days, that's 52,128 scheduled checks per provider.
Here's the full reliability report.
H1 2025 uptime rankings
| Rank | Provider | Uptime | Total downtime | Incidents | Avg incident length |
|------|----------|--------|----------------|-----------|---------------------|
| 1 | Anthropic | 99.72% | 12.3 hrs | 8 | 92 min |
| 2 | AWS Bedrock | 99.58% | 18.4 hrs | 5 | 221 min |
| 3 | Fireworks AI | 99.52% | 21.1 hrs | 12 | 105 min |
| 4 | Azure OpenAI | 99.47% | 23.3 hrs | 7 | 200 min |
| 5 | OpenAI | 99.31% | 30.3 hrs | 18 | 101 min |
| 6 | Google AI | 99.14% | 37.8 hrs | 14 | 162 min |
| 7 | Together AI | 99.08% | 40.4 hrs | 16 | 152 min |
| 8 | Groq | 98.94% | 46.6 hrs | 11 | 254 min |
| 9 | Mistral AI | 98.87% | 49.6 hrs | 13 | 229 min |
| 10 | Perplexity AI | 98.81% | 52.3 hrs | 9 | 349 min |
Sources: My monitoring infrastructure, 5-minute intervals, January 1 to June 30, 2025. Status pages: status.anthropic.com, status.openai.com, status.cloud.google.com.
Anthropic leads at 99.72%. That's 12.3 hours of total downtime in six months, across 8 incidents. For a production API, that's strong.
OpenAI at 99.31% had 18 separate incidents, the most of any provider in the table. Its incidents were more frequent than Anthropic's and slightly longer on average: 101 minutes vs 92 minutes.
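The average incident lengths in the table follow directly from total downtime divided by incident count. Here is a minimal sketch of that arithmetic; the dictionary and function names are mine, and only a few rows are copied in for illustration.

```python
# Figures copied from the rankings table: (total downtime in hours, incident count).
RANKINGS = {
    "Anthropic": (12.3, 8),
    "AWS Bedrock": (18.4, 5),
    "OpenAI": (30.3, 18),
    "Groq": (46.6, 11),
}


def avg_incident_minutes(downtime_hours: float, incidents: int) -> float:
    """Mean incident length in minutes: total downtime divided by incident count."""
    return downtime_hours * 60 / incidents


for provider, (hours, count) in RANKINGS.items():
    print(f"{provider}: {avg_incident_minutes(hours, count):.0f} min per incident")
```

Reading the two columns together matters: AWS Bedrock has the fewest incidents of the four, but each one runs more than twice as long as Anthropic's on average.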
The day-of-week pattern
This was unexpected. Outages cluster on specific days:
| Day | Incidents (all providers) | Percentage |
|-----|---------------------------|------------|
| Monday | 28 | 21% |
| Tuesday | 14 | 11% |
| Wednesday | 12 | 9% |
| Thursday | 24 | 18% |
| Friday | 18 | 14% |
| Saturday | 16 | 12% |
| Sunday | 20 | 15% |
Mondays and Thursdays account for 39% of all incidents, despite being 29% of the week.
My theory on Mondays: engineers deploy over the weekend (lower traffic), and issues surface Monday morning when load spikes. Thursday outages might correlate with end-of-sprint deploys (many teams run Thursday release cycles).
Sunday incidents (15%) are higher than I expected. My guess: batch processing jobs that run on weekends sometimes cause resource contention.
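The 39% vs 29% comparison can be reproduced from the day-of-week counts. This is a small sketch with names of my own choosing; the baseline assumes incidents would be spread evenly across the seven days if the day of the week did not matter.

```python
# Incident counts per day, copied from the day-of-week table.
INCIDENTS_BY_DAY = {
    "Monday": 28,
    "Tuesday": 14,
    "Wednesday": 12,
    "Thursday": 24,
    "Friday": 18,
    "Saturday": 16,
    "Sunday": 20,
}


def share_of_incidents(days, counts):
    """Fraction of all incidents that fell on the given days."""
    return sum(counts[day] for day in days) / sum(counts.values())


observed = share_of_incidents(["Monday", "Thursday"], INCIDENTS_BY_DAY)
expected = 2 / len(INCIDENTS_BY_DAY)  # even spread: two days out of seven

print(f"observed share: {observed:.1%}")
print(f"even-spread share: {expected:.1%}")
print(f"ratio: {observed / expected:.2f}x")
```

Mondays and Thursdays together carry roughly 1.4 times the incidents an even spread would give them. With 132 incidents in total, that is a pattern worth watching in H2 rather than a settled result.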
Time-of-day pattern (US Eastern)
| Time window (ET) | Incidents | Percentage |
|------------------|-----------|------------|
| 6am-10am | 31 | 24% |
| 10am-2pm | 38 | 29% |
| 2pm-6pm | 22 | 17% |
| 6pm-10pm | 18 | 14% |
| 10pm-6am | 23 | 17% |
The 10am-2pm window has the most outages (29%). These are peak US usage hours. Systems are under maximum load, and that's when capacity limits get hit.
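One caveat when reading the table: the windows are not the same length. The overnight bucket covers 8 hours and the others cover 4. A quick sketch that normalizes incidents per hour of window, using names of my own:

```python
# (window label, window length in hours, incidents) from the time-of-day table.
WINDOWS = [
    ("6am-10am", 4, 31),
    ("10am-2pm", 4, 38),
    ("2pm-6pm", 4, 22),
    ("6pm-10pm", 4, 18),
    ("10pm-6am", 8, 23),
]


def incidents_per_hour(windows):
    """Return {label: incidents per clock hour} so unequal windows are comparable."""
    return {label: count / hours for label, hours, count in windows}


rates = incidents_per_hour(WINDOWS)
for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {rate:.2f} incidents per hour of window")
```

Per hour, 10am-2pm still comes out on top, and the overnight bucket turns out to be the quietest by a wide margin, which its raw share hides.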
Error types
| Error type | Percentage of incidents |
|------------|-------------------------|
| 5xx server errors | 42% |
| Rate limiting (429) | 28% |
| Timeout (>30s response) | 18% |
| Connection refused | 8% |
| Other | 4% |
Rate limiting (429 errors) accounts for 28% of incidents. This isn't "downtime" in the traditional sense. The API is working. It's just refusing to serve your request because you've hit a limit.
Whether 429s count as "downtime" depends on your perspective. For a user who can't get a response, it feels like downtime.
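To make the 429 question concrete, here is a rough sketch of how a single check result could be sorted into the categories above, with a switch for whether rate limiting counts against uptime. This is an illustration, not my actual monitoring code: the function names, category labels, and the order in which the conditions are tested are all assumptions.

```python
from typing import Optional, Sequence

TIMEOUT_SECONDS = 30.0  # matches the ">30s response" threshold in the table


def classify_check(status: Optional[int], elapsed_s: float,
                   connection_refused: bool = False) -> str:
    """Map one check result to an error category from the table, or "ok"."""
    if connection_refused:
        return "connection_refused"
    if elapsed_s > TIMEOUT_SECONDS:
        return "timeout"
    if status == 429:
        return "rate_limited"
    if status is not None and 500 <= status <= 599:
        return "server_error"
    if status is not None and 200 <= status <= 299:
        return "ok"
    return "other"


def uptime_percent(results: Sequence[str], count_429_as_down: bool = True) -> float:
    """Percentage of checks that succeeded, under either definition of downtime."""
    down = {"server_error", "timeout", "connection_refused", "other"}
    if count_429_as_down:
        down.add("rate_limited")
    failed = sum(1 for result in results if result in down)
    return 100 * (1 - failed / len(results))


sample = [classify_check(200, 0.4), classify_check(429, 0.1),
          classify_check(503, 1.2), classify_check(200, 0.6)]
print(uptime_percent(sample, count_429_as_down=True))
print(uptime_percent(sample, count_429_as_down=False))
```

The same set of checks produces two different uptime figures depending on that one flag, so it is worth knowing which definition a published number uses.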
Provider comparison: incidents vs duration
| Provider | Short incidents (<1hr) | Long incidents (>3hr) | Longest single incident |
|----------|------------------------|-----------------------|-------------------------|
| Anthropic | 6 | 0 | 2hr 14min |
| OpenAI | 12 | 2 | 4hr 38min |
| Google AI | 8 | 3 | 5hr 12min |
| Groq | 4 | 4 | 7hr 21min |
Anthropic had no incident longer than 2 hours 14 minutes in my monitoring period. OpenAI's longest ran 4 hours 38 minutes. Groq had a 7+ hour incident in March.
A high share of short incidents (Anthropic: 6 of 8 under an hour; OpenAI: 12 of 18) suggests strong monitoring and fast response times. A high share of long ones (Groq: 4 of 11 over three hours) suggests harder-to-diagnose issues.
What this means for production
If your SLA requires 99.9% uptime from your AI provider, no one met that in H1 2025. Anthropic (99.72%) and AWS Bedrock (99.58%) came closest. Everyone else fell further short.
| Your SLA requirement | Providers that meet it |
|----------------------|------------------------|
| 99.9% (8.8 hrs/year) | None (extrapolating H1 data) |
| 99.5% (43.8 hrs/year) | Anthropic, AWS Bedrock, Fireworks AI |
| 99.0% (87.6 hrs/year) | The top seven in the rankings (Anthropic through Together AI) |
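The SLA tiers can be checked against the H1 rankings with a few lines. This is a sketch with names of my own; it treats the H1 uptime figure as a stand-in for the full year, which is an extrapolation.

```python
# H1 2025 uptime percentages, copied from the rankings table.
H1_UPTIME = {
    "Anthropic": 99.72,
    "AWS Bedrock": 99.58,
    "Fireworks AI": 99.52,
    "Azure OpenAI": 99.47,
    "OpenAI": 99.31,
    "Google AI": 99.14,
    "Together AI": 99.08,
    "Groq": 98.94,
    "Mistral AI": 98.87,
    "Perplexity AI": 98.81,
}

HOURS_PER_YEAR = 365 * 24


def annual_downtime_budget_hours(sla_percent: float) -> float:
    """Hours of downtime per year that a given uptime SLA allows."""
    return (100 - sla_percent) / 100 * HOURS_PER_YEAR


def providers_meeting(sla_percent: float) -> list:
    """Providers whose H1 uptime was at or above the SLA threshold."""
    return [name for name, uptime in H1_UPTIME.items() if uptime >= sla_percent]


for sla in (99.9, 99.5, 99.0):
    budget = annual_downtime_budget_hours(sla)
    print(f"{sla}% ({budget:.1f} hrs/year): {providers_meeting(sla) or 'None'}")
```

Azure OpenAI misses the 99.5% tier by 0.03 percentage points, so small differences in how an incident is counted could move a provider across a threshold.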
For mission-critical applications, multi-provider failover is still a necessity. No single AI API provider is reliable enough to be your only dependency.
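As a rough illustration of what multi-provider failover looks like in code, here is a minimal sketch that tries providers in priority order and returns the first successful response. Everything in it is hypothetical: the function names, the exception class, and the stub providers stand in for real SDK calls, which are not shown.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the list has failed."""


def complete_with_failover(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (name, response) from the first success."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # production code would catch only transport and 5xx/429 errors
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers for demonstration only; real ones would wrap each vendor's client.
def primary_stub(prompt: str) -> str:
    raise TimeoutError("no response within 30s")


def backup_stub(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_failover(
    "hi", [("primary", primary_stub), ("backup", backup_stub)]
)
print(used, reply)
```

A real setup would also need per-provider timeouts, retry limits, and some handling for the fact that different models answer the same prompt differently. The sketch covers only the ordering logic.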
My monitoring script will keep running. I'll publish the H2 report in January. The trends are improving (Q2 was better than Q1 for most providers), but we're still not at "boring infrastructure" reliability levels.
-- dataku