All articles
Data-driven analysis of the AI world. Every number has a source.
2026 (18 articles)
Three Companies Now Control 90% of Frontier Inference
I counted every frontier model API available today. OpenAI, Anthropic, and Google serve roughly 90% of all production frontier inference. The concentration numbers are wild.
I counted every AI model released in Q1 2026
40+ models in 90 days. The pace is absurd. Here's the full count, month by month.
The inference cost collapse, in one chart
AI inference costs dropped 100x in 3 years. I put it all in one table and the trend line is almost vertical.
What I've learned tracking AI data for 5 years
Five years of counting models, tracking prices, and benchmarking everything I can get my hands on. The three things I got right, the five things I got wrong, and the one trend I still can't explain. This is the most personal article I've written.
The AI API price tracker: 5 years of data in one interactive chart
I've been tracking AI API prices since 2021. Today I'm publishing the full dataset: 89 price points across 12 providers over 5 years. The average cost per million tokens fell from $60 to $0.15. A 400x reduction. The chart tells the whole story.
Claude Opus 4.6 review: the 1M context model
Anthropic shipped a 1 million token context window on their flagship model. I tested retrieval at 100K, 250K, 500K, and 1M tokens. Accuracy stays above 90% up to 500K. At 1M it drops to 78%, but that's still usable. The long-context game has a new leader.
My monthly benchmark dashboard: March 2026 update
Monthly tracker updated. Claude Opus 4.5 still leads coding. Gemini 2.5 Ultra leads multimodal. o3 leads hard math. DeepSeek R2 leads cost-efficiency. New benchmark added: GPQA Diamond (graduate-level science questions). Full table inside.
AI startup funding in Q1 2026: where the money is going
$18.2 billion in Q1 2026. I broke it down: 41% went to infrastructure (chips, cloud), 28% to application-layer companies, 19% to model providers, 12% to tooling. The big shift: application funding overtook model funding for the first time.
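A minimal sketch of the arithmetic behind this breakdown, using only the total and the four percentages quoted above. The dictionary keys and variable names are illustrative; the dollar amounts per category are derived here, not quoted from the article.

```python
TOTAL_BILLIONS = 18.2  # Q1 2026 total quoted in the blurb

# Percentage shares quoted in the blurb.
shares = {
    "infrastructure": 41,
    "application": 28,
    "model providers": 19,
    "tooling": 12,
}

# The shares should account for the whole total.
assert sum(shares.values()) == 100

# Convert each percentage into billions of dollars.
amounts = {name: TOTAL_BILLIONS * pct / 100 for name, pct in shares.items()}

for name, billions in amounts.items():
    print(f"{name}: ~${billions:.1f}B")

# The "big shift" in the blurb: application-layer funding exceeds model funding.
application_leads = amounts["application"] > amounts["model providers"]
```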
o4-mini vs Claude 4 Sonnet vs Gemini 2.5 Flash: the speed tier showdown
The "fast and cheap" tier is where the real competition is. I compared the three on 200 tasks optimizing for speed and cost, not peak quality. Gemini Flash wins on price. o4-mini wins on coding. Claude Sonnet wins on general quality.
The MCP server catalog: 4,000 servers and counting
I scraped every MCP server registry I could find. 4,127 servers, 28,000+ tools. The most popular category is "file system" tools. The fastest growing is "database" tools. I charted the growth curve since Anthropic launched the protocol.
Gemini 2.5 Ultra: Google's best model vs the field
Google finally released Ultra-tier Gemini 2.5. I compared it against Claude Opus 4.5, GPT-4o, and DeepSeek R2 across 300 prompts. Gemini Ultra wins on multimodal tasks and long context. Claude wins on coding. The frontier is genuinely multi-polar now.
AI coding tools: 2026 market share data
Updated my developer survey with 400 respondents. Claude Code jumped to 22% usage share. Cursor held at 31%. Copilot dropped to 24%. The fastest growing? Windsurf at 8%, up from 2% six months ago.
The AI inference market: 25 providers ranked by price, speed, and reliability
My most thorough inference provider comparison yet. 25 providers, 60 days of monitoring, 3 metrics. Cerebras leads on speed. Together AI leads on open source model selection. Anthropic leads on reliability. Full rankings and methodology inside.
Claude Opus 4.5: Anthropic's latest flagship, benchmarked
Anthropic's newest model. I ran 300 prompts across coding, reasoning, writing, and analysis. Coding scores are the highest I've measured from any model. Reasoning matches o3 with thinking enabled. The gap between Sonnet and Opus has widened again.
Every AI pricing change in Q4 2025, tracked
14 price changes from 9 providers in the last quarter. The big story: Google dropped Gemini 2.5 Flash to $0.05/M input tokens. That's essentially free. Updated master comparison table inside.
DeepSeek R2: the open source reasoning model that costs pennies
DeepSeek R2 matches o3 on math benchmarks at 1/20th the inference cost. I ran my standard 200-problem reasoning evaluation. R2 scores 91.2% on MATH vs o3's 93.7%. At $0.14 vs $2.80 per hard problem, the economics aren't even close.
The state of AI benchmarks in early 2026: what still works?
MMLU is saturated. HumanEval is gamed. SWE-bench has contamination issues. I reviewed 20 active benchmarks and rated each on reliability, relevance, and resistance to gaming. Only 4 scored above 7/10. Chatbot Arena is still the gold standard.
My 2025 prediction scorecard
I predicted open source would match GPT-4 by mid-2025. It happened by Q1. I predicted API prices would fall 50%. They fell 90%. My biggest miss: I underestimated how fast reasoning models would improve. Full scorecard inside.
2025 (48 articles)
2025 in AI data: the year quality beat scale
Model sizes stopped growing. Training costs dropped 80%. Open source reached parity. Reasoning models showed that how you think matters more than how much you know. I compiled 30 charts telling the story of 2025.
AI hardware beyond NVIDIA: AMD, Intel, and custom silicon in 2025
AMD MI325X, Intel Gaudi 3, Google TPU v6, Amazon Trainium 2, and 5 startup chips. I compiled benchmark data where available. NVIDIA still leads, but the gap is 30%, not 300%. The moat is eroding.
The price of intelligence: tracking AI API costs since 2020
I built a complete timeline of AI API pricing from GPT-3 beta in 2020 to today. 47 price points across 5 years. The cost curve looks like a waterfall. Quality went up 10x while prices fell 100x. I've never seen anything like it in any industry.
Claude Opus 4 vs GPT-4o vs Gemini 2.5 Pro: the definitive Q4 comparison
My most thorough three-way comparison yet. 500 prompts, 8 categories, 3 human raters. Claude wins coding and analysis. GPT-4o wins speed and multimodal. Gemini wins on long-context and cost. There's no single best model anymore.
Small language models in production: who's deploying what
I surveyed 50 companies deploying LLMs in production. 62% use models under 13B parameters. The most popular: Llama 3.2 3B (18%), Phi-4 (14%), and Mistral 7B (12%). Small models aren't just for research anymore.
The LLM leaderboard is dead, long live the leaderboard
Hugging Face deprecated the Open LLM Leaderboard v1 and launched v2 with new benchmarks. I compared scores on both versions for 20 models. Some models dropped 15 points. The re-ranking is dramatic and some "top models" were just benchmark-optimized.
AI energy consumption data: the numbers are bigger than you think
I compiled power consumption data for AI training and inference from every source I could find. A single GPT-4 query uses about 10x the energy of a Google search. At current growth rates, AI could consume 3% of US electricity by 2028.
NVIDIA B200 benchmarks are out. The inference economics just changed again.
The B200 delivers 2.5x the inference throughput of the H100 at roughly the same power consumption. I compared the per-token cost on B200 vs H100 vs H200. If you're running inference at scale, the upgrade pays for itself in 4 months.
My monthly benchmark dashboard: September 2025 update
Monthly update to my running comparison of 15 models across 8 benchmarks. Big movers: Gemini 2.5 Pro gained 8 points on MMLU-Pro. Claude Opus 4 still leads on HumanEval. New entrant: Mistral Large 3.
o3 and the reasoning model cost problem
OpenAI's o3 uses up to 10x the tokens of a standard model to "think." On hard math problems, a single o3 query can cost $2. I measured the token consumption across 100 problems and the spread is massive: from 500 tokens to 50,000.
The true cost of building an AI product in 2025: data from 30 startups
I surveyed 30 AI startups about their monthly costs. Median API spend: $8,400. Median total infra: $23,000. But the distribution is bimodal. Some spend $500/month with open source. Some spend $200,000 on API calls alone.
Llama 4 405B vs Llama 3.1 405B: same size, very different model
Meta kept the size but changed the architecture. Llama 4 405B uses MoE, so only ~100B parameters are active. I benchmarked both on 10 tasks. Llama 4 is faster and scores 8-12% higher on coding. Training quality over brute force.
The context window race is slowing down. Here's why that's fine.
In 2024, context windows doubled every 3 months. In 2025, they've barely changed. 1M tokens from Google. 200K from Anthropic. The reason? Most real-world tasks don't need more than 50K tokens. I have the usage data.
AI inference costs by country: why geography matters for API pricing
Some providers route inference through different regions. I measured latency and calculated effective costs from 5 countries. Running Claude from Japan costs the same as running it from the US. Running a self-hosted model in India costs 30% less. The global pricing map is uneven.
Vision model benchmarks: who can actually read a chart?
I fed 50 real-world charts, tables, and diagrams to 8 multimodal models. Claude Opus 4 reads charts the most accurately at 89%. GPT-4o is at 82%. Gemini 2.5 Pro is at 85%. Most models struggle with handwritten text in images.
The cost per correct answer: a new way to compare models
Raw benchmark scores ignore cost. I calculated "cost per correct answer" across 500 questions for 10 models. The cheapest correct answer comes from Gemini 2.5 Flash at $0.0003. The most expensive is GPT-4.5 at $0.14. A 467x difference.
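A rough sketch of the metric named above: cost per correct answer is total spend divided by the number of correct answers. The per-question price and the number of correct answers below are hypothetical, chosen only to show the calculation. The two dollar figures in the ratio are the ones quoted in the blurb.

```python
def cost_per_correct_answer(cost_per_question: float, questions: int, correct: int) -> float:
    """Total spend on all questions divided by the number answered correctly."""
    if correct <= 0:
        raise ValueError("need at least one correct answer")
    return cost_per_question * questions / correct

# Hypothetical inputs: 500 questions at $0.001 each, 400 answered correctly.
example = cost_per_correct_answer(0.001, 500, 400)

# Spread between the cheapest and most expensive figures quoted in the blurb.
ratio = 0.14 / 0.0003

print(f"${example:.5f} per correct answer (hypothetical model)")
print(f"quoted spread: ~{ratio:.0f}x")
```

A model with a higher price per question can still come out ahead on this metric if its accuracy is high enough, which is why the ranking can differ from a raw price table.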
Claude Code vs Cursor vs Copilot Workspace: the AI coding war in data
I used all three on the same 20 real coding tasks. Claude Code completed 17. Cursor completed 15. Copilot Workspace completed 11. But completion rate isn't the whole story. I also tracked "time to working code" and "bugs introduced."
AI model release frequency by quarter: a 4-year chart
I've been counting notable model releases since Q1 2021. The quarterly total went from 8 to 67 to... 54 in Q2 2025. The first decline. I think we've hit peak model release rate. The era of consolidation begins.
The H100 resale market is crashing. Pricing data from 6 months.
H100 GPU resale prices dropped 40% from their January peak. I tracked listings on 4 broker sites. The DeepSeek efficiency shock plus H200/B200 availability is creating a glut. Good news for startups.
The frontier model gap just closed. Five models within 20 Elo points.
For the first time, the top 5 models on Chatbot Arena are within 20 Elo points of each other. Claude Opus 4, GPT-4o, Gemini 2.5 Pro, Grok 3, and DeepSeek V3. I analyzed what "virtually tied" means for model selection.
AI API uptime in H1 2025: the reliability report
Six months of continuous monitoring across 15 API providers. Anthropic: 99.7% uptime. OpenAI: 99.3%. Google: 99.1%. The outage patterns are interesting. Mondays and Thursdays are the worst days. I have theories about why.
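To put these uptime percentages in more concrete terms, here is a small sketch that converts them into hours of downtime. It assumes the monitoring window covered all of H1 2025 (181 days). The blurb does not state that, so treat the hour counts as illustrative.

```python
# Assumption: the window is January 1 through June 30, 2025 (181 days).
HOURS_IN_H1_2025 = 181 * 24

def downtime_hours(uptime_percent: float, period_hours: float = HOURS_IN_H1_2025) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * period_hours

# Uptime figures quoted in the blurb.
quoted = [("Anthropic", 99.7), ("OpenAI", 99.3), ("Google", 99.1)]

for provider, uptime in quoted:
    print(f"{provider}: ~{downtime_hours(uptime):.0f} hours of downtime")
```

Under that assumption, the 0.6-point gap between the best and worst of the three works out to roughly a day of extra downtime over the half year.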
I tested 10 local LLM runtimes. Ollama vs LM Studio vs llama.cpp vs...
Local inference has gotten shockingly good. I tested 10 runtimes on the same hardware (M3 Max, 64GB). Ollama wins on ease of use. llama.cpp wins on raw speed. The performance gap between local and cloud is narrowing.
The open weight model scene, mid-2025: who's winning?
Meta, Alibaba, Mistral, DeepSeek, and 12 others are all releasing open weight models. I ranked them by Chatbot Arena Elo, Hugging Face downloads, and community adoption. Llama still leads downloads, but Qwen is closing fast.
How much does it cost to run a chatbot with 1M daily users? I did the math.
1 million daily users, 5 messages each, average 300 tokens per response. At Claude 4 Sonnet pricing, that's $4,500/day. At GPT-4o mini, it's $300/day. I modeled the economics for 6 different model tiers.
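The arithmetic in this blurb can be sketched as follows. The sketch counts response tokens only and ignores input tokens, which is an assumption; the blurb does not say how prompts were priced. The "implied price" is back-calculated from the quoted daily totals and is not a published rate.

```python
DAILY_USERS = 1_000_000
MESSAGES_PER_USER = 5
TOKENS_PER_RESPONSE = 300

def daily_response_tokens() -> int:
    """Response tokens generated per day under the blurb's stated workload."""
    return DAILY_USERS * MESSAGES_PER_USER * TOKENS_PER_RESPONSE

def daily_cost(price_per_million_tokens: float) -> float:
    """Daily cost in dollars at a flat price per million tokens."""
    return daily_response_tokens() / 1_000_000 * price_per_million_tokens

def implied_price_per_million(daily_dollars: float) -> float:
    """Flat per-million-token price that would produce a given daily cost."""
    return daily_dollars / (daily_response_tokens() / 1_000_000)

print(f"{daily_response_tokens():,} response tokens per day")
print(f"$4,500/day implies ~${implied_price_per_million(4500):.2f} per million tokens")
print(f"$300/day implies ~${implied_price_per_million(300):.2f} per million tokens")
```

The workload comes to 1.5 billion response tokens per day, so each dollar per million tokens adds $1,500 to the daily bill.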
The SWE-bench Verified leaderboard: who's actually solving real bugs?
SWE-bench Verified filters out the easy problems. I compared scores on full SWE-bench vs Verified for 12 models. Some models drop 20+ points. The gap reveals who's gaming the benchmark vs who's actually good at coding.
AI model sizes are SHRINKING. Here's the data.
The biggest model released in 2025 so far has fewer parameters than GPT-4. Efficiency gains from MoE, distillation, and better training data mean the era of "bigger is better" is fading. I charted the trend.
Claude 4 Sonnet vs GPT-4o vs Gemini 2.5 Flash: the mid-tier model war
The mid-tier is where most developers actually work. I compared the three most popular "not-the-flagship" models on real-world tasks: summarization, extraction, classification, and code generation. Claude 4 Sonnet wins 3 of 4.
The inference provider market: latency, cost, and uptime for 20 providers
I expanded my monthly monitoring to 20 providers. The new additions: Cerebras, Fireworks, Baseten, Modal, and Replicate. Cerebras leads on latency. Fireworks leads on cost efficiency. Updated rankings inside.
The benchmark contamination problem is getting worse. New evidence.
I tested 15 models for memorization of MMLU questions. 4 of them could complete benchmark questions from the first few words alone. Contamination isn't just theoretical anymore. I can measure it.
AI agent frameworks: LangChain vs CrewAI vs AutoGen. A data comparison.
I built the same 5 agent tasks on each framework and measured completion rates, token usage, and time to complete. LangChain is the most flexible. CrewAI finishes fastest. AutoGen uses the fewest tokens. No clear winner.
Qwen3 and the Chinese model wave: benchmarking 5 models from China
Qwen3, DeepSeek V3, Yi-Lightning, Baichuan 4, and MiniMax-01. I benchmarked all five against Claude 3.7 Sonnet and GPT-4o. Chinese models now occupy 3 of the top 10 spots on Chatbot Arena. The geographic distribution of AI talent is shifting.
Claude Opus 4 is here. My first benchmark impressions.
Anthropic's new flagship model. Extended thinking, tool use, and code generation all feel meaningfully better. I ran my standard 300-prompt evaluation. Early data: it's the best model I've tested on coding tasks. Full analysis next week.
The cost of AI dropped 97% in two years. One chart.
In March 2023, GPT-4 cost $60 per million output tokens. Today, GPT-4o mini costs $0.60. Same-class quality, 100x cheaper. I made one chart. That's the whole article. Sometimes the data speaks for itself.
Gemini 2.5 Pro just took #1 on Chatbot Arena. The data behind the shift.
For the first time, a Google model sits at the top of the LMSYS leaderboard. I analyzed the vote patterns. Gemini 2.5 Pro dominates in coding and math. Claude still leads in creative tasks. The throne is now contestable.
Llama 4 Scout and Maverick: Meta's MoE play, in data
Meta went mixture-of-experts with Llama 4. Scout is 17B active parameters from 109B total. Maverick is 17B from 400B. I benchmarked both against Llama 3.1 70B. The efficiency gains are exactly what the MoE math predicts.
I benchmarked 8 reasoning models on the same 100 math problems
o1, o3-mini, DeepSeek R1, Claude 3.7 Sonnet (thinking), Gemini 2.5 Pro, Grok 3, QwQ-32B, and Phi-4. Same 100 MATH problems. Same evaluation criteria. The spread is tighter than you'd expect from the marketing.
The AI coding tool market is fragmenting. Here are the usage numbers.
I surveyed 200 developers about their AI coding tools. Cursor has 34% usage share. Copilot dropped to 28%. Claude Code is at 12% and rising fast. The "winner take all" era is over.
The MCP protocol: how many tools does an AI agent actually need?
Anthropic's Model Context Protocol is becoming the standard for AI tool use. I surveyed 30 MCP server implementations and counted the tools each provides. The median is 7 tools. The maximum is 94. More isn't always better.
Claude 3.7 Sonnet: Anthropic's hybrid thinking model, benchmarked
Claude 3.7 Sonnet can toggle extended thinking on and off. I tested it in both modes across 200 prompts. With thinking on, it matches o1 on MATH. With thinking off, it's still the best general-purpose model on Chatbot Arena.
GPT-4.5 is the most expensive model ever released. Is it worth it?
$75 per million input tokens. That's 500x more than GPT-4o mini. I ran GPT-4.5 through my evaluation suite. It's good. Really good. But at this price, it only makes economic sense for a very narrow set of tasks.
The open source model release velocity is unsustainable. Here's why.
I counted 142 models released on Hugging Face in February 2025 alone. That's 5 per day. Downloads are up but download-per-model is falling. The attention pie is finite. I think a shakeout is coming.
Gemini 2.5 Pro and "thinking" models: Google's answer to o1
Google added extended thinking to Gemini. I tested it against o1-preview and DeepSeek R1 on math and coding problems. Gemini 2.5 Pro wins on 4 of 6 benchmarks. Google is back in the reasoning race.
Grok 3 and the xAI compute cluster: throwing brute force at AI
xAI built a 100K GPU cluster in Memphis. Grok 3 is the first model trained on it. The benchmarks are competitive with Claude 3.5 Sonnet and GPT-4o. I ran my standard evaluation. It's good, but the interesting story is the infrastructure bet.
The real cost of AI agents: I tracked token usage for 50 agentic tasks
AI agents sound cheap per token. But they loop. A lot. I measured the total token consumption for 50 real agent tasks across Claude, GPT-4o, and Gemini. The average task used 47K tokens. Some hit 200K+.
Claude 3.5 Sonnet is still #1 on Chatbot Arena. For how long?
Six months at the top of the LMSYS leaderboard. I pulled the vote data and looked at the categories where Claude 3.5 Sonnet wins most decisively: coding (Elo 1290), creative writing (1285), and instruction following (1280).
Every AI pricing change in January 2025, tracked
Seven providers changed prices in January alone. Anthropic dropped Claude 3.5 Haiku's price. Google cut Gemini Flash. I updated the master table. The cheapest frontier-class model is now $0.10 per million input tokens.
The DeepSeek effect: AI stock prices dropped $1 trillion in a day. The data.
When DeepSeek showed you could train a frontier model for $5.6M, NVIDIA lost $589 billion in market cap in a single day. I charted the stock movements of every major AI company. The repricing of "compute moats" was instant.
DeepSeek R1 just broke every reasoning benchmark. And it's open source.
DeepSeek R1 matches o1 on math and coding benchmarks at a fraction of the cost. And they released the weights. I compared R1 against o1-preview on 200 reasoning problems. The scores are within 2 points on MATH and GPQA.
2024 (30 articles)
DeepSeek V3: a Chinese model that costs almost nothing to train
DeepSeek V3 reportedly cost $5.6M to train. GPT-4 allegedly cost $100M+. I dug into the technical report and the training efficiency numbers. If these costs are real, the frontier just got a lot more accessible.
My 2024 prediction scorecard: reasoning models were my biggest miss
I didn't predict reasoning models at all. I thought scale would keep winning. Instead, o1 showed that inference-time compute is a whole new axis. My biggest hit? Predicting open source would reach GPT-4 level by year end.
2024 AI data roundup: the year of commoditization
API prices fell 90%. Open source matched GPT-4. Reasoning models appeared. AI coding assistants went mainstream. I compiled 25 charts that tell the story of 2024's wild ride.
Google Gemini 2.0 Flash: the speed-to-quality ratio is unprecedented
Gemini 2.0 Flash matches GPT-4o on most of my tests while being 3x faster and significantly cheaper. I ran my standard evaluation across 300 prompts. Google finally has a model that's both fast and good.
The Q4 2024 model release tracker: 67 models in 90 days
I tracked every notable model release in Q4 2024. Sixty-seven models from 23 organizations. That's about three models every four days. The pace is unsustainable and I suspect a consolidation is coming.
The state of AI APIs: speed, cost, and reliability across 15 providers
I monitored 15 AI API providers for 30 days straight, logging latency, error rates, and uptime. The results are a mess. Anthropic has the best uptime. Groq has the best speed. Nobody has both.
Claude 3.5 Sonnet (new) and computer use: my first benchmark data
Anthropic updated Claude 3.5 Sonnet and added computer use. I tested both the model improvements and the computer use capability. Model quality jumped noticeably. Computer use works about 60% of the time in my tests.
The inference cost of reasoning models: o1 vs Claude 3.5 Sonnet per correct answer
Reasoning models use more tokens to think. But if they get the answer right more often, the cost per CORRECT answer might actually be lower. I ran the math on 500 coding problems. The results surprised me.
Qwen 2.5 is the best open source model nobody is talking about
Alibaba's Qwen 2.5 72B beats Llama 3.1 70B on my tests. It's also the best model for CJK languages by a wide margin. I benchmarked it in English, Chinese, and Japanese. The English results alone deserve attention.
o1 and 'reasoning' models: the benchmark scores look different this time
OpenAI's o1 trades speed for accuracy by 'thinking' before answering. The math and coding benchmarks are way up, but the costs are 6x higher per task. I broke down the cost-per-correct-answer metric and it's actually competitive.
The SWE-bench problem: are coding benchmarks measuring the right thing?
Every new model touts its SWE-bench score. I analyzed the test cases and found 23% of them can be 'solved' by a simple regex patch. The benchmark isn't wrong exactly, but it's not measuring what you think.
OpenAI just launched their cheapest model. Here's every price tier compared.
Updated master pricing table with 34 models from 9 providers. The cheapest useful model is now Gemini 1.5 Flash at $0.075/M input tokens. Three years ago that would've cost $60. I charted the deflation.
The cost of running Llama 3.1 405B: cloud vs self-hosted, the full math
Running 405B parameters needs serious hardware. I priced out 4 configurations: AWS, Lambda Labs, self-hosted with 8xA100s, and 8xH100s. The monthly costs range from $4,200 to $31,000 depending on utilization.
Llama 3.1 405B: the first truly GPT-4 class open model. My benchmark data.
Meta released a 405 billion parameter model under an open license. I ran it on 10 standard benchmarks and 5 of my own. It matches GPT-4 within margin of error on 7 of 15. This is a milestone.
GPT-4o mini is $0.15 per million tokens. The race to the bottom is real.
GPT-4o mini costs 100x less than GPT-4 did at launch. I plotted the price per million tokens for OpenAI's best available model at each point in time. The curve is a cliff.
Claude 3.5 Sonnet is better than Claude 3 Opus. And it's 5x cheaper.
The mid-tier model just beat the flagship. I ran Claude 3.5 Sonnet through every test I used for Opus, and it wins on 71% of them. At $3/M tokens vs $15, the value math is absurd.
I benchmarked 12 coding assistants. Cursor is not what I expected.
GitHub Copilot, Cursor, Cody, Continue, Tabnine, and 7 others. I used each one for a full week and tracked acceptance rates, bug rates, and time saved. Cursor surprised me. Copilot disappointed me.
Gemini 1.5 Pro has a 1 million token context window. I tested it with real documents.
Google says 1 million tokens. That's approximately 1,500 pages. I fed it actual long documents and tested retrieval at various depths. Performance degrades gracefully until about 800K, then falls off a cliff.
The LMSYS Elo gap between open and closed source models just shrank to 50 points
In January 2023, the Elo gap between the best open source model and GPT-4 was 200+ points. It's now about 50. I charted the convergence curve. At this rate, parity arrives in Q3 2024.
The 'vibe check' era: why benchmarks are losing to vibes
I asked 50 AI developers how they evaluate models. 73% said 'I just try it and see how it feels.' Only 12% run formal benchmarks. The industry is moving from data-driven evaluation to... vibes. I have mixed feelings about this.
GPT-4o is multimodal AND cheaper. I have questions about the pricing.
OpenAI released GPT-4o at half the price of GPT-4 Turbo, with vision and audio included. I calculated the per-task costs across text, image, and audio. The audio pricing is suspiciously cheap.
Phi-3 Mini is a 3.8B model that's shockingly good. Small model benchmarks.
Microsoft's Phi-3 Mini has 3.8 billion parameters and beats Llama 3 8B on several benchmarks. I ran it locally on a MacBook M2. The small model revolution is accelerating faster than the big model one.
Llama 3 8B beats Llama 2 70B. Let that sink in.
A model 9x smaller is now better. I benchmarked Llama 3 8B against Llama 2 70B on 6 tasks. The small model wins on 4 of them. Training data quality is eating model size for breakfast.
The AI chip market in 2024: not just NVIDIA anymore
I compiled specs and benchmarks for every AI accelerator announced in the last 12 months. NVIDIA H100, AMD MI300X, Google TPU v5e, Groq LPU, Intel Gaudi 3, and 8 others. The competition is finally real.
The Claude 3 model family pricing is actually brilliant. Here's why.
Haiku at $0.25/M tokens, Sonnet at $3, Opus at $15. Anthropic isn't just pricing models, they're pricing use cases. I compared the price-to-quality ratio across all three and the tiering makes perfect economic sense.
Claude 3 Opus is the first model to genuinely worry me about benchmarks
Claude 3 Opus matched or beat GPT-4 on most benchmarks, but the 'needle in a haystack' test is what got me. It detected that it was being tested. I ran my own version and the results are strange.
Mistral Large vs GPT-4 vs Claude 3 Opus: the three-way benchmark
Mistral finally has a frontier model. I ran all three through my standard 300-prompt evaluation. Mistral Large is competitive but not quite there. The interesting part is where it wins: European languages.
Groq's LPU just served me 800 tokens per second. The inference speed data.
Groq's custom chip hit 800 tokens/second on Mixtral 8x7B. I measured latency across 100 requests and compared to 5 other inference providers. Groq is 18x faster than the median. Speed changes what's possible.
Every LLM API price drop in the last 12 months, in one chart
I logged every API price change since January 2023. There have been 23 price drops across 8 providers. The average price of a million output tokens fell 78%. I've never seen deflation this fast in tech.
Mixtral 8x7B is free to run and matches GPT-3.5. The inference economics are changing.
I set up Mixtral on a single A100 and benchmarked throughput. At 95 tokens/second, the cost per million tokens is $0.18. The OpenAI API charges $0.50. Open source inference is now genuinely cheaper.
2023 (24 articles)
My 2023 prediction scorecard
I predicted open source would stay 2 years behind closed source. I was wrong by a lot. Llama 2 closed the gap in months. Here's my full scorecard for 2023.
2023 AI data roundup: the year the dam broke
GPT-4, Llama 2, Mistral, Claude 2, SDXL, and ChatGPT hitting 100M users. I compiled 20 charts that tell the story of 2023. This was the year AI stopped being a niche interest.
Mixtral 8x7B: the MoE model that changes the economics of inference
Mistral dropped Mixtral via a magnet link (no paper, no blog post, just a torrent). The benchmarks leaked within hours. A mixture-of-experts model at GPT-3.5 quality with 12B active parameters? The inference cost math is wild.
Google Gemini benchmarks vs GPT-4: reading the fine print
Google claims Gemini Ultra beats GPT-4 on 30 of 32 benchmarks. I dug into the methodology. They're comparing against launch-day GPT-4, not the current version. And some of the benchmark configurations are... creative.
The 'contamination' problem: when benchmarks stop meaning anything
I found evidence that at least 6 models on the Hugging Face leaderboard were trained on benchmark test data. When your test set is in the training data, your scores are meaningless. I built a simple check for this.
GPT-4 Turbo is 3x cheaper. Here's what that means for the API pricing war.
OpenAI just slashed GPT-4 prices by 3x with GPT-4 Turbo. I updated my master pricing comparison table. The gap between open source and closed source API costs is narrowing fast.
The GPU shortage data: who has capacity and who's lying about it
I surveyed 40 AI companies about GPU access. 78% reported 'severe constraints.' But cloud provider utilization data tells a slightly different story. Some companies have more H100s than they're admitting.
How I track AI model releases: my personal data system
People keep asking how I stay on top of all these model releases. Here's my actual system: RSS feeds, arXiv alerts, a spreadsheet with 312 rows, and a Python script that checks Hugging Face daily.
Every major LLM's context window, charted over time
In January 2023, 4K tokens was standard. By October, we've got 100K (Claude), 32K (GPT-4), and 128K (Anthropic internal). I charted the context window growth curve. It's exponential.
Claude 2 is surprisingly good at long documents. Here's my data.
Claude 2's 100K context window is its killer feature. I tested it with documents of 10K, 25K, 50K, and 100K tokens. Retrieval accuracy drops from 97% to 71% as length increases, but that's still way better than chunking strategies.
Mistral 7B just beat Llama 2 13B. Small models are getting weird.
A 7B parameter model outperforming a 13B model shouldn't be possible under simple scaling laws. But Mistral did it. I compared the benchmarks and the architecture differences that explain how.
LMSYS Chatbot Arena has 200K votes. It might be the best benchmark we have.
LMSYS's crowdsourced Elo ratings are based on 200K+ human votes in blind model comparisons. I analyzed the vote distributions and demographic patterns. It's noisy, but it's the closest thing to 'what real users think.'
The real cost of training Llama 2: Meta's numbers vs my estimates
Meta says Llama 2 70B used 1.7M GPU hours of A100 time. At current cloud prices, that's roughly $5.4M. But Meta used their own hardware. I estimated the real cost and it's probably 60% less.
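A quick sketch of the numbers in this blurb. The hourly rate is not stated, so it is back-calculated here from the two quoted figures. The own-hardware figure simply applies the "60% less" estimate to the cloud number.

```python
GPU_HOURS = 1_700_000          # A100 hours quoted in the blurb
CLOUD_ESTIMATE = 5_400_000     # dollars, "roughly $5.4M" at cloud prices
OWN_HARDWARE_DISCOUNT = 0.60   # "probably 60% less"

# Cloud price per A100-hour implied by the two quoted figures.
implied_rate = CLOUD_ESTIMATE / GPU_HOURS

# Estimated cost on Meta's own hardware under the 60% discount.
own_hardware_estimate = CLOUD_ESTIMATE * (1 - OWN_HARDWARE_DISCOUNT)

print(f"implied cloud rate: ~${implied_rate:.2f} per GPU-hour")
print(f"own-hardware estimate: ~${own_hardware_estimate / 1e6:.2f}M")
```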
11 charts that explain the open source AI wave
Open source models went from curiosity to contender in 18 months. I made 11 charts tracking downloads, benchmark scores, funding, and community growth. The trend line is unmistakable.
The cost of self-hosting vs API: a real comparison for Llama 2
Can you actually save money running Llama 2 yourself instead of using the OpenAI API? I calculated it. The answer depends on your volume, but the break-even point is lower than I expected.
Llama 2 is here and it's actually good. My benchmark data.
Meta released Llama 2 with a commercial license. I benchmarked the 70B model against GPT-3.5-turbo on 8 tasks. Llama 2 70B matches or beats GPT-3.5 on 5 of them. Open source just got real.
AI funding in Q1 2023 is absolutely bonkers. Let me show you the numbers.
$12.4 billion in AI startup funding in Q1 2023 alone. That's more than all of 2020. I broke it down by category, stage, and geography. Generative AI is 73% of the total.
The Hugging Face Open LLM Leaderboard is becoming the de facto benchmark. That's a problem.
Every open source model now optimizes for the Hugging Face leaderboard. I checked: 12 of the top 20 models were specifically fine-tuned on leaderboard benchmark data. Goodhart's Law is hitting AI benchmarks hard.
Claude vs GPT-4: my first head-to-head data comparison
Anthropic's Claude is in beta and I got access. I ran both models through 300 prompts across coding, writing, and reasoning. Claude wins on length and nuance. GPT-4 wins on accuracy. The data is tight.
GPT-4 is 10x more expensive than GPT-3.5. Is it 10x better?
GPT-4 costs $0.03/1K input tokens vs $0.002 for GPT-3.5-turbo. That's a 15x price jump. I ran 500 real-world tasks on both and measured quality. The value proposition is... complicated.
LLaMA leaked. Here's what Meta's model weights actually look like.
Meta's LLaMA was supposed to be research-only. It leaked within a week. Now everyone can benchmark it. I ran LLaMA-13B against GPT-3.5 on 5 tasks. The results are closer than Meta probably wanted.
I counted every AI model released this quarter. Here's what I found.
Q4 2022 had 31 notable model releases. Q1 2023 is on pace for 58. The acceleration is real, and it's not just one company driving it. I categorized every single one.
The LLM pricing war just started. Here's every provider's cost per token.
OpenAI, Anthropic, Cohere, AI21 Labs, and Google all have LLM APIs now. I made a comparison table of every pricing tier. The spread is 47x between the cheapest and most expensive option.
GPT-4 benchmark scores are insane. But let me show you the fine print.
Everyone is sharing GPT-4's bar exam score. Almost nobody is talking about the benchmarks where it barely beats GPT-3.5. I broke down all 23 benchmarks in the technical report. The picture is more mixed than the headlines suggest.
2022 (18 articles)
My 2022 prediction scorecard: how wrong was I?
In January I made 10 predictions about AI in 2022. I got 4 right, 3 half-right, and 3 completely wrong. The biggest miss? I didn't predict ChatGPT would exist.
2022 in AI data: the year everything accelerated
From DALL-E 2 to ChatGPT, 2022 was the year AI left the research lab. I compiled 15 charts that tell the story. The most important number? ChatGPT's 1M users in 5 days vs GPT-3's 300K waitlist after 6 months.
ChatGPT vs GPT-3: same model family, wildly different results. The data.
ChatGPT is based on GPT-3.5, but it behaves nothing like the raw API. I ran 200 identical prompts on both. ChatGPT refuses 23% of prompts that GPT-3 answers happily. RLHF changed more than people think.
ChatGPT hit 1 million users in 5 days. Here's the growth data in context.
I compared ChatGPT's user growth curve to Instagram, TikTok, Spotify, and Netflix. Nothing comes close. ChatGPT's first week makes every other consumer tech launch look slow.
Wait, Stable Diffusion has HOW many forks? The open source explosion in numbers.
Three months after Stable Diffusion's release, I counted 847 forks and derivative projects on GitHub. The rate of open source AI proliferation is unlike anything I've seen in tech.
I ran GPT-3 on the same 50 questions every month for a year. Here's the drift.
Model outputs aren't static. I asked GPT-3 the same 50 factual questions monthly for 12 months. 17 answers changed. Some got better. Some got worse. 'Model drift' is real and measurable.
Anthropic just raised $580M. Let's talk about the AI safety funding numbers.
I compiled every dollar raised by AI safety organizations in 2022. The total is $1.9 billion. But 87% went to just two companies. The distribution is incredibly top-heavy.
GitHub Copilot: 6 months of usage data from my own coding
I logged every Copilot suggestion for 6 months and accepted 34.2% of them. The acceptance rate varies wildly by language: 52% for Python, 18% for Rust.
The Chinchilla scaling laws changed everything. Let me show you why.
DeepMind's Chinchilla paper says most large models are undertrained. I ran the numbers: if Chinchilla's scaling laws are right, GPT-3 should have used 4.6x more training data. The implications are huge.
Every model released in 2022 so far, in one table
47 notable models in 9 months. I put them all in a table with release date, parameters, training data size, and whether they're open or closed. The pattern is hard to miss.
Stable Diffusion is free. The pricing math of open source image generation.
Stability AI released Stable Diffusion and suddenly image generation costs dropped from ~$0.02/image (DALL-E 2) to essentially free if you have a GPU. I calculated the break-even point.
Midjourney v3 vs DALL-E 2: 100 prompts, head to head
Same 100 prompts, two models, blind rating by 5 people. Midjourney wins on 'aesthetic feel' 64% of the time. DALL-E 2 wins on 'prompt accuracy' 71% of the time. The data is fascinating.
Open source AI is having a moment. Here are the download numbers.
BLOOM just launched. GPT-NeoX is out. I pulled download stats from Hugging Face for every open source LLM. The adoption curves are starting to look serious.
InstructGPT and RLHF: what the training data tells us
OpenAI's InstructGPT paper has fascinating details about the human labeler workforce. 40 contractors, 5 steps, and the data quality metrics that made RLHF work.
I tracked AI image generation quality over 6 months. The improvement rate is scary.
I've been generating the same 50 prompts on each new model as it releases. The quality jump from January to April 2022 is the steepest improvement curve I've ever plotted.
Google's PaLM has 540 billion parameters. Let me put that number in context.
Every time a new model drops, the parameter count gets bigger and the context gets lost. I made a chart showing every major model's parameter count since 2018. PaLM is... a lot.
The cost of running an AI startup in 2022: a data breakdown
I surveyed 23 AI startup founders about their cloud compute bills. The median monthly GPU spend is $14,000. One is paying $200,000/month. The variance is absurd.
DALL-E 2 is out. I ran 200 prompts and measured the results.
I generated 200 images across 10 categories and rated coherence, prompt adherence, and artifact frequency. DALL-E 2 is good, but 'good' means different things for different prompt types.
2021 (12 articles)
My 2021 AI data roundup: the 10 numbers that mattered most
From GPT-3's pricing to GPU shortages to the rise of the Hugging Face model zoo. These are the 10 data points from 2021 that I think will matter most looking back.
AI research papers published in 2021: a mid-year count
I counted arXiv submissions with "artificial intelligence", "machine learning", or "deep learning" in the title. 2021 is on pace to smash 2020's record by 34%.
Hugging Face just hit 10,000 models. Here's what the model zoo looks like.
I scraped the Hugging Face model hub and categorized all 10,000+ models by type, language, and download count. Text generation is only 8% of the total. The real king is NER.
The training cost curve is doing something weird
I plotted the estimated training costs of every major model from 2018 to 2021. The curve isn't going up linearly. It's doing something much weirder, and the inflection point was GPT-3.
5 charts that explain why GPU prices went insane in 2021
I tracked GPU prices across eBay, Newegg, and Amazon for six months. The RTX 3090 hit 3x MSRP in February. Here's the full timeline with data.
GPT-3 vs GPT-J: the first real open source challenger, in data
EleutherAI released GPT-J-6B and I benchmarked it against the comparably sized GPT-3 model. For a free model, the numbers are surprisingly close on some tasks.
Codex and the cost of code generation: my first pricing analysis
OpenAI's Codex is in private beta and I got access. I ran 500 code generation requests and tracked the token costs. Generating a Python function costs about $0.003 on average.
I counted every AI startup that raised money in Q1 2021. The numbers are strange.
127 AI startups raised funding in Q1 2021. I categorized all of them. The "generative AI" category barely exists yet. Most money is still going to enterprise ML tools.
DALL-E's first images vs what people expected: a data comparison
OpenAI's DALL-E paper dropped in January and I've been collecting reaction data. The gap between what researchers expected and what it actually produces is measurable.
The GPT-3 API waitlist is 6 months long. Here's what the early data looks like.
I've been tracking GPT-3 API access reports since launch. The waitlist data tells a story about who OpenAI is letting in first, and it's not random.
Every AI benchmark from 2020, ranked by how much they actually tell you
I went through 14 major benchmarks used in 2020 AI papers. Some are genuinely useful. Some are theater. Here's my ranking with the data to back it up.
Wait, GPT-3 costs HOW much per token?
I spent a weekend calculating the actual per-word cost of GPT-3's different engines. The price difference between Davinci and Ada is wild, and most people are using the wrong one.