Three Companies Now Control 90% of Frontier Inference
I counted every frontier model API available today. OpenAI, Anthropic, and Google serve roughly 90% of all production frontier inference. The concentration numbers are wild.
AI through the lens of data. Benchmarks, pricing trends, model comparisons. Let me show you something interesting in the numbers.
40+ models in 90 days. The pace is absurd. Here's the full count, month by month.
AI inference costs dropped 100x in 3 years. I put it all in one table and the trend line is almost vertical.
Five years of counting models, tracking prices, and benchmarking everything I can get my hands on. The three things I got right, the five things I got wrong, and the one trend I still can't explain. This is the most personal article I've written.
I've been tracking AI API prices since 2021. Today I'm publishing the full dataset: 89 price points across 12 providers over 5 years. The average cost per million tokens fell from $60 to $0.15. A 400x reduction. The chart tells the whole story.
Anthropic shipped a 1 million token context window on their flagship model. I tested retrieval at 100K, 250K, 500K, and 1M tokens. Accuracy stays above 90% up to 500K. At 1M it drops to 78%, but that's still usable. The long-context game has a new leader.
Monthly tracker updated. Claude Opus 4.5 still leads coding. Gemini 2.5 Ultra leads multimodal. o3 leads hard math. DeepSeek R2 leads cost-efficiency. New benchmark added: GPQA Diamond (graduate-level science questions). Full table inside.
$18.2 billion in Q1 2026. I broke it down: 41% went to infrastructure (chips, cloud), 28% to application-layer companies, 19% to model providers, 12% to tooling. The big shift: application funding overtook model funding for the first time.
The "fast and cheap" tier is where the real competition is. I compared the three on 200 tasks optimizing for speed and cost, not peak quality. Gemini Flash wins on price. o4-mini wins on coding. Claude Sonnet wins on general quality.
I scraped every MCP server registry I could find. 4,127 servers, 28,000+ tools. The most popular category is "file system" tools. The fastest growing is "database" tools. I charted the growth curve since Anthropic launched the protocol.
Google finally released Ultra-tier Gemini 2.5. I compared it against Claude Opus 4.5, GPT-4o, and DeepSeek R2 across 300 prompts. Gemini Ultra wins on multimodal tasks and long context. Claude wins on coding. The frontier is genuinely multi-polar now.
Updated my developer survey with 400 respondents. Claude Code jumped to 22% usage share. Cursor held at 31%. Copilot dropped to 24%. The fastest growing? Windsurf at 8%, up from 2% six months ago.
My most thorough inference provider comparison yet. 25 providers, 60 days of monitoring, 3 metrics. Cerebras leads on speed. Together AI leads on open source model selection. Anthropic leads on reliability. Full rankings and methodology inside.
Anthropic's newest model. I ran 300 prompts across coding, reasoning, writing, and analysis. Coding scores are the highest I've measured from any model. Reasoning matches o3 with thinking enabled. The gap between Sonnet and Opus has widened again.
14 price changes from 9 providers in the last quarter. The big story: Google dropped Gemini 2.5 Flash to $0.05/M input tokens. That's essentially free. Updated master comparison table inside.
DeepSeek R2 matches o3 on math benchmarks at 1/20th the inference cost. I ran my standard 200-problem reasoning evaluation. R2 scores 91.2% on MATH vs o3's 93.7%. At $0.14 vs $2.80 per hard problem, the economics aren't even close.
MMLU is saturated. HumanEval is gamed. SWE-bench has contamination issues. I reviewed 20 active benchmarks and rated each on reliability, relevance, and resistance to gaming. Only 4 scored above 7/10. Chatbot Arena is still the gold standard.
I predicted open source would match GPT-4 by mid-2025. It happened by Q1. I predicted API prices would fall 50%. They fell 90%. My biggest miss: I underestimated how fast reasoning models would improve. Full scorecard inside.
Model sizes stopped growing. Training costs dropped 80%. Open source reached parity. Reasoning models showed that how you think matters more than how much you know. I compiled 30 charts telling the story of 2025.
AMD MI325X, Intel Gaudi 3, Google TPU v6, Amazon Trainium 2, and 5 startup chips. I compiled benchmark data where available. NVIDIA still leads, but the gap is 30%, not 300%. The moat is eroding.
I built a complete timeline of AI API pricing from GPT-3 beta in 2020 to today. 47 price points across 5 years. The cost curve looks like a waterfall. Quality went up 10x while prices fell 100x. I've never seen anything like it in any industry.
My most thorough three-way comparison yet. 500 prompts, 8 categories, 3 human raters. Claude wins coding and analysis. GPT-4o wins speed and multimodal. Gemini wins long context and cost. There's no single best model anymore.
I surveyed 50 companies deploying LLMs in production. 62% use models under 13B parameters. The most popular: Llama 3.2 3B (18%), Phi-4 (14%), and Mistral 7B (12%). Small models aren't just for research anymore.
Hugging Face deprecated the Open LLM Leaderboard v1 and launched v2 with new benchmarks. I compared scores on both versions for 20 models. Some models dropped 15 points. The re-ranking is dramatic and some "top models" were just benchmark-optimized.
I compiled power consumption data for AI training and inference from every source I could find. A single GPT-4 query uses about 10x the energy of a Google search. At current growth rates, AI could consume 3% of US electricity by 2028.
The B200 delivers 2.5x the inference throughput of the H100 at roughly the same power consumption. I compared the per-token cost on B200 vs H100 vs H200. If you're running inference at scale, the upgrade pays for itself in 4 months.
Monthly update to my running comparison of 15 models across 8 benchmarks. Big movers: Gemini 2.5 Pro gained 8 points on MMLU-Pro. Claude Opus 4 still leads on HumanEval. New entrant: Mistral Large 3.
OpenAI's o3 uses up to 10x the tokens of a standard model to "think." On hard math problems, a single o3 query can cost $2. I measured the token consumption across 100 problems and the variance is massive: 500 tokens to 50,000.
I surveyed 30 AI startups about their monthly costs. Median API spend: $8,400. Median total infra: $23,000. But the distribution is bimodal. Some spend $500/month with open source. Some spend $200,000 on API calls alone.
Meta kept the size but changed the architecture. Llama 4 405B uses MoE, so only ~100B parameters are active. I benchmarked both on 10 tasks. Llama 4 is faster and scores 8-12% higher on coding. Training quality over brute force.
In 2024, context windows doubled every 3 months. In 2025, they've barely changed. 1M tokens from Google. 200K from Anthropic. The reason? Most real-world tasks don't need more than 50K tokens. I have the usage data.
Some providers route inference through different regions. I measured latency and calculated effective costs from 5 countries. Running Claude from Japan costs the same as the US. Running a self-hosted model in India costs 30% less. The global pricing map is uneven.
I fed 50 real-world charts, tables, and diagrams to 8 multimodal models. Claude Opus 4 reads charts the most accurately at 89%. GPT-4o is at 82%. Gemini 2.5 Pro is at 85%. Most models struggle with handwritten text in images.
Raw benchmark scores ignore cost. I calculated "cost per correct answer" across 500 questions for 10 models. The cheapest correct answer comes from Gemini 2.5 Flash at $0.0003. The most expensive is GPT-4.5 at $0.14. A 467x difference.
I used all three on the same 20 real coding tasks. Claude Code completed 17. Cursor completed 15. Copilot Workspace completed 11. But completion rate isn't the whole story. I also tracked "time to working code" and "bugs introduced."
I've been counting notable model releases since Q1 2021. The quarterly total went from 8 to 67 to... 54 in Q2 2025. The first decline. I think we've hit peak model release rate. The era of consolidation begins.
H100 GPU resale prices dropped 40% from their January peak. I tracked listings on 4 broker sites. The DeepSeek efficiency shock plus H200/B200 availability is creating a glut. Good news for startups.
For the first time, the top 5 models on Chatbot Arena are within 20 Elo points of each other. Claude Opus 4, GPT-4o, Gemini 2.5 Pro, Grok 3, and DeepSeek V3. I analyzed what "virtually tied" means for model selection.
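For a sense of scale, the standard Elo formula turns a rating gap into an expected head-to-head win rate. This is a minimal sketch assuming the usual 400-point logistic scale. The function name is illustrative, and Chatbot Arena's own rating method may differ in its details.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# A 20-point gap, the spread quoted for the top 5 models.
print(f"{expected_win_rate(20):.3f}")
```

Under that assumption, a 20-point lead is roughly a 53% expected win rate, not far from a coin flip.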
Six months of continuous monitoring across 15 API providers. Anthropic: 99.7% uptime. OpenAI: 99.3%. Google: 99.1%. The outage patterns are interesting. Mondays and Thursdays are the worst days. I have theories about why.
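Uptime percentages are easier to compare as hours of downtime. This is a minimal sketch assuming a 182-day window for "six months". The window length and the function name are assumptions, not part of the monitoring data.

```python
def downtime_hours(uptime_pct: float, days: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1.0 - uptime_pct / 100.0) * days * 24.0


# Uptime figures quoted above; 182 days is an assumed window.
for provider, uptime in [("Anthropic", 99.7), ("OpenAI", 99.3), ("Google", 99.1)]:
    print(provider, round(downtime_hours(uptime, 182), 1))
```

Over that window the three figures work out to roughly 13, 31, and 39 hours of downtime.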
Local inference has gotten shockingly good. I tested 10 runtimes on the same hardware (M3 Max, 64GB). Ollama wins on ease of use. llama.cpp wins on raw speed. The performance gap between local and cloud is narrowing.
Meta, Alibaba, Mistral, DeepSeek, and 12 others are all releasing open-weight models. I ranked them by Chatbot Arena Elo, Hugging Face downloads, and community adoption. Llama still leads downloads, but Qwen is closing fast.
1 million daily users, 5 messages each, average 300 tokens per response. At Claude 4 Sonnet pricing, that's $4,500/day. At GPT-4o mini, it's $300/day. I modeled the economics for 6 different model tiers.
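The arithmetic behind a figure like this fits in one function. This is a minimal sketch that counts response tokens only, as the scenario above does. The $3.00 per million tokens passed in below is an assumed blended price, chosen because it reproduces the $4,500/day figure. It is not a quoted rate.

```python
def daily_cost_usd(users: int, messages_per_user: int,
                   tokens_per_response: int,
                   price_per_million_tokens: float) -> float:
    """Daily spend on response tokens for a chatbot at a flat per-token price."""
    tokens = users * messages_per_user * tokens_per_response
    return tokens / 1_000_000 * price_per_million_tokens


# 1M users x 5 messages x 300 tokens = 1.5B tokens per day.
print(daily_cost_usd(1_000_000, 5, 300, 3.00))  # assumed price, see above
```

Swapping in a different price per million tokens gives the cost for another tier. Adding input tokens would need a second price term.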
SWE-bench Verified filters out the easy problems. I compared scores on full SWE-bench vs Verified for 12 models. Some models drop 20+ points. The gap reveals who's gaming the benchmark vs who's actually good at coding.
The biggest model released in 2025 so far has fewer parameters than GPT-4. Efficiency gains from MoE, distillation, and better training data mean the era of "bigger is better" is fading. I charted the trend.
The mid-tier is where most developers actually work. I compared the three most popular "not-the-flagship" models on real-world tasks: summarization, extraction, classification, and code generation. Claude 4 Sonnet wins 3 of 4.
I expanded my monthly monitoring to 20 providers. The new additions: Cerebras, Fireworks, Baseten, Modal, and Replicate. Cerebras leads on latency. Fireworks leads on cost efficiency. Updated rankings inside.
I tested 15 models for memorization of MMLU questions. 4 of them could complete benchmark questions from the first few words alone. Contamination isn't just theoretical anymore. I can measure it.
I built the same 5 agent tasks on each framework and measured completion rates, token usage, and time to complete. LangChain is the most flexible. CrewAI finishes fastest. AutoGen uses the fewest tokens. No clear winner.
Qwen3, DeepSeek V3, Yi-Lightning, Baichuan 4, and MiniMax-01. I benchmarked all five against Claude 3.7 Sonnet and GPT-4o. Chinese models now occupy 3 of the top 10 spots on Chatbot Arena. The geographic distribution of AI talent is shifting.
Anthropic's new flagship model. Extended thinking, tool use, and code generation all feel meaningfully better. I ran my standard 300-prompt evaluation. Early data: it's the best model I've tested on coding tasks. Full analysis next week.
In March 2023, GPT-4 cost $60 per million output tokens. Today, GPT-4o mini costs $0.60. Same-class quality, 100x cheaper. I made one chart. That's the whole article. Sometimes the data speaks for itself.
For the first time, a Google model sits at the top of the LMSYS leaderboard. I analyzed the vote patterns. Gemini 2.5 Pro dominates in coding and math. Claude still leads in creative tasks. The throne is now contestable.
Meta went mixture-of-experts with Llama 4. Scout is 17B active parameters from 109B total. Maverick is 17B from 400B. I benchmarked both against Llama 3.1 70B. The efficiency gains are exactly what the MoE math predicts.
o1, o3-mini, DeepSeek R1, Claude 3.7 Sonnet (thinking), Gemini 2.5 Pro, Grok 3, QwQ-32B, and Phi-4. Same 100 MATH problems. Same evaluation criteria. The spread is tighter than you'd expect from the marketing.
I surveyed 200 developers about their AI coding tools. Cursor has 34% usage share. Copilot dropped to 28%. Claude Code is at 12% and rising fast. The "winner take all" era is over.
Anthropic's Model Context Protocol is becoming the standard for AI tool use. I surveyed 30 MCP server implementations and counted the tools each provides. The median is 7 tools. The maximum is 94. More isn't always better.
Claude 3.7 Sonnet can toggle extended thinking on and off. I tested it in both modes across 200 prompts. With thinking on, it matches o1 on MATH. With thinking off, it's still the best general-purpose model on Chatbot Arena.
$75 per million input tokens. That's 500x more than GPT-4o mini. I ran GPT-4.5 through my evaluation suite. It's good. Really good. But at this price, it only makes economic sense for a very narrow set of tasks.
I counted 142 models released on Hugging Face in February 2025 alone. That's 5 per day. Downloads are up but download-per-model is falling. The attention pie is finite. I think a shakeout is coming.
Google added extended thinking to Gemini. I tested it against o1-preview and DeepSeek R1 on math and coding problems. Gemini 2.5 Pro wins on 4 of 6 benchmarks. Google is back in the reasoning race.
xAI built a 100K GPU cluster in Memphis. Grok 3 is the first model trained on it. The benchmarks are competitive with Claude 3.5 Sonnet and GPT-4o. I ran my standard evaluation. It's good, but the interesting story is the infrastructure bet.
AI agents sound cheap per token. But they loop. A lot. I measured the total token consumption for 50 real agent tasks across Claude, GPT-4o, and Gemini. The average task used 47K tokens. Some hit 200K+.
Six months at the top of the LMSYS leaderboard. I pulled the vote data and looked at the categories where Claude 3.5 Sonnet wins most decisively: coding (Elo 1290), creative writing (1285), and instruction following (1280).
Seven providers changed prices in January alone. Anthropic dropped Claude 3.5 Haiku's price. Google cut Gemini Flash. I updated the master table. The cheapest frontier-class model is now $0.10 per million input tokens.
When DeepSeek showed you could train a frontier model for $5.6M, NVIDIA lost $589 billion in market cap in a single day. I charted the stock movements of every major AI company. The repricing of "compute moats" was instant.
DeepSeek R1 matches o1 on math and coding benchmarks at a fraction of the cost. And they released the weights. I compared R1 against o1-preview on 200 reasoning problems. The scores are within 2 points on MATH and GPQA.
DeepSeek V3 reportedly cost $5.6M to train. GPT-4 allegedly cost $100M+. I dug into the technical report and the training efficiency numbers. If these costs are real, the frontier just got a lot more accessible.
I didn't predict reasoning models at all. I thought scale would keep winning. Instead, o1 showed that inference-time compute is a whole new axis. My biggest hit? Predicting open source would reach GPT-4 level by year end.
API prices fell 90%. Open source matched GPT-4. Reasoning models appeared. AI coding assistants went mainstream. I compiled 25 charts that tell the story of 2024's wild ride.
Gemini 2.0 Flash matches GPT-4o on most of my tests while being 3x faster and significantly cheaper. I ran my standard evaluation across 300 prompts. Google finally has a model that's both fast and good.
I tracked every notable model release in Q4 2024. Sixty-seven models from 23 organizations. That's about five models a week. The pace is unsustainable and I suspect a consolidation is coming.
I monitored 15 AI API providers for 30 days straight, logging latency, error rates, and uptime. The results are a mess. Anthropic has the best uptime. Groq has the best speed. Nobody has both.
Anthropic updated Claude 3.5 Sonnet and added computer use. I tested both the model improvements and the computer use capability. Model quality jumped noticeably. Computer use works about 60% of the time in my tests.
Reasoning models use more tokens to think. But if they get the answer right more often, the cost per CORRECT answer might actually be lower. I ran the math on 500 coding problems. The results surprised me.
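The metric itself is simple: cost per attempt divided by accuracy. This is a minimal sketch, and every number in the example is hypothetical, picked only to show how a model that uses more tokens can still come out cheaper per correct answer.

```python
def cost_per_correct(tokens_per_answer: float,
                     price_per_million_tokens: float,
                     accuracy: float) -> float:
    """Expected spend to obtain one correct answer."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_attempt = tokens_per_answer / 1_000_000 * price_per_million_tokens
    return cost_per_attempt / accuracy


# Hypothetical inputs: same price, 4x the tokens, much higher accuracy.
standard = cost_per_correct(1_000, 10.0, 0.20)
reasoning = cost_per_correct(4_000, 10.0, 0.90)
print(f"standard:  ${standard:.4f}")
print(f"reasoning: ${reasoning:.4f}")
```

With these made-up inputs the reasoning model costs four times as much per attempt but less per correct answer, because far fewer attempts are wasted.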
Alibaba's Qwen 2.5 72B beats Llama 3.1 70B on my tests. It's also the best model for CJK languages by a wide margin. I benchmarked it in English, Chinese, and Japanese. The English results alone deserve attention.
OpenAI's o1 trades speed for accuracy by 'thinking' before answering. The math and coding benchmarks are way up, but the costs are 6x higher per task. I broke down the cost-per-correct-answer metric and it's actually competitive.
Every new model touts its SWE-bench score. I analyzed the test cases and found 23% of them can be 'solved' by a simple regex patch. The benchmark isn't wrong exactly, but it's not measuring what you think.
Updated master pricing table with 34 models from 9 providers. The cheapest useful model is now Gemini 1.5 Flash at $0.075/M input tokens. Three years ago that would've cost $60. I charted the deflation.
Running 405B parameters needs serious hardware. I priced out 4 configurations: AWS, Lambda Labs, self-hosted with 8xA100s, and 8xH100s. The monthly costs range from $4,200 to $31,000 depending on utilization.
Meta released a 405 billion parameter model under an open license. I ran it on 10 standard benchmarks and 5 of my own. It matches GPT-4 within margin of error on 7 of 15. This is a milestone.
GPT-4o mini costs 100x less than GPT-4 did at launch. I plotted the price per million tokens for OpenAI's best available model at each point in time. The curve is a cliff.
The mid-tier model just beat the flagship. I ran Claude 3.5 Sonnet through every test I used for Opus, and it wins on 71% of them. At $3/M tokens vs $15, the value math is absurd.
GitHub Copilot, Cursor, Cody, Continue, Tabnine, and 7 others. I used each one for a full week and tracked acceptance rates, bug rates, and time saved. Cursor surprised me. Copilot disappointed me.
Google says 1 million tokens. That's approximately 1,500 pages. I fed it actual long documents and tested retrieval at various depths. Performance degrades gracefully until about 800K, then falls off a cliff.
In January 2023, the Elo gap between the best open source model and GPT-4 was 200+ points. It's now about 50. I charted the convergence curve. At this rate, parity arrives in Q3 2024.
I asked 50 AI developers how they evaluate models. 73% said 'I just try it and see how it feels.' Only 12% run formal benchmarks. The industry is moving from data-driven evaluation to... vibes. I have mixed feelings about this.
OpenAI released GPT-4o at half the price of GPT-4 Turbo, with vision and audio included. I calculated the per-task costs across text, image, and audio. The audio pricing is suspiciously cheap.
Microsoft's Phi-3 Mini has 3.8 billion parameters and beats Llama 3 8B on several benchmarks. I ran it locally on a MacBook M2. The small model revolution is accelerating faster than the big model one.
A model 9x smaller is now better. I benchmarked Llama 3 8B against Llama 2 70B on 6 tasks. The small model wins on 4 of them. Training data quality is eating model size for breakfast.
I compiled specs and benchmarks for every AI accelerator announced in the last 12 months. NVIDIA H100, AMD MI300X, Google TPU v5e, Groq LPU, Intel Gaudi 3, and 8 others. The competition is finally real.
Haiku at $0.25/M tokens, Sonnet at $3, Opus at $15. Anthropic isn't just pricing models, they're pricing use cases. I compared the price-to-quality ratio across all three and the tiering makes perfect economic sense.
Claude 3 Opus matched or beat GPT-4 on most benchmarks, but the 'needle in a haystack' test is what got me. It detected that it was being tested. I ran my own version and the results are strange.
Mistral finally has a frontier model. I ran all three through my standard 300-prompt evaluation. Mistral Large is competitive but not quite there. The interesting part is where it wins: European languages.
Groq's custom chip hit 800 tokens/second on Mixtral 8x7B. I measured latency across 100 requests and compared to 5 other inference providers. Groq is 18x faster than the median. Speed changes what's possible.
I logged every API price change since January 2023. There have been 23 price drops across 8 providers. The average price of a million output tokens fell 78%. I've never seen deflation this fast in tech.
I set up Mixtral on a single A100 and benchmarked throughput. At 95 tokens/second, the cost per million tokens is $0.18. The OpenAI API charges $0.50. Open source inference is now genuinely cheaper.
I predicted open source would stay 2 years behind closed source. I was wrong by a lot. Llama 2 closed the gap in months. Here's my full scorecard for 2023.
GPT-4, Llama 2, Mistral, Claude 2, SDXL, and ChatGPT hitting 100M users. I compiled 20 charts that tell the story of 2023. This was the year AI stopped being a niche interest.
Mistral dropped Mixtral via a magnet link (no paper, no blog post, just a torrent). The benchmarks leaked within hours. A mixture-of-experts model at GPT-3.5 quality with 12B active parameters? The inference cost math is wild.
Google claims Gemini Ultra beats GPT-4 on 30 of 32 benchmarks. I dug into the methodology. They're comparing against launch-day GPT-4, not the current version. And some of the benchmark configurations are... creative.
I found evidence that at least 6 models on the Hugging Face leaderboard were trained on benchmark test data. When your test set is in the training data, your scores are meaningless. I built a simple check for this.
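One common way to test for this is word n-gram overlap between a benchmark item and a sample of training text. The sketch below illustrates that general idea and is not necessarily the check described above. The function names, the example strings, and the choice of n are all assumptions.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(test_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training text."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Made-up strings: the "corpus" contains the question verbatim.
question = "what is the capital of france and when was it founded"
corpus = "trivia dump: what is the capital of france and when was it founded answer paris"
print(overlap_ratio(question, corpus, n=5))
```

A ratio near 1.0 means the item appears almost verbatim in the sampled text. Paraphrased leaks would slip past a check this simple.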
OpenAI just slashed GPT-4 prices by 3x with GPT-4 Turbo. I updated my master pricing comparison table. The gap between open source and closed source API costs is narrowing fast.
I surveyed 40 AI companies about GPU access. 78% reported 'severe constraints.' But cloud provider utilization data tells a slightly different story. Some companies have more H100s than they're admitting.
People keep asking how I stay on top of all these model releases. Here's my actual system: RSS feeds, arXiv alerts, a spreadsheet with 312 rows, and a Python script that checks Hugging Face daily.
In January 2023, 4K tokens was standard. By October, we've got 100K (Claude), 32K (GPT-4), and 128K (Anthropic internal). I charted the context window growth curve. It's exponential.
Claude 2's 100K context window is its killer feature. I tested it with documents of 10K, 25K, 50K, and 100K tokens. Retrieval accuracy drops from 97% to 71% as length increases, but that's still way better than chunking strategies.
A 7B parameter model outperforming a 13B model shouldn't be possible under simple scaling laws. But Mistral did it. I compared the benchmarks and the architecture differences that explain how.
LMSYS's crowdsourced Elo ratings are based on 200K+ human votes of blind model comparisons. I analyzed the vote distributions and demographic patterns. It's noisy, but it's the closest thing to 'what real users think.'
Meta says Llama 2 70B used 1.7M GPU hours of A100 time. At current cloud prices, that's roughly $5.4M. But Meta used their own hardware. I estimated the real cost and it's probably 60% less.
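The figures above imply an hourly rate and an adjusted total, both of which are quick to back out. This is a minimal sketch using only the numbers quoted in the paragraph. The variable names are illustrative.

```python
GPU_HOURS = 1_700_000           # A100 hours quoted for Llama 2 70B
CLOUD_ESTIMATE_USD = 5_400_000  # cloud-price estimate quoted above
OWN_HARDWARE_DISCOUNT = 0.60    # "probably 60% less"

implied_rate = CLOUD_ESTIMATE_USD / GPU_HOURS
own_hardware_estimate = CLOUD_ESTIMATE_USD * (1 - OWN_HARDWARE_DISCOUNT)

print(f"${implied_rate:.2f} per GPU hour")
print(f"${own_hardware_estimate:,.0f}")
```

That works out to about $3.18 per A100 hour at cloud prices, and roughly $2.2M once the 60% discount is applied.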
Open source models went from curiosity to contender in 18 months. I made 11 charts tracking downloads, benchmark scores, funding, and community growth. The trend line is unmistakable.
Can you actually save money running Llama 2 yourself instead of using the OpenAI API? I calculated it. The answer depends on your volume, but the break-even point is lower than I expected.
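A break-even calculation of this kind comes down to dividing the fixed monthly cost by the per-token saving. This is a minimal sketch under a simplified cost model with a flat monthly hosting bill and flat per-token prices. The function name and all example figures are hypothetical, not the numbers from the analysis.

```python
def break_even_tokens_per_month(fixed_monthly_usd: float,
                                api_price_per_million: float,
                                self_hosted_price_per_million: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    saving_per_million = api_price_per_million - self_hosted_price_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never breaks even at these prices")
    return fixed_monthly_usd / saving_per_million * 1_000_000


# Hypothetical: $2,000/month server vs. an API at $2.00 per million tokens.
print(f"{break_even_tokens_per_month(2_000, 2.00):,.0f} tokens/month")
```

With these placeholder inputs the break-even is one billion tokens a month. Below that volume the API is cheaper, and above it the server is.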
Meta released Llama 2 with a commercial license. I benchmarked the 70B model against GPT-3.5-turbo on 8 tasks. Llama 2 70B matches or beats GPT-3.5 on 5 of them. Open source just got real.
$12.4 billion in AI startup funding in Q1 2023 alone. That's more than all of 2020. I broke it down by category, stage, and geography. Generative AI is 73% of the total.
Every open source model now optimizes for the Hugging Face leaderboard. I checked: 12 of the top 20 models were specifically fine-tuned on leaderboard benchmark data. Goodhart's Law is hitting AI benchmarks hard.
Anthropic's Claude is in beta and I got access. I ran both models through 300 prompts across coding, writing, and reasoning. Claude wins on length and nuance. GPT-4 wins on accuracy. The data is tight.
GPT-4 costs $0.03/1K input tokens vs $0.002 for GPT-3.5-turbo. That's a 15x price jump. I ran 500 real-world tasks on both and measured quality. The value proposition is... complicated.
Meta's LLaMA was supposed to be research-only. It leaked within a week. Now everyone can benchmark it. I ran LLaMA-13B against GPT-3.5 on 5 tasks. The results are closer than Meta probably wanted.
Q4 2022 had 31 notable model releases. Q1 2023 is on pace for 58. The acceleration is real, and it's not just one company driving it. I categorized every single one.
OpenAI, Anthropic, Cohere, AI21 Labs, and Google all have LLM APIs now. I made a comparison table of every pricing tier. The spread is 47x between the cheapest and most expensive option.
Everyone is sharing GPT-4's bar exam score. Almost nobody is talking about the benchmarks where it barely beats GPT-3.5. I broke down all 23 benchmarks in the technical report. The picture is more mixed than the headlines suggest.
In January I made 10 predictions about AI in 2022. I got 4 right, 3 half-right, and 3 completely wrong. The biggest miss? I didn't predict ChatGPT would exist.
From DALL-E 2 to ChatGPT, 2022 was the year AI left the research lab. I compiled 15 charts that tell the story. The most important number? ChatGPT's 1M users in 5 days vs GPT-3's 300K waitlist after 6 months.
ChatGPT is based on GPT-3.5, but it behaves nothing like the raw API. I ran 200 identical prompts on both. ChatGPT refuses 23% of prompts that GPT-3 answers happily. RLHF changed more than people think.
I compared ChatGPT's user growth curve to Instagram, TikTok, Spotify, and Netflix. Nothing comes close. ChatGPT's first week makes every other consumer tech launch look slow.
Three months after Stable Diffusion's release, I counted 847 forks and derivative projects on GitHub. The rate of open source AI proliferation is unlike anything I've seen in tech.
Model outputs aren't static. I asked GPT-3 the same 50 factual questions monthly for 12 months. 17 answers changed. Some got better. Some got worse. 'Model drift' is real and measurable.
I compiled every dollar raised by AI safety organizations in 2022. The total is $1.9 billion. But 87% went to just two companies. The distribution is incredibly top-heavy.
I logged every Copilot suggestion for 6 months. Accepted 34.2% of them. The acceptance rate varies wildly by language: 52% for Python, 18% for Rust.
DeepMind's Chinchilla paper says most large models are undertrained. I ran the numbers: if Chinchilla's scaling laws are right, GPT-3 should have used 4.6x more training data. The implications are huge.
47 notable models in 9 months. I put them all in a table with release date, parameters, training data size, and whether they're open or closed. The pattern is hard to miss.
Stability AI released Stable Diffusion and suddenly image generation costs dropped from ~$0.02/image (DALL-E 2) to essentially free if you have a GPU. I calculated the break-even point.
Same 100 prompts, two models, blind rating by 5 people. Midjourney wins on 'aesthetic feel' 64% of the time. DALL-E 2 wins on 'prompt accuracy' 71% of the time. The data is fascinating.
BLOOM just launched. GPT-NeoX is out. I pulled download stats from Hugging Face for every open source LLM. The adoption curves are starting to look serious.
OpenAI's InstructGPT paper has fascinating details about the human labeler workforce. 40 contractors, 5 steps, and the data quality metrics that made RLHF work.
I've been generating the same 50 prompts on each new model as it releases. The quality jump from January to April 2022 is the steepest improvement curve I've ever plotted.
Every time a new model drops, the parameter count gets bigger and the context gets lost. I made a chart showing every major model's parameter count since 2018. PaLM is... a lot.
I surveyed 23 AI startup founders about their cloud compute bills. The median monthly GPU spend is $14,000. One is paying $200,000/month. The variance is absurd.
I generated 200 images across 10 categories and rated coherence, prompt adherence, and artifact frequency. DALL-E 2 is good, but 'good' means different things for different prompt types.
From GPT-3's pricing to GPU shortages to the rise of the Hugging Face model zoo. These are the 10 data points from 2021 that I think will matter most looking back.
I counted arXiv submissions with "artificial intelligence", "machine learning", and "deep learning" in the title. 2021 is on pace to smash 2020's record by 34%.
I scraped the Hugging Face model hub and categorized all 10,000+ models by type, language, and download count. Text generation is only 8% of the total. The real king is NER.
I plotted the estimated training costs of every major model from 2018 to 2021. The curve isn't going up linearly. It's doing something much weirder, and the inflection point was GPT-3.
I tracked GPU prices across eBay, Newegg, and Amazon for six months. The RTX 3090 hit 3x MSRP in February. Here's the full timeline with data.
EleutherAI released GPT-J-6B and I benchmarked it against GPT-3's comparable size. For a free model, the numbers are surprisingly close on some tasks.
OpenAI's Codex is in private beta and I got access. I ran 500 code generation requests and tracked the token costs. Generating a Python function costs about $0.003 on average.
127 AI startups raised funding in Q1 2021. I categorized all of them. The "generative AI" category barely exists yet. Most money is still going to enterprise ML tools.
OpenAI's DALL-E paper dropped in January and I've been collecting reaction data. The gap between what researchers expected and what it actually produces is measurable.
I've been tracking GPT-3 API access reports since launch. The waitlist data tells a story about who OpenAI is letting in first, and it's not random.
I went through 14 major benchmarks used in 2020 AI papers. Some are genuinely useful. Some are theater. Here's my ranking with the data to back it up.
I spent a weekend calculating the actual per-word cost of GPT-3's different engines. The price difference between Davinci and Ada is wild, and most people are using the wrong one.
LLM cost calculator, benchmark decoder, and model size visualizer. Built for people who care about the numbers.