dataku

AI through the lens of data. Benchmarks, pricing trends, model comparisons. Let me show you something interesting in the numbers.

Benchmark Analysis · Pricing Watch · Model Comparisons · Data Engineering · Industry Trends

Latest

Industry Trends · May 12, 2026 · 6 min read

Three Companies Now Control 90% of Frontier Inference

I counted every frontier model API available today. OpenAI, Anthropic, and Google serve roughly 90% of all production frontier inference. The concentration numbers are wild.

Data Stories · Apr 22, 2026 · 3 min read

I counted every AI model released in Q1 2026

40+ models in 90 days. The pace is absurd. Here's the full count, month by month.

Pricing Watch · Apr 20, 2026 · 2 min read

The inference cost collapse, in one chart

AI inference costs dropped 100x in 3 years. I put it all in one chart and the trend line is almost vertical.

Data Stories · Apr 14, 2026 · 6 min read

What I've learned tracking AI data for 5 years

Five years of counting models, tracking prices, and benchmarking everything I can get my hands on. The three things I got right, the five things I got wrong, and the one trend I still can't explain. This is the most personal article I've written.

Pricing Watch · Apr 7, 2026 · 4 min read

The AI API price tracker: 5 years of data in one interactive chart

I've been tracking AI API prices since 2021. Today I'm publishing the full dataset: 89 price points across 12 providers over 5 years. The average cost per million tokens fell from $60 to $0.15. A 400x reduction. The chart tells the whole story.

Model Comparisons · Mar 31, 2026 · 5 min read

Claude Opus 4.6 review: the 1M context model

Anthropic shipped a 1 million token context window on their flagship model. I tested retrieval at 100K, 250K, 500K, and 1M tokens. Accuracy stays above 90% up to 500K. At 1M it drops to 78%, but that's still usable. The long-context game has a new leader.

Benchmark Analysis · Mar 24, 2026 · 5 min read

My monthly benchmark dashboard: March 2026 update

Monthly tracker updated. Claude Opus 4.5 still leads coding. Gemini 2.5 Ultra leads multimodal. o3 leads hard math. DeepSeek R2 leads cost-efficiency. New benchmark added: GPQA Diamond (graduate-level science questions). Full table inside.

Industry Trends · Mar 17, 2026 · 4 min read

AI startup funding in Q1 2026: where the money is going

$18.2 billion in Q1 2026. I broke it down: 41% went to infrastructure (chips, cloud), 28% to application-layer companies, 19% to model providers, 12% to tooling. The big shift: application funding overtook model funding for the first time.

Model Comparisons · Mar 10, 2026 · 4 min read

o4-mini vs Claude 4 Sonnet vs Gemini 2.5 Flash: the speed tier showdown

The "fast and cheap" tier is where the real competition is. I compared the three on 200 tasks optimizing for speed and cost, not peak quality. Gemini Flash wins on price. o4-mini wins on coding. Claude Sonnet wins on general quality.

Data Stories · Mar 3, 2026 · 5 min read

The MCP server catalog: 4,000 tools and counting

I scraped every MCP server registry I could find. 4,127 servers, 28,000+ tools. The most popular category is "file system" tools. The fastest growing is "database" tools. I charted the growth curve since Anthropic launched the protocol.

Model Comparisons · Feb 24, 2026 · 4 min read

Gemini 2.5 Ultra: Google's best model vs the field

Google finally released Ultra-tier Gemini 2.5. I compared it against Claude Opus 4.5, GPT-4o, and DeepSeek R2 across 300 prompts. Gemini Ultra wins on multimodal tasks and long context. Claude wins on coding. The frontier is genuinely multi-polar now.

Industry Trends · Feb 17, 2026 · 4 min read

AI coding tools: 2026 market share data

Updated my developer survey with 400 respondents. Claude Code jumped to 22% usage share. Cursor held at 31%. Copilot dropped to 24%. The fastest growing? Windsurf at 8%, up from 2% six months ago.

Data Stories · Feb 10, 2026 · 4 min read

The AI inference market: 25 providers ranked by price, speed, and reliability

My most thorough inference provider comparison yet. 25 providers, 60 days of monitoring, 3 metrics. Cerebras leads on speed. Together AI leads on open source model selection. Anthropic leads on reliability. Full rankings and methodology inside.

Benchmark Analysis · Feb 3, 2026 · 4 min read

Claude Opus 4.5: Anthropic's latest flagship, benchmarked

Anthropic's newest model. I ran 300 prompts across coding, reasoning, writing, and analysis. Coding scores are the highest I've measured from any model. Reasoning matches o3 with thinking enabled. The gap between Sonnet and Opus has widened again.

Pricing Watch · Jan 27, 2026 · 5 min read

Every AI pricing change in Q4 2025, tracked

14 price changes from 9 providers in the last quarter. The big story: Google dropped Gemini 2.5 Flash to $0.05/M input tokens. That's essentially free. Updated master comparison table inside.

Model Comparisons · Jan 20, 2026 · 4 min read

DeepSeek R2: the open source reasoning model that costs pennies

DeepSeek R2 matches o3 on math benchmarks at 1/20th the inference cost. I ran my standard 200-problem reasoning evaluation. R2 scores 91.2% on MATH vs o3's 93.7%. At $0.14 vs $2.80 per hard problem, the economics aren't even close.

Benchmark Analysis · Jan 13, 2026 · 5 min read

The state of AI benchmarks in early 2026: what still works?

MMLU is saturated. HumanEval is gamed. SWE-bench has contamination issues. I reviewed 20 active benchmarks and rated each on reliability, relevance, and resistance to gaming. Only 4 scored above 7/10. Chatbot Arena is still the gold standard.

Data Stories · Jan 6, 2026 · 3 min read

My 2025 prediction scorecard

I predicted open source would match GPT-4 by mid-2025. It happened by Q1. I predicted API prices would fall 50%. They fell 90%. My biggest miss: I underestimated how fast reasoning models would improve. Full scorecard inside.

Data Stories · Dec 15, 2025 · 5 min read

2025 in AI data: the year quality beat scale

Model sizes stopped growing. Training costs dropped 80%. Open source reached parity. Reasoning models showed that how you think matters more than how much you know. I compiled 30 charts telling the story of 2025.

Industry Trends · Nov 17, 2025 · 5 min read

AI hardware beyond NVIDIA: AMD, Intel, and custom silicon in 2025

AMD MI325X, Intel Gaudi 3, Google TPU v6, Amazon Trainium 2, and 5 startup chips. I compiled benchmark data where available. NVIDIA still leads, but the gap is 30%, not 300%. The moat is eroding.

Pricing Watch · Nov 10, 2025 · 5 min read

The price of intelligence: tracking AI API costs since 2020

I built a complete timeline of AI API pricing from GPT-3 beta in 2020 to today. 47 price points across 5 years. The cost curve looks like a waterfall. Quality went up 10x while prices fell 100x. I've never seen anything like it in any industry.

Model Comparisons · Nov 3, 2025 · 5 min read

Claude Opus 4 vs GPT-4o vs Gemini 2.5 Pro: the definitive Q4 comparison

My most thorough three-way comparison yet. 500 prompts, 8 categories, 3 human raters. Claude wins coding and analysis. GPT-4o wins speed and multimodal. Gemini wins on long-context and cost. There's no single best model anymore.

Industry Trends · Oct 27, 2025 · 5 min read

Small language models in production: who's deploying what

I surveyed 50 companies deploying LLMs in production. 62% use models under 13B parameters. The most popular: Llama 3.2 3B (18%), Phi-4 (14%), and Mistral 7B (12%). Small models aren't just for research anymore.

Benchmark Analysis · Oct 20, 2025 · 5 min read

The LLM leaderboard is dead, long live the leaderboard

Hugging Face deprecated the Open LLM Leaderboard v1 and launched v2 with new benchmarks. I compared scores on both versions for 20 models. Some models dropped 15 points. The re-ranking is dramatic and some "top models" were just benchmark-optimized.

Data Stories · Oct 13, 2025 · 5 min read

AI energy consumption data: the numbers are bigger than you think

I compiled power consumption data for AI training and inference from every source I could find. A single GPT-4 query uses about 10x the energy of a Google search. At current growth rates, AI could consume 3% of US electricity by 2028.

Industry Trends · Oct 6, 2025 · 4 min read

NVIDIA B200 benchmarks are out. The inference economics just changed again.

The B200 delivers 2.5x the inference throughput of the H100 at roughly the same power consumption. I compared the per-token cost on B200 vs H100 vs H200. If you're running inference at scale, the upgrade pays for itself in 4 months.

Benchmark Analysis · Sep 29, 2025 · 5 min read

My monthly benchmark dashboard: September 2025 update

Monthly update to my running comparison of 15 models across 8 benchmarks. Big movers: Gemini 2.5 Pro gained 8 points on MMLU-Pro. Claude Opus 4 still leads on HumanEval. New entrant: Mistral Large 3.

Pricing Watch · Sep 22, 2025 · 4 min read

o3 and the reasoning model cost problem

OpenAI's o3 uses up to 10x the tokens of a standard model to "think." On hard math problems, a single o3 query can cost $2. I measured the token consumption across 100 problems and the spread is massive: from 500 tokens to 50,000.

Industry Trends · Sep 15, 2025 · 5 min read

The true cost of building an AI product in 2025: data from 30 startups

I surveyed 30 AI startups about their monthly costs. Median API spend: $8,400. Median total infra: $23,000. But the distribution is bimodal. Some spend $500/month with open source. Some spend $200,000 on API calls alone.

Model Comparisons · Sep 8, 2025 · 5 min read

Llama 4 405B vs Llama 3.1 405B: same size, very different model

Meta kept the size but changed the architecture. Llama 4 405B uses MoE, so only ~100B parameters are active. I benchmarked both on 10 tasks. Llama 4 is faster and scores 8-12% higher on coding. Training quality over brute force.

Data Stories · Sep 1, 2025 · 4 min read

The context window race is slowing down. Here's why that's fine.

In 2024, context windows doubled every 3 months. In 2025, they've barely changed. 1M tokens from Google. 200K from Anthropic. The reason? Most real-world tasks don't need more than 50K tokens. I have the usage data.

Pricing Watch · Aug 25, 2025 · 4 min read

AI inference costs by country: why geography matters for API pricing

Some providers route inference through different regions. I measured latency and calculated effective costs from 5 countries. Running Claude from Japan costs the same as the US. Running a self-hosted model in India costs 30% less. The global pricing map is uneven.

Benchmark Analysis · Aug 18, 2025 · 5 min read

Vision model benchmarks: who can actually read a chart?

I fed 50 real-world charts, tables, and diagrams to 8 multimodal models. Claude Opus 4 reads charts the most accurately at 89%. GPT-4o is at 82%. Gemini 2.5 Pro is at 85%. Most models struggle with handwritten text in images.

Benchmark Analysis · Aug 11, 2025 · 4 min read

The cost per correct answer: a new way to compare models

Raw benchmark scores ignore cost. I calculated "cost per correct answer" across 500 questions for 10 models. The cheapest correct answer comes from Gemini 2.5 Flash at $0.0003. The most expensive is GPT-4.5 at $0.14. A 467x difference.
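
As a rough illustration of the metric (a sketch, not the evaluation code behind the post), cost per correct answer is total spend divided by the number of correct answers. The function name and the sample run are hypothetical; only the $0.0003 and $0.14 figures come from the teaser above.

```python
def cost_per_correct(total_cost_usd: float, num_correct: int) -> float:
    """Total spend on a benchmark run divided by the number of correct answers."""
    if num_correct <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / num_correct


# Hypothetical run: $1.00 spent in total, 4 answers correct.
example = cost_per_correct(1.00, 4)

# Ratio between the two per-answer costs quoted in the teaser.
cheapest, priciest = 0.0003, 0.14
ratio = priciest / cheapest

print(f"{example:.2f} USD per correct answer; spread of about {ratio:.0f}x")
```

Dividing by correct answers rather than by questions asked is what penalizes a cheap model that is often wrong.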

Model Comparisons · Aug 4, 2025 · 5 min read

Claude Code vs Cursor vs Copilot Workspace: the AI coding war in data

I used all three on the same 20 real coding tasks. Claude Code completed 17. Cursor completed 15. Copilot Workspace completed 11. But completion rate isn't the whole story. I also tracked "time to working code" and "bugs introduced."

Data Stories · Jul 28, 2025 · 4 min read

AI model release frequency by quarter: a 4-year chart

I've been counting notable model releases since Q1 2021. The quarterly total went from 8 to 67 to... 54 in Q2 2025. The first decline. I think we've hit peak model release rate. The era of consolidation begins.

Industry Trends · Jul 21, 2025 · 4 min read

The H100 resale market is crashing. Pricing data from 6 months.

H100 GPU resale prices dropped 40% from their January peak. I tracked listings on 4 broker sites. The DeepSeek efficiency shock plus H200/B200 availability is creating a glut. Good news for startups.

Benchmark Analysis · Jul 14, 2025 · 4 min read

The frontier model gap just closed. Five models within 20 Elo points.

For the first time, the top 5 models on Chatbot Arena are within 20 Elo points of each other. Claude Opus 4, GPT-4o, Gemini 2.5 Pro, Grok 3, and DeepSeek V3. I analyzed what "virtually tied" means for model selection.
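
To put a 20-point gap in perspective, here is a minimal sketch using the standard Elo expected-score formula. Whether the leaderboard's ratings follow exactly this model is an assumption, and the function name is made up for illustration.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# A 20-point lead translates into only a slight edge in head-to-head votes.
win_rate_20 = expected_win_rate(20)
print(f"{win_rate_20:.3f}")
```

Under that formula the higher-rated model is expected to win only about 53% of head-to-head comparisons, which is why "virtually tied" is a fair description.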

Data Stories · Jul 7, 2025 · 5 min read

AI API uptime in H1 2025: the reliability report

Six months of continuous monitoring across 15 API providers. Anthropic: 99.7% uptime. OpenAI: 99.3%. Google: 99.1%. The outage patterns are interesting. Mondays and Thursdays are the worst days. I have theories about why.

Model Comparisons · Jun 30, 2025 · 5 min read

I tested 10 local LLM runtimes. Ollama vs LM Studio vs llama.cpp vs...

Local inference has gotten shockingly good. I tested 10 runtimes on the same hardware (M3 Max, 64GB). Ollama wins on ease of use. llama.cpp wins on raw speed. The performance gap between local and cloud is narrowing.

Industry Trends · Jun 23, 2025 · 5 min read

The open weight model scene, mid-2025: who's winning?

Meta, Alibaba, Mistral, DeepSeek, and 12 others are all releasing open weight models. I ranked them by Chatbot Arena Elo, Hugging Face downloads, and community adoption. Llama still leads downloads, but Qwen is closing fast.

Pricing Watch · Jun 16, 2025 · 5 min read

How much does it cost to run a chatbot with 1M daily users? I did the math.

1 million daily users, 5 messages each, average 300 tokens per response. At Claude 4 Sonnet pricing, that's $4,500/day. At GPT-4o mini, it's $300/day. I modeled the economics for 6 different model tiers.
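
The teaser does not say how input tokens or per-token prices enter the model, so the sketch below counts response tokens only and backs out the per-million-token rate that each quoted daily figure would imply. All function names are hypothetical.

```python
def daily_response_tokens(users: int, messages_per_user: int, tokens_per_response: int) -> int:
    """Response tokens generated per day (input tokens are ignored here)."""
    return users * messages_per_user * tokens_per_response


def daily_cost(tokens: int, usd_per_million: float) -> float:
    """Daily cost at a flat per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million


def implied_rate(daily_cost_usd: float, tokens: int) -> float:
    """Flat per-million-token rate that would produce the given daily cost."""
    return daily_cost_usd / (tokens / 1_000_000)


tokens = daily_response_tokens(1_000_000, 5, 300)   # 1.5 billion tokens per day
rate_high = implied_rate(4_500, tokens)             # rate implied by $4,500/day
rate_low = implied_rate(300, tokens)                # rate implied by $300/day
print(tokens, rate_high, rate_low)
```

Under this simplification, the quoted figures correspond to flat rates of about $3.00 and $0.20 per million tokens.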

Benchmark Analysis · Jun 9, 2025 · 5 min read

The SWE-bench Verified leaderboard: who's actually solving real bugs?

SWE-bench Verified filters out the easy problems. I compared scores on full SWE-bench vs Verified for 12 models. Some models drop 20+ points. The gap reveals who's gaming the benchmark vs who's actually good at coding.

Data Stories · Jun 2, 2025 · 4 min read

AI model sizes are SHRINKING. Here's the data.

The biggest model released in 2025 so far has fewer parameters than GPT-4. Efficiency gains from MoE, distillation, and better training data mean the era of "bigger is better" is fading. I charted the trend.

Model Comparisons · May 26, 2025 · 5 min read

Claude 4 Sonnet vs GPT-4o vs Gemini 2.5 Flash: the mid-tier model war

The mid-tier is where most developers actually work. I compared the three most popular "not-the-flagship" models on real-world tasks: summarization, extraction, classification, and code generation. Claude 4 Sonnet wins 3 of 4.

Data Stories · May 19, 2025 · 5 min read

The inference provider market: latency, cost, and uptime for 20 providers

I expanded my monthly monitoring to 20 providers. The new additions: Cerebras, Fireworks, Baseten, Modal, and Replicate. Cerebras leads on latency. Fireworks leads on cost efficiency. Updated rankings inside.

Benchmark Analysis · May 12, 2025 · 5 min read

The benchmark contamination problem is getting worse. New evidence.

I tested 15 models for memorization of MMLU questions. 4 of them could complete benchmark questions from the first few words alone. Contamination isn't just theoretical anymore. I can measure it.
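
One way such a prefix-completion check could look, as a sketch only: the post's actual method is not shown here, and the `complete` callable, the five-word prefix, and the 0.8 overlap threshold are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def prefix_completion_rate(
    questions: Iterable[str],
    complete: Callable[[str], str],   # hypothetical: maps a prompt to the model's continuation
    prefix_words: int = 5,            # assumed prefix length
    min_overlap: float = 0.8,         # assumed threshold for "reproduced from memory"
) -> float:
    """Fraction of questions whose remainder the model reproduces from a short prefix."""
    flagged = 0
    total = 0
    for question in questions:
        words = question.split()
        if len(words) <= prefix_words:
            continue  # too short to split into prefix and remainder
        total += 1
        prefix = " ".join(words[:prefix_words])
        remainder = words[prefix_words:]
        predicted = complete(prefix).split()
        # Position-by-position word matches against the true remainder.
        matches = sum(p == r for p, r in zip(predicted, remainder))
        if matches / len(remainder) >= min_overlap:
            flagged += 1
    return flagged / total if total else 0.0


# Stand-in "model" that has memorized exactly one of two made-up questions.
memory = {"alpha beta gamma delta epsilon": "zeta eta theta"}
sample = [
    "alpha beta gamma delta epsilon zeta eta theta",
    "one two three four five six seven eight",
]
rate = prefix_completion_rate(sample, lambda prompt: memory.get(prompt, ""))
print(rate)
```

A real check would also need a baseline on unseen questions, since some continuations are guessable without memorization.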

Model Comparisons · May 5, 2025 · 5 min read

AI agent frameworks: LangChain vs CrewAI vs Autogen. A data comparison.

I built the same 5 agent tasks on each framework and measured completion rates, token usage, and time to complete. LangChain is the most flexible. CrewAI finishes fastest. Autogen uses the fewest tokens. No clear winner.

Model Comparisons · Apr 28, 2025 · 5 min read

Qwen3 and the Chinese model wave: benchmarking 5 models from China

Qwen3, DeepSeek V3, Yi-Lightning, Baichuan 4, and MiniMax-01. I benchmarked all five against Claude 3.7 Sonnet and GPT-4o. Chinese models now occupy 3 of the top 10 spots on Chatbot Arena. The geographic distribution of AI talent is shifting.

Model Comparisons · Apr 21, 2025 · 5 min read

Claude Opus 4 is here. My first benchmark impressions.

Anthropic's new flagship model. Extended thinking, tool use, and code generation all feel meaningfully better. I ran my standard 300-prompt evaluation. Early data: it's the best model I've tested on coding tasks. Full analysis next week.

Pricing Watch · Apr 14, 2025 · 3 min read

The cost of AI dropped 97% in two years. One chart.

In March 2023, GPT-4 cost $60 per million output tokens. Today, GPT-4o mini costs $0.60. Same-class quality, 100x cheaper. I made one chart. That's the whole article. Sometimes the data speaks for itself.

Benchmark Analysis · Apr 7, 2025 · 5 min read

Gemini 2.5 Pro just took #1 on Chatbot Arena. The data behind the shift.

For the first time, a Google model sits at the top of the LMSYS leaderboard. I analyzed the vote patterns. Gemini 2.5 Pro dominates in coding and math. Claude still leads in creative tasks. The throne is now contestable.

Model Comparisons · Apr 5, 2025 · 5 min read

Llama 4 Scout and Maverick: Meta's MoE play, in data

Meta went mixture-of-experts with Llama 4. Scout is 17B active parameters from 109B total. Maverick is 17B from 400B. I benchmarked both against Llama 3.1 70B. The efficiency gains are exactly what the MoE math predicts.

Benchmark Analysis · Mar 31, 2025 · 5 min read

I benchmarked 8 reasoning models on the same 100 math problems

o1, o3-mini, DeepSeek R1, Claude 3.7 Sonnet (thinking), Gemini 2.5 Pro, Grok 3, QwQ-32B, and Phi-4. Same 100 MATH problems. Same evaluation criteria. The spread is tighter than you'd expect from the marketing.

Industry Trends · Mar 24, 2025 · 4 min read

The AI coding tool market is fragmenting. Here are the usage numbers.

I surveyed 200 developers about their AI coding tools. Cursor has 34% usage share. Copilot dropped to 28%. Claude Code is at 12% and rising fast. The "winner take all" era is over.

Data Stories · Mar 17, 2025 · 4 min read

The MCP protocol: how many tools does an AI agent actually need?

Anthropic's Model Context Protocol is becoming the standard for AI tool use. I surveyed 30 MCP server implementations and counted the tools each provides. The median is 7 tools. The maximum is 94. More isn't always better.

Benchmark Analysis · Mar 10, 2025 · 5 min read

Claude 3.7 Sonnet: Anthropic's hybrid thinking model, benchmarked

Claude 3.7 Sonnet can toggle extended thinking on and off. I tested it in both modes across 200 prompts. With thinking on, it matches o1 on MATH. With thinking off, it's still the best general-purpose model on Chatbot Arena.

Pricing Watch · Mar 3, 2025 · 5 min read

GPT-4.5 is the most expensive model ever released. Is it worth it?

$75 per million input tokens. That's 500x more than GPT-4o mini. I ran GPT-4.5 through my evaluation suite. It's good. Really good. But at this price, it only makes economic sense for a very narrow set of tasks.

Industry Trends · Mar 3, 2025 · 4 min read

The open source model release velocity is unsustainable. Here's why.

I counted 142 models released on Hugging Face in February 2025 alone. That's 5 per day. Downloads are up but download-per-model is falling. The attention pie is finite. I think a shakeout is coming.

Model Comparisons · Feb 24, 2025 · 5 min read

Gemini 2.5 Pro and "thinking" models: Google's answer to o1

Google added extended thinking to Gemini. I tested it against o1-preview and DeepSeek R1 on math and coding problems. Gemini 2.5 Pro wins on 4 of 6 benchmarks. Google is back in the reasoning race.

Model Comparisons · Feb 17, 2025 · 5 min read

Grok 3 and the xAI compute cluster: throwing brute force at AI

xAI built a 100K GPU cluster in Memphis. Grok 3 is the first model trained on it. The benchmarks are competitive with Claude 3.5 Sonnet and GPT-4o. I ran my standard evaluation. It's good, but the interesting story is the infrastructure bet.

Pricing Watch · Feb 10, 2025 · 5 min read

The real cost of AI agents: I tracked token usage for 50 agentic tasks

AI agents sound cheap per token. But they loop. A lot. I measured the total token consumption for 50 real agent tasks across Claude, GPT-4o, and Gemini. The average task used 47K tokens. Some hit 200K+.

Benchmark Analysis · Feb 3, 2025 · 4 min read

Claude 3.5 Sonnet is still #1 on Chatbot Arena. For how long?

Six months at the top of the LMSYS leaderboard. I pulled the vote data and looked at the categories where Claude 3.5 Sonnet wins most decisively: coding (Elo 1290), creative writing (1285), and instruction following (1280).

Pricing Watch · Feb 3, 2025 · 4 min read

Every AI pricing change in January 2025, tracked

Seven providers changed prices in January alone. Anthropic dropped Claude 3.5 Haiku's price. Google cut Gemini Flash. I updated the master table. The cheapest frontier-class model is now $0.10 per million input tokens.

Industry Trends · Jan 27, 2025 · 5 min read

The DeepSeek effect: AI stock prices dropped $1 trillion in a day. The data.

When DeepSeek showed you could train a frontier model for $5.6M, NVIDIA lost $589 billion in market cap in a single day. I charted the stock movements of every major AI company. The repricing of "compute moats" was instant.

Benchmark Analysis · Jan 20, 2025 · 6 min read

DeepSeek R1 just broke every reasoning benchmark. And it's open source.

DeepSeek R1 matches o1 on math and coding benchmarks at a fraction of the cost. And they released the weights. I compared R1 against o1-preview on 200 reasoning problems. The scores are within 2 points on MATH and GPQA.

Model Comparisons · Dec 26, 2024 · 7 min read

DeepSeek V3: a Chinese model that costs almost nothing to train

DeepSeek V3 reportedly cost $5.6M to train. GPT-4 allegedly cost $100M+. I dug into the technical report and the training efficiency numbers. If these costs are real, the frontier just got a lot more accessible.

Data Stories · Dec 23, 2024 · 6 min read

My 2024 prediction scorecard: reasoning models were my biggest miss

I didn't predict reasoning models at all. I thought scale would keep winning. Instead, o1 showed that inference-time compute is a whole new axis. My biggest hit? Predicting open source would reach GPT-4 level by year end.

Data Stories · Dec 16, 2024 · 9 min read

2024 AI data roundup: the year of commoditization

API prices fell 90%. Open source matched GPT-4. Reasoning models appeared. AI coding assistants went mainstream. I compiled 25 charts that tell the story of 2024's wild ride.

Model Comparisons · Dec 11, 2024 · 6 min read

Google Gemini 2.0 Flash: the speed-to-quality ratio is unprecedented

Gemini 2.0 Flash matches GPT-4o on most of my tests while being 3x faster and significantly cheaper. I ran my standard evaluation across 300 prompts. Google finally has a model that's both fast and good.

Data Stories · Nov 25, 2024 · 6 min read

The Q4 2024 model release tracker: 67 models in 90 days

I tracked every notable model release in Q4 2024. Sixty-seven models from 23 organizations. That's nearly one model per day. The pace is unsustainable and I suspect a consolidation is coming.

Data Stories · Nov 4, 2024 · 7 min read

The state of AI APIs: speed, cost, and reliability across 15 providers

I monitored 15 AI API providers for 30 days straight, logging latency, error rates, and uptime. The results are a mess. Anthropic has the best uptime. Groq has the best speed. Nobody has both.

Model Comparisons · Oct 22, 2024 · 6 min read

Claude 3.5 Sonnet (new) and computer use: my first benchmark data

Anthropic updated Claude 3.5 Sonnet and added computer use. I tested both the model improvements and the computer use capability. Model quality jumped noticeably. Computer use works about 60% of the time in my tests.

Pricing Watch · Oct 7, 2024 · 6 min read

The inference cost of reasoning models: o1 vs Claude 3.5 Sonnet per correct answer

Reasoning models use more tokens to think. But if they get the answer right more often, the cost per CORRECT answer might actually be lower. I ran the math on 500 coding problems. The results surprised me.

Model Comparisons · Sep 19, 2024 · 6 min read

Qwen 2.5 is the best open source model nobody is talking about

Alibaba's Qwen 2.5 72B beats Llama 3.1 70B on my tests. It's also the best model for CJK languages by a wide margin. I benchmarked it in English, Chinese, and Japanese. The English results alone deserve attention.

Benchmark Analysis · Sep 12, 2024 · 7 min read

o1 and 'reasoning' models: the benchmark scores look different this time

OpenAI's o1 trades speed for accuracy by 'thinking' before answering. The math and coding benchmarks are way up, but the costs are 6x higher per task. I broke down the cost-per-correct-answer metric and it's actually competitive.

Benchmark Analysis · Sep 2, 2024 · 6 min read

The SWE-bench problem: are coding benchmarks measuring the right thing?

Every new model touts its SWE-bench score. I analyzed the test cases and found 23% of them can be 'solved' by a simple regex patch. The benchmark isn't wrong exactly, but it's not measuring what you think.

Pricing Watch · Aug 12, 2024 · 6 min read

OpenAI just launched their cheapest model. Here's every price tier compared.

Updated master pricing table with 34 models from 9 providers. The cheapest useful model is now Gemini 1.5 Flash at $0.075/M input tokens. Three years ago that would've cost $60. I charted the deflation.

Pricing Watch · Aug 5, 2024 · 7 min read

The cost of running Llama 3.1 405B: cloud vs self-hosted, the full math

Running 405B parameters needs serious hardware. I priced out 4 configurations: AWS, Lambda Labs, self-hosted with 8xA100s, and 8xH100s. The monthly costs range from $4,200 to $31,000 depending on utilization.

Benchmark Analysis · Jul 23, 2024 · 6 min read

Llama 3.1 405B: the first truly GPT-4 class open model. My benchmark data.

Meta released a 405 billion parameter model under an open license. I ran it on 10 standard benchmarks and 5 of my own. It matches GPT-4 within margin of error on 7 of 15. This is a milestone.

Pricing Watch · Jul 18, 2024 · 6 min read

GPT-4o mini is $0.15 per million tokens. The race to the bottom is real.

GPT-4o mini costs 100x less than GPT-4 did at launch. I plotted the price per million tokens for OpenAI's best available model at each point in time. The curve is a cliff.

Model Comparisons · Jun 20, 2024 · 6 min read

Claude 3.5 Sonnet is better than Claude 3 Opus. And it's 5x cheaper.

The mid-tier model just beat the flagship. I ran Claude 3.5 Sonnet through every test I used for Opus, and it wins on 71% of them. At $3/M tokens vs $15, the value math is absurd.

Model Comparisons · Jun 10, 2024 · 7 min read

I benchmarked 12 coding assistants. Cursor is not what I expected.

GitHub Copilot, Cursor, Cody, Continue, Tabnine, and 7 others. I used each one for a full week and tracked acceptance rates, bug rates, and time saved. Cursor surprised me. Copilot disappointed me.

Benchmark Analysis · Jun 3, 2024 · 7 min read

Gemini 1.5 Pro has a 1 million token context window. I tested it with real documents.

Google says 1 million tokens. That's approximately 1,500 pages. I fed it actual long documents and tested retrieval at various depths. Performance degrades gracefully until about 800K, then falls off a cliff.

Benchmark Analysis · May 27, 2024 · 6 min read

The LMSYS Elo gap between open and closed source models just shrank to 50 points

In January 2023, the Elo gap between the best open source model and GPT-4 was 200+ points. It's now about 50. I charted the convergence curve. At this rate, parity arrives in Q3 2024.

Data Stories · May 20, 2024 · 6 min read

The 'vibe check' era: why benchmarks are losing to vibes

I asked 50 AI developers how they evaluate models. 73% said 'I just try it and see how it feels.' Only 12% run formal benchmarks. The industry is moving from data-driven evaluation to... vibes. I have mixed feelings about this.

Pricing Watch · May 13, 2024 · 7 min read

GPT-4o is multimodal AND cheaper. I have questions about the pricing.

OpenAI released GPT-4o at half the price of GPT-4 Turbo, with vision and audio included. I calculated the per-task costs across text, image, and audio. The audio pricing is suspiciously cheap.

Benchmark Analysis · Apr 23, 2024 · 6 min read

Phi-3 Mini is a 3.8B model that's shockingly good. Small model benchmarks.

Microsoft's Phi-3 Mini has 3.8 billion parameters and beats Llama 3 8B on several benchmarks. I ran it locally on a MacBook M2. The small model revolution is accelerating faster than the big model one.

Model Comparisons · Apr 18, 2024 · 6 min read

Llama 3 8B beats Llama 2 70B. Let that sink in.

A model 9x smaller is now better. I benchmarked Llama 3 8B against Llama 2 70B on 6 tasks. The small model wins on 4 of them. Training data quality is eating model size for breakfast.

Industry Trends · Mar 25, 2024 · 8 min read

The AI chip market in 2024: not just NVIDIA anymore

I compiled specs and benchmarks for every AI accelerator announced in the last 12 months. NVIDIA H100, AMD MI300X, Google TPU v5e, Groq LPU, Intel Gaudi 3, and 8 others. The competition is finally real.

Pricing Watch · Mar 11, 2024 · 6 min read

The Claude 3 model family pricing is actually brilliant. Here's why.

Haiku at $0.25/M tokens, Sonnet at $3, Opus at $15. Anthropic isn't just pricing models, they're pricing use cases. I compared the price-to-quality ratio across all three and the tiering makes perfect economic sense.

Benchmark Analysis · Mar 4, 2024 · 7 min read

Claude 3 Opus is the first model to genuinely worry me about benchmarks

Claude 3 Opus matched or beat GPT-4 on most benchmarks, but the 'needle in a haystack' test is what got me. It detected that it was being tested. I ran my own version and the results are strange.

Model Comparisons · Feb 26, 2024 · 6 min read

Mistral Large vs GPT-4 vs Claude 3 Opus: the three-way benchmark

Mistral finally has a frontier model. I ran all three through my standard 300-prompt evaluation. Mistral Large is competitive but not quite there. The interesting part is where it wins: European languages.

Data Stories · Feb 19, 2024 · 6 min read

Groq's LPU just served me 800 tokens per second. The inference speed data.

Groq's custom chip hit 800 tokens/second on Mixtral 8x7B. I measured latency across 100 requests and compared to 5 other inference providers. Groq is 18x faster than the median. Speed changes what's possible.

Pricing Watch · Jan 22, 2024 · 6 min read

Every LLM API price drop in the last 12 months, in one chart

I logged every API price change since January 2023. There have been 23 price drops across 8 providers. The average price of a million output tokens fell 78%. I've never seen deflation this fast in tech.

Pricing Watch · Jan 8, 2024 · 5 min read

Mixtral 8x7B is free to run and matches GPT-3.5. The inference economics are changing.

I set up Mixtral on a single A100 and benchmarked throughput. At 95 tokens/second, the cost per million tokens is $0.18. The OpenAI API charges $0.50. Open source inference is now genuinely cheaper.

Data Stories · Dec 26, 2023 · 6 min read

My 2023 prediction scorecard

I predicted open source would stay 2 years behind closed source. I was wrong by a lot. Llama 2 closed the gap in months. Here's my full scorecard for 2023.

Data Stories · Dec 18, 2023 · 10 min read

2023 AI data roundup: the year the dam broke

GPT-4, Llama 2, Mistral, Claude 2, SDXL, and ChatGPT hitting 100M users. I compiled 20 charts that tell the story of 2023. This was the year AI stopped being a niche interest.

Model Comparisons · Dec 11, 2023 · 6 min read

Mixtral 8x7B: the MoE model that changes the economics of inference

Mistral dropped Mixtral via a magnet link (no paper, no blog post, just a torrent). The benchmarks leaked within hours. A mixture-of-experts model at GPT-3.5 quality with about 13B active parameters? The inference cost math is wild.

Benchmark Analysis · Dec 6, 2023 · 6 min read

Google Gemini benchmarks vs GPT-4: reading the fine print

Google claims Gemini Ultra beats GPT-4 on 30 of 32 benchmarks. I dug into the methodology. They're comparing against launch-day GPT-4, not the current version. And some of the benchmark configurations are... creative.

Benchmark Analysis · Nov 20, 2023 · 6 min read

The 'contamination' problem: when benchmarks stop meaning anything

I found evidence that at least 6 models on the Hugging Face leaderboard were trained on benchmark test data. When your test set is in the training data, your scores are meaningless. I built a simple check for this.

Pricing Watch Nov 6, 2023 6 min read

GPT-4 Turbo is 3x cheaper. Here's what that means for the API pricing war.

OpenAI just cut GPT-4 pricing with GPT-4 Turbo: input tokens are 3x cheaper and output tokens 2x cheaper. I updated my master pricing comparison table. The gap between open source and closed source API costs is narrowing fast.

Industry Trends Oct 30, 2023 6 min read

The GPU shortage data: who has capacity and who's lying about it

I surveyed 40 AI companies about GPU access. 78% reported 'severe constraints.' But cloud provider utilization data tells a slightly different story. Some companies have more H100s than they're admitting.

Data Stories Oct 23, 2023 7 min read

How I track AI model releases: my personal data system

People keep asking how I stay on top of all these model releases. Here's my actual system: RSS feeds, arXiv alerts, a spreadsheet with 312 rows, and a Python script that checks Hugging Face daily.

Data Stories Oct 16, 2023 6 min read

Every major LLM's context window, charted over time

In January 2023, 4K tokens was standard. By October, we've got 100K (Claude), 32K (GPT-4), and 128K (Anthropic internal). I charted the context window growth curve. It's exponential.

Benchmark Analysis Oct 2, 2023 6 min read

Claude 2 is surprisingly good at long documents. Here's my data.

Claude 2's 100K context window is its killer feature. I tested it with documents of 10K, 25K, 50K, and 100K tokens. Retrieval accuracy drops from 97% to 71% as length increases, but that's still way better than chunking strategies.

Model Comparisons Sep 27, 2023 6 min read

Mistral 7B just beat Llama 2 13B. Small models are getting weird.

A 7B parameter model outperforming a 13B model shouldn't be possible under simple scaling laws. But Mistral did it. I compared the benchmarks and the architecture differences that explain how.

Benchmark Analysis Sep 11, 2023 6 min read

LMSYS Chatbot Arena has 200K votes. It might be the best benchmark we have.

LMSYS's crowdsourced Elo ratings are based on 200K+ human votes cast in blind model comparisons. I analyzed the vote distributions and demographic patterns. It's noisy, but it's the closest thing to 'what real users think.'

Pricing Watch Aug 28, 2023 5 min read

The real cost of training Llama 2: Meta's numbers vs my estimates

Meta says Llama 2 70B used 1.7M GPU hours of A100 time. At current cloud prices, that's roughly $5.4M. But Meta used their own hardware. I estimated the real cost and it's probably 60% less.
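The estimate above is GPU hours times an hourly rate. A quick sketch of the arithmetic, using only the figures quoted in the teaser; the variable names are illustrative, and the hourly rate is back-calculated from those figures rather than taken from any price list.

```python
def training_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Total training cost for a given number of GPU hours at a flat hourly rate."""
    return gpu_hours * usd_per_gpu_hour


gpu_hours = 1_700_000        # A100 hours quoted for Llama 2 70B
cloud_cost = 5_400_000       # cloud-price estimate quoted above, in USD

# Hourly rate implied by the two quoted figures.
implied_rate = cloud_cost / gpu_hours            # about $3.18 per A100-hour

# "Probably 60% less" on owned hardware.
in_house_cost = cloud_cost * (1 - 0.60)          # about $2.16M

print(f"${implied_rate:.2f}/GPU-hour, ${in_house_cost:,.0f} in-house")
```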

Data Stories Aug 14, 2023 9 min read

11 charts that explain the open source AI wave

Open source models went from curiosity to contender in 18 months. I made 11 charts tracking downloads, benchmark scores, funding, and community growth. The trend line is unmistakable.

Pricing Watch Jul 31, 2023 6 min read

The cost of self-hosting vs API: a real comparison for Llama 2

Can you actually save money running Llama 2 yourself instead of using the OpenAI API? I calculated it. The answer depends on your volume, but the break-even point is lower than I expected.

Model Comparisons Jul 19, 2023 6 min read

Llama 2 is here and it's actually good. My benchmark data.

Meta released Llama 2 with a commercial license. I benchmarked the 70B model against GPT-3.5-turbo on 8 tasks. Llama 2 70B matches or beats GPT-3.5 on 5 of them. Open source just got real.

Industry Trends May 15, 2023 6 min read

AI funding in Q1 2023 is absolutely bonkers. Let me show you the numbers.

$12.4 billion in AI startup funding in Q1 2023 alone. That's more than all of 2020. I broke it down by category, stage, and geography. Generative AI is 73% of the total.

Benchmark Analysis Apr 24, 2023 5 min read

The Hugging Face Open LLM Leaderboard is becoming the de facto benchmark. That's a problem.

Every open source model now optimizes for the Hugging Face leaderboard. I checked: 12 of the top 20 models were specifically fine-tuned on leaderboard benchmark data. Goodhart's Law is hitting AI benchmarks hard.

Model Comparisons Apr 10, 2023 7 min read

Claude vs GPT-4: my first head-to-head data comparison

Anthropic's Claude is in beta and I got access. I ran both models through 300 prompts across coding, writing, and reasoning. Claude wins on length and nuance. GPT-4 wins on accuracy. The results are close.

Pricing Watch Mar 20, 2023 6 min read

GPT-4 is 15x more expensive than GPT-3.5. Is it 15x better?

GPT-4 costs $0.03/1K input tokens vs $0.002 for GPT-3.5-turbo. That's a 15x price jump. I ran 500 real-world tasks on both and measured quality. The value proposition is... complicated.
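To see what that multiple means on a real bill, here is a small sketch using the two per-1K input prices quoted above. The helper name and the one-million-token workload are illustrative assumptions.

```python
GPT4_USD_PER_1K_INPUT = 0.03      # quoted above
GPT35_USD_PER_1K_INPUT = 0.002    # quoted above


def workload_cost_usd(input_tokens: int, usd_per_1k_tokens: float) -> float:
    """Input-token cost of a workload at a given per-1K-token price."""
    return input_tokens / 1000 * usd_per_1k_tokens


ratio = GPT4_USD_PER_1K_INPUT / GPT35_USD_PER_1K_INPUT   # about 15x

# Hypothetical workload: one million input tokens.
gpt4_cost = workload_cost_usd(1_000_000, GPT4_USD_PER_1K_INPUT)     # about $30
gpt35_cost = workload_cost_usd(1_000_000, GPT35_USD_PER_1K_INPUT)   # about $2

print(f"{ratio:.0f}x: ${gpt4_cost:.2f} vs ${gpt35_cost:.2f}")
```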

Model Comparisons Mar 6, 2023 5 min read

LLaMA leaked. Here's what Meta's model weights actually look like.

Meta's LLaMA was supposed to be research-only. It leaked within a week. Now everyone can benchmark it. I ran LLaMA-13B against GPT-3.5 on 5 tasks. The results are closer than Meta probably wanted.

Data Stories Feb 20, 2023 6 min read

I counted every AI model released this quarter. Here's what I found.

Q4 2022 had 31 notable model releases. Q1 2023 is on pace for 58. The acceleration is real, and it's not just one company driving it. I categorized every single one.

Pricing Watch Jan 30, 2023 5 min read

The LLM pricing war just started. Here's every provider's cost per token.

OpenAI, Anthropic, Cohere, AI21 Labs, and Google all have LLM APIs now. I made a comparison table of every pricing tier. The spread is 47x between the cheapest and most expensive option.

Benchmark Analysis Jan 9, 2023 7 min read

GPT-4 benchmark scores are insane. But let me show you the fine print.

Everyone is sharing GPT-4's bar exam score. Almost nobody is talking about the benchmarks where it barely beats GPT-3.5. I broke down all 23 benchmarks in the technical report. The picture is more mixed than the headlines suggest.

Data Stories Dec 31, 2022 6 min read

My 2022 prediction scorecard: how wrong was I?

In January I made 10 predictions about AI in 2022. I got 4 right, 3 half-right, and 3 completely wrong. The biggest miss? I didn't predict ChatGPT would exist.

Data Stories Dec 28, 2022 8 min read

2022 in AI data: the year everything accelerated

From DALL-E 2 to ChatGPT, 2022 was the year AI left the research lab. I compiled 15 charts that tell the story. The most important number? ChatGPT's 1M users in 5 days vs GPT-3's 300K waitlist after 6 months.

Model Comparisons Dec 12, 2022 6 min read

ChatGPT vs GPT-3: same model family, wildly different results. The data.

ChatGPT is based on GPT-3.5, but it behaves nothing like the raw API. I ran 200 identical prompts on both. ChatGPT refuses 23% of prompts that GPT-3 answers happily. RLHF changed more than people think.

Data Stories Dec 5, 2022 6 min read

ChatGPT hit 1 million users in 5 days. Here's the growth data in context.

I compared ChatGPT's user growth curve to Instagram, TikTok, Spotify, and Netflix. Nothing comes close. ChatGPT's first week makes every other consumer tech launch look slow.

Industry Trends Dec 5, 2022 6 min read

Wait, Stable Diffusion has HOW many forks? The open source explosion in numbers.

Three months after Stable Diffusion's release, I counted 847 forks and derivative projects on GitHub. The rate of open source AI proliferation is unlike anything I've seen in tech.

Benchmark Analysis Nov 14, 2022 6 min read

I ran GPT-3 on the same 50 questions every month for a year. Here's the drift.

Model outputs aren't static. I asked GPT-3 the same 50 factual questions monthly for 12 months. 17 answers changed. Some got better. Some got worse. 'Model drift' is real and measurable.

Industry Trends Oct 31, 2022 6 min read

Anthropic just raised $580M. Let's talk about the AI safety funding numbers.

I compiled every dollar raised by AI safety organizations in 2022. The total is $1.9 billion. But 87% went to just two companies. The distribution is incredibly top-heavy.

Data Stories Oct 10, 2022 6 min read

GitHub Copilot: 6 months of usage data from my own coding

I logged every Copilot suggestion for 6 months and accepted 34.2% of them. The acceptance rate varies wildly by language: 52% for Python, 18% for Rust.

Benchmark Analysis Sep 26, 2022 7 min read

The Chinchilla scaling laws changed everything. Let me show you why.

DeepMind's Chinchilla paper says most large models are undertrained. I ran the numbers: if Chinchilla's scaling laws are right, GPT-3 should have used 4.6x more training data. The implications are huge.

Data Stories Sep 12, 2022 7 min read

Every model released in 2022 so far, in one table

47 notable models in 9 months. I put them all in a table with release date, parameters, training data size, and whether they're open or closed. The pattern is hard to miss.

Pricing Watch Aug 22, 2022 5 min read

Stable Diffusion is free. The pricing math of open source image generation.

Stability AI released Stable Diffusion and suddenly image generation costs dropped from ~$0.02/image (DALL-E 2) to essentially free if you have a GPU. I calculated the break-even point.

Model Comparisons Jul 11, 2022 6 min read

Midjourney v3 vs DALL-E 2: 100 prompts, head to head

Same 100 prompts, two models, blind rating by 5 people. Midjourney wins on 'aesthetic feel' 64% of the time. DALL-E 2 wins on 'prompt accuracy' 71% of the time. The data is fascinating.

Industry Trends Jun 6, 2022 6 min read

Open source AI is having a moment. Here are the download numbers.

BLOOM just launched. GPT-NeoX is out. I pulled download stats from Hugging Face for every open source LLM. The adoption curves are starting to look serious.

Benchmark Analysis May 9, 2022 7 min read

InstructGPT and RLHF: what the training data tells us

OpenAI's InstructGPT paper has fascinating details about the human labeler workforce. 40 contractors, 5 steps, and the data quality metrics that made RLHF work.

Data Stories Apr 18, 2022 6 min read

I tracked AI image generation quality over 6 months. The improvement rate is scary.

I've been generating the same 50 prompts on each new model as it releases. The quality jump from January to April 2022 is the steepest improvement curve I've ever plotted.

Model Comparisons Mar 21, 2022 5 min read

Google's PaLM has 540 billion parameters. Let me put that number in context.

Every time a new model drops, the parameter count gets bigger and the context gets lost. I made a chart showing every major model's parameter count since 2018. PaLM is... a lot.

Pricing Watch Feb 14, 2022 5 min read

The cost of running an AI startup in 2022: a data breakdown

I surveyed 23 AI startup founders about their cloud compute bills. The median monthly GPU spend is $14,000. One is paying $200,000/month. The variance is absurd.

Benchmark Analysis Jan 24, 2022 6 min read

DALL-E 2 is out. I ran 200 prompts and measured the results.

I generated 200 images across 10 categories and rated coherence, prompt adherence, and artifact frequency. DALL-E 2 is good, but 'good' means different things for different prompt types.

Data Stories Dec 27, 2021 7 min read

My 2021 AI data roundup: the 10 numbers that mattered most

From GPT-3's pricing to GPU shortages to the rise of the Hugging Face model zoo. These are the 10 data points from 2021 that I think will matter most looking back.

Data Stories Nov 8, 2021 3 min read

AI research papers published in 2021: a mid-year count

I counted arXiv submissions with "artificial intelligence", "machine learning", or "deep learning" in the title. 2021 is on pace to smash 2020's record by 34%.

Industry Trends Oct 11, 2021 6 min read

Hugging Face just hit 10,000 models. Here's what the model zoo looks like.

I scraped the Hugging Face model hub and categorized all 10,000+ models by type, language, and download count. Text generation is only 8% of the total. The real king is NER.

Data Stories Sep 20, 2021 6 min read

The training cost curve is doing something weird

I plotted the estimated training costs of every major model from 2018 to 2021. The curve isn't going up linearly. It's doing something much weirder, and the inflection point was GPT-3.

Data Stories Aug 16, 2021 7 min read

5 charts that explain why GPU prices went insane in 2021

I tracked GPU prices across eBay, Newegg, and Amazon for six months. The RTX 3090 hit 3x MSRP in February. Here's the full timeline with data.

Model Comparisons Jul 26, 2021 7 min read

GPT-3 vs GPT-J: the first real open source challenger, in data

EleutherAI released GPT-J-6B and I benchmarked it against the comparably sized GPT-3 model. For a free model, the numbers are surprisingly close on some tasks.

Pricing Watch Jun 14, 2021 5 min read

Codex and the cost of code generation: my first pricing analysis

OpenAI's Codex is in private beta and I got access. I ran 500 code generation requests and tracked the token costs. Generating a Python function costs about $0.003 on average.

Industry Trends May 20, 2021 6 min read

I counted every AI startup that raised money in Q1 2021. The numbers are strange.

127 AI startups raised funding in Q1 2021. I categorized all of them. The "generative AI" category barely exists yet. Most money is still going to enterprise ML tools.

Model Comparisons Apr 12, 2021 5 min read

DALL-E's first images vs what people expected: a data comparison

OpenAI's DALL-E paper dropped in January and I've been collecting reaction data. The gap between what researchers expected and what it actually produces is measurable.

Industry Trends Mar 15, 2021 3 min read

The GPT-3 API waitlist is 6 months long. Here's what the early data looks like.

I've been tracking GPT-3 API access reports since launch. The waitlist data tells a story about who OpenAI is letting in first, and it's not random.

Benchmark Analysis Feb 22, 2021 9 min read

Every AI benchmark from 2020, ranked by how much they actually tell you

I went through 14 major benchmarks used in 2020 AI papers. Some are genuinely useful. Some are theater. Here's my ranking with the data to back it up.

Pricing Watch Jan 18, 2021 4 min read

Wait, GPT-3 costs HOW much per token?

I spent a weekend calculating the actual per-word cost of GPT-3's different engines. The price difference between Davinci and Ada is wild, and most people are using the wrong one.

Free AI Data Tools

LLM cost calculator, benchmark decoder, and model size visualizer. Built for people who care about the numbers.
