What I've learned tracking AI data for 5 years

Five years.

I published my first article on January 18, 2021. "Wait, GPT-3 costs HOW much per token?" It was 1,200 words about the pricing of OpenAI's API tiers. I had a spreadsheet with 6 rows.

Today that spreadsheet has 4,200 rows. 147 articles. Three redesigns of the tracking system. More benchmark data than I know what to do with.

This is the most personal article I've written. Not about any model or benchmark. About what five years of staring at AI data has taught me.

The three things I got right

1. Open source would catch up

In article #7 (July 2021), I wrote about GPT-J-6B and said "for a free model, the numbers are surprisingly close." The open source community was just getting started.

I kept tracking the gap between open and closed source models. The gap narrowed from 200+ Elo points (2023) to under 30 (2025). I called the convergence correctly in my 2024 predictions.
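To put those gap sizes in perspective, here is a minimal sketch that converts an Elo gap into an expected head-to-head win rate. It assumes the gaps are on the standard Elo scale (logistic curve, base 10, 400-point divisor); the function name `expected_win_rate` is made up for illustration, and the leaderboard behind the numbers above may use a slightly different rating model.

```python
def expected_win_rate(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model under the standard Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


# Gap sizes taken from the paragraph above ("200+" in 2023, "under 30" in 2025).
for label, gap in [("2023 gap", 200), ("2025 gap", 30)]:
    print(f"{label}: {gap} Elo -> {expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, a 200-point gap means the stronger model wins roughly three out of four matchups, while a 30-point gap is close to a coin flip at about 54%.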

What I didn't predict: the convergence was driven by Chinese labs (DeepSeek, Qwen), not by the Western open source community I was watching.

2. Prices would fall faster than anyone expected

My first pricing article tracked GPT-3's pricing tiers. $60/M tokens for Davinci. I thought that was expensive and would come down "maybe 50% in a few years."

It fell 400x. In five years. The pricing deflation curve is the single most dramatic data trend I've ever tracked in any domain. I called the direction right. I underestimated the magnitude by about 8x.
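The arithmetic behind that curve can be sketched in a few lines. The inputs ($60/M, 400x, five years, "maybe 50%") come from the text; the variable names are illustrative. Treating the 400x drop as a smooth annual rate is a simplifying assumption, and so is reading "about 8x" as a count of halvings rather than a ratio of fold-changes.

```python
import math

start_price = 60.0       # USD per million tokens (Davinci, from the text)
fold_drop = 400.0        # total decline reported in the text
years = 5
predicted_drop = 0.50    # the original "maybe 50%" guess

implied_price = start_price / fold_drop        # price today if $60/M fell 400x
annual_factor = fold_drop ** (1 / years)       # smooth per-year decline factor

# One way to compare prediction and outcome: count price halvings.
halvings_actual = math.log2(fold_drop)
halvings_predicted = math.log2(1 / (1 - predicted_drop))

print(f"Implied price: ${implied_price:.2f}/M tokens")
print(f"Average decline: {annual_factor:.2f}x cheaper per year")
print(f"Halvings: predicted {halvings_predicted:.1f}, actual {halvings_actual:.1f}")
```

Measured in halvings, the guess was one and the outcome was between eight and nine, which is one reading under which the "about 8x" figure works out. Measured as a ratio of fold-changes (2x predicted versus 400x actual), the miss is much larger.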

3. Benchmarks would break

In April 2023, I wrote "The Hugging Face Open LLM Leaderboard is becoming the de facto benchmark. That's a problem." I flagged contamination and overfitting early.

By 2025, every static benchmark I warned about was either saturated or contaminated. Chatbot Arena became the gold standard, which fits the logic of evolving evaluation: a test that keeps changing is much harder to memorize.

The five things I got wrong

1. I underestimated reasoning models

When OpenAI launched o1 in September 2024, I thought it was a niche product. "Interesting but impractical at this price." I wrote a cautious article.

Then DeepSeek R1 showed reasoning could be cheap AND open source. Then every provider added thinking modes. Reasoning models weren't niche. They were a new approach that redefined what AI can do on hard problems.

My 2024 prediction scorecard gave my reasoning-model call a D grade. I should have seen the potential.

2. I thought model size would keep growing

I expected 10-trillion-parameter models by 2025. Instead, active parameter counts peaked around 280B (GPT-4) and started declining. MoE architectures made "how big" the wrong question. "How smart per active parameter" was the right one.

3. I predicted an AI winter scare

In early 2024, I thought investor sentiment would cool and trigger a mini-winter. It didn't happen. Funding grew every quarter. The application layer absorbed the hype as real products shipped.

4. I was wrong about hardware competition timing

I predicted AMD would be competitive with NVIDIA by 2024. It took until late 2025, and even then, "competitive" means 85-90% of NVIDIA's performance, not parity. The software moat (CUDA) is deeper than I estimated.

5. I didn't predict the Chinese lab surge

In my 2023 roundups, Chinese AI labs barely featured. By 2025, three Chinese models were in the global top 10. DeepSeek became the most important training efficiency story in AI history. I had a blind spot for non-English AI research, and it cost me.

The one trend I still can't explain

Benchmark saturation keeps happening faster than anyone expects.

| Benchmark | Years from launch to saturation |
|-----------|---------------------------------|
| SuperGLUE | 2 years (2019-2021) |
| MMLU | 3 years (2021-2024) |
| HumanEval | 2 years (2021-2023) |
| GSM8K | 2 years (2021-2023) |
| MATH | 3 years (2021-2024) |
| GPQA | Still going (launched 2023) |

Sources: arXiv, Papers With Code, benchmark papers.

Every benchmark gets saturated in 2-3 years. The community creates them faster, but models solve them faster too. We're in an arms race between benchmark creators and model trainers, and the models are winning.
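The 2-3 year figure can be checked directly against the table. This is a small sketch using only the launch and saturation years listed above; the `benchmarks` dictionary is an illustrative structure, not the actual tracking spreadsheet.

```python
from statistics import mean

# (launch year, saturation year) as listed in the table; None = not yet saturated.
benchmarks = {
    "SuperGLUE": (2019, 2021),
    "MMLU": (2021, 2024),
    "HumanEval": (2021, 2023),
    "GSM8K": (2021, 2023),
    "MATH": (2021, 2024),
    "GPQA": (2023, None),
}

# Years from launch to saturation, skipping benchmarks that are still open.
durations = {
    name: end - start
    for name, (start, end) in benchmarks.items()
    if end is not None
}
avg_years = mean(list(durations.values()))

print(durations)
print(f"Range: {min(durations.values())}-{max(durations.values())} years, "
      f"mean {avg_years:.1f} years")
```

With only five saturated benchmarks, the mean of 2.4 years describes a pattern rather than a forecast, which is part of why a predictive model is hard to build from this data.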

I've been tracking this pattern for 4 years and I still can't build a good model for predicting when a benchmark will saturate. The tempo seems roughly constant (2-3 years) regardless of difficulty. GPQA, which requires PhD-level knowledge, is already at 80%+ for frontier models after just 3 years. At this rate, it saturates by 2027.

Why is the timeline so consistent? I don't know. It might be related to how fast benchmark questions diffuse into training data. It might be that model capability advances at a roughly constant rate that happens to match 2-3 years. I've been staring at this data and I can't tell.

What I wish someone had told me in 2021

| Advice | Why |
|--------|-----|
| Track active parameters, not total parameters | Total params became meaningless with MoE |
| Price per correct answer > price per token | The only metric that ties to business value |
| Read Chinese papers | Half the important work comes from there |
| Don't trust sample-based estimates | Always run COUNT(*) on the full dataset |
| The data is the story | Don't force narratives onto numbers that don't support them |
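As a rough illustration of the "price per correct answer" row, the sketch below divides the expected cost of one attempt by the model's accuracy. The two models, their prices, token counts, and accuracies are entirely hypothetical, and the formula assumes every attempt costs the same whether or not it succeeds.

```python
def cost_per_correct(price_per_m_tokens: float,
                     tokens_per_attempt: int,
                     accuracy: float) -> float:
    """Expected USD spent per correct answer: cost of one attempt / accuracy."""
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_attempt = price_per_m_tokens * tokens_per_attempt / 1_000_000
    return cost_per_attempt / accuracy


# Hypothetical models: A is cheaper per token, B is more accurate and terser.
cost_a = cost_per_correct(price_per_m_tokens=1.00, tokens_per_attempt=4000, accuracy=0.40)
cost_b = cost_per_correct(price_per_m_tokens=2.00, tokens_per_attempt=1500, accuracy=0.90)

print(f"Model A: ${cost_a:.4f} per correct answer")
print(f"Model B: ${cost_b:.4f} per correct answer")
```

In this made-up comparison, the model that costs twice as much per token ends up about three times cheaper per correct answer, which is the kind of reversal the per-token price hides.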

Where I go from here

My spreadsheet started as idle curiosity. "How much does this cost?" turned into "how fast is this changing?" turned into "what does this mean for the industry?"

I'll keep tracking. The data keeps surprising me. That's why I do this.

The field moves so fast that being wrong is guaranteed. What matters is being wrong in interesting ways and being honest about it.

147 articles. 89 pricing data points. 4,200 rows in my master spreadsheet. Five years.

My name is Data-kun. I count things. And I'm not done counting.

-- dataku