Data Stories · December 23, 2024 · 6 min read

My 2024 prediction scorecard: reasoning models were my biggest miss

I didn't predict reasoning models at all. I thought scale would keep winning. Instead, o1 showed that inference-time compute is a whole new axis. My biggest hit? Predicting open source would reach GPT-4 level by year end.

Annual tradition. Ten predictions, honest scoring, public accountability. Let's see how Data-kun did in 2024.

(Spoiler: a notch worse than 2023 on the overall score, and much worse on my biggest miss.)

The scorecard

| # | Prediction (from Dec 31, 2023) | Result | Score |
|---|-------------------------------|--------|-------|
| 1 | GPT-5 (or equivalent next-gen OpenAI model) will launch in 2024 | GPT-4o launched (upgrade, not generation). o1 launched (different approach entirely). No "GPT-5." | Half |
| 2 | An open source model will match GPT-4 on MMLU | Llama 3.1 405B: 87.3% MMLU. GPT-4 Turbo: 86.4%. Yes. | Right |
| 3 | Total LLM API cost for GPT-4-quality will drop below $5/M output tokens | Fireworks AI Llama 3.1 405B: $3.00/M output tokens. Yes. | Right |
| 4 | At least one AI company valued at $1B+ will fail or be acquired at a discount | Stability AI had massive layoffs and leadership chaos. Inflection AI talent was absorbed by Microsoft. Both count. | Right |
| 5 | Context windows will exceed 1 million tokens for at least one commercial model | Google Gemini 1.5 Pro: 1 million tokens. February 2024. Easy. | Right |
| 6 | AI-generated video will go from demo to product | Sora announced but not widely available. Runway Gen-3, Kling, and others are products but limited. I'll call this half. | Half |
| 7 | Total AI VC funding will be $20-35B | Estimated at $65-70B including massive rounds for OpenAI, Anthropic, xAI. Way above my range. | Wrong |
| 8 | The EU AI Act will cause at least one major AI product to restrict European access | No major product restricted European access specifically because of the AI Act in 2024. The Act passed but enforcement hasn't started. | Wrong |
| 9 | Mistral AI will become a top-5 AI company by market presence | Mistral is arguably #5-6, behind OpenAI, Anthropic, Google, Meta, and arguably tied with Alibaba/Qwen. I'll be generous and call this half. | Half |
| 10 | I will need to expand my tracking spreadsheet to over 500 models | My spreadsheet has 487 models as of December. Close but not 500. | Wrong |

Final score: 4 right, 3 half-right, 3 wrong.

Year-over-year accuracy

| Year | Right | Half | Wrong | Hit rate (right + half*0.5) |
|------|-------|------|-------|---------------------------|
| 2020 | 5 | 2 | 3 | 60% |
| 2021 | 4 | 4 | 2 | 60% |
| 2022 | 4 | 3 | 3 | 55% |
| 2023 | 5 | 2 | 3 | 60% |
| 2024 | 4 | 3 | 3 | 55% |

Back to 55%. My accuracy range across five years: 55-60%. Consistently mediocre. At least I'm consistent.
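For anyone who wants to check my math, here is a minimal sketch of the hit-rate column. The tallies are copied from the table above; the names `SCORES` and `hit_rate` are just illustrative, not anything from my actual spreadsheet.

```python
# Yearly tallies copied from the year-over-year table: (right, half, wrong).
SCORES = {
    2020: (5, 2, 3),
    2021: (4, 4, 2),
    2022: (4, 3, 3),
    2023: (5, 2, 3),
    2024: (4, 3, 3),
}


def hit_rate(right: int, half: int, wrong: int) -> float:
    """Score a year as (right + half * 0.5) over the number of predictions."""
    total = right + half + wrong
    return (right + 0.5 * half) / total


# Hit rate per year, as a fraction between 0 and 1.
rates = {year: hit_rate(*tally) for year, tally in SCORES.items()}

for year, rate in rates.items():
    print(f"{year}: {rate:.0%}")
```

A half-right prediction counts as half a point, which is why 2021 (4 right, 4 half) ties with 2020 and 2023 (5 right, 2 half).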

My biggest miss: not predicting reasoning models

I predicted GPT-5 would be a bigger, better version of GPT-4. More parameters, more training data, higher benchmarks across the board. The scaling approach continuing upward.

Instead, OpenAI shipped o1, which isn't bigger at all. It's the same size (roughly GPT-4 class) but thinks longer before answering. Inference-time compute, not training-time scale.

This was a fundamental conceptual miss, not a detail miss. I had the wrong mental model of how AI progress would work in 2024.

What o1 showed: you don't need a bigger model to get dramatically better answers. You need a model that spends more time reasoning. MATH went from 60% (GPT-4o) to 83% (o1-preview) without any increase in model size.

I should have seen this coming. The "let's think step by step" prompting trick has been around since 2022. Chain-of-thought reasoning was well-established. The logical next step was training a model to do it automatically. But I was so fixated on "bigger models" that I missed "smarter inference."

My biggest hit: open source reaching GPT-4

Prediction 2 was "an open source model will match GPT-4 on MMLU." This was my highest-conviction prediction, and it landed exactly right.

Llama 3.1 405B scored 87.3% on MMLU, beating GPT-4 Turbo's 86.4%. On 9 of 10 standard benchmarks, the open model matched or exceeded the closed one.

I predicted this because the Chinchilla scaling laws plus Meta's resources made it inevitable. Meta had the compute, the data pipeline, and the strategic motivation (they want AI to be a commodity, not a moat for competitors).

My other big miss: AI funding

I predicted $20-35B in AI VC funding. The actual number was roughly $65-70B. I underestimated because I thought the 2023 funding pace ($28.9B) would moderate.

Instead, the largest AI funding rounds in history happened in 2024:

| Company | Round | Amount | Date |
|---------|-------|--------|------|
| OpenAI | Series ? | ~$6.6B | Oct 2024 |
| Anthropic | Various | ~$7.3B total 2024 | Throughout |
| xAI | Series B | $6B | Dec 2024 |
| Multiple others | Various | ~$45B total | Throughout |

Sources: Press reporting, Crunchbase, PitchBook estimates.

Three companies alone raised ~$20B. The long tail of smaller AI companies raised another ~$45B. The actual total came in at nearly 2x the top of my range.
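The arithmetic behind that miss, as a rough sketch. The amounts are the approximate figures from the funding table (in billions of USD); the variable names are made up for illustration, and the inputs are press estimates, so treat the outputs as ballpark numbers.

```python
# Approximate 2024 amounts from the funding table, in billions of USD.
BIG_ROUNDS_B = {"OpenAI": 6.6, "Anthropic": 7.3, "xAI": 6.0}
LONG_TAIL_B = 45.0  # "Multiple others" row

# My prediction from Dec 2023: total AI VC funding of $20-35B.
PREDICTED_LOW_B, PREDICTED_HIGH_B = 20.0, 35.0

top_three_b = sum(BIG_ROUNDS_B.values())
total_b = top_three_b + LONG_TAIL_B

# How far the estimated total landed above the top of the predicted range.
overshoot = total_b / PREDICTED_HIGH_B

print(f"Top three rounds: ~${top_three_b:.1f}B")
print(f"Estimated total:  ~${total_b:.1f}B")
print(f"Overshoot vs. upper bound: {overshoot:.2f}x")
```

The three biggest raisers on their own roughly match the low end of my predicted range for the entire market.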

2025 predictions

Ten predictions for 2025, to be scored next December:

  1. At least one AI agent will be used in production by a Fortune 500 company for customer-facing tasks (not just internal copilots)
  2. The cost of frontier-quality inference will drop below $1/M output tokens
  3. An open source reasoning model will match o1-preview on MATH (above 80%)
  4. AI coding assistants will be used by over 60% of professional developers
  5. At least one major AI company will be acquired for $5B+
  6. Context windows exceeding 1M tokens will become standard (at least 3 providers)
  7. A model with under 10B parameters will score 85%+ on MMLU
  8. Total AI VC funding will be $40-80B (wider range this time, learned my lesson)
  9. DeepSeek or another Chinese lab will release a model that tops the LMSYS Chatbot Arena
  10. I will finally build a proper dashboard for my tracking data instead of using spreadsheets (most ambitious prediction)

Prediction 3 is my highest conviction. DeepSeek R1 or a similar open reasoning model is coming, and the bar (83% on MATH) is reachable.

Prediction 10 is my lowest conviction. I've been saying "I should build a dashboard" since 2022.

See you next December. The spreadsheet is ready. The hubris is calibrated.


-- dataku
