My 2024 prediction scorecard: reasoning models were my biggest miss
I didn't predict reasoning models at all. I thought scale would keep winning. Instead, o1 showed that inference-time compute is a whole new axis. My biggest hit? Predicting open source would reach GPT-4 level by year end.
Annual tradition. Ten predictions, honest scoring, public accountability. Let's see how Data-kun did in 2024.
(Spoiler: slightly worse than 2023 on the numbers, and much worse on my biggest miss.)
The scorecard
| # | Prediction (from Dec 31, 2023) | Result | Score |
|---|-------------------------------|--------|-------|
| 1 | GPT-5 (or equivalent next-gen OpenAI model) will launch in 2024 | GPT-4o launched (upgrade, not generation). o1 launched (different approach entirely). No "GPT-5." | Half |
| 2 | An open source model will match GPT-4 on MMLU | Llama 3.1 405B: 87.3% MMLU. GPT-4 Turbo: 86.4%. Yes. | Right |
| 3 | Total LLM API cost for GPT-4-quality will drop below $5/M output tokens | Fireworks AI Llama 3.1 405B: $3.00/M output tokens. Yes. | Right |
| 4 | At least one AI company valued at $1B+ will fail or be acquired at a discount | Stability AI had massive layoffs and leadership chaos. Inflection AI talent was absorbed by Microsoft. Both count. | Right |
| 5 | Context windows will exceed 1 million tokens for at least one commercial model | Google Gemini 1.5 Pro: 1 million tokens. February 2024. Easy. | Right |
| 6 | AI-generated video will go from demo to product | Sora announced but not widely available. Runway Gen-3, Kling, and others are products but limited. I'll call this half. | Half |
| 7 | Total AI VC funding will be $20-35B | Estimated at $65-70B including massive rounds for OpenAI, Anthropic, xAI. Way above my range. | Wrong |
| 8 | The EU AI Act will cause at least one major AI product to restrict European access | No major product restricted European access specifically because of the AI Act in 2024. The Act passed but enforcement hasn't started. | Wrong |
| 9 | Mistral AI will become a top-5 AI company by market presence | Mistral is arguably #5-6, behind OpenAI, Anthropic, Google, Meta, and arguably tied with Alibaba/Qwen. I'll be generous and call this half. | Half |
| 10 | I will need to expand my tracking spreadsheet to over 500 models | My spreadsheet has 487 models as of December. Close but not 500. | Wrong |
Final score: 4 right, 3 half-right, 3 wrong.
Year-over-year accuracy
| Year | Right | Half | Wrong | Hit rate (right + 0.5 × half) |
|------|-------|------|-------|-------------------------------|
| 2020 | 5 | 2 | 3 | 60% |
| 2021 | 4 | 4 | 2 | 60% |
| 2022 | 4 | 3 | 3 | 55% |
| 2023 | 5 | 2 | 3 | 60% |
| 2024 | 4 | 3 | 3 | 55% |
Back to 55%. My accuracy range across five years: 55-60%. Consistently mediocre. At least I'm consistent.
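For anyone who wants to check my arithmetic, here is a minimal sketch of the hit-rate calculation. The tallies are copied from the table; the function and variable names are just ones I picked for this example.

```python
# Yearly tallies copied from the table above: (right, half, wrong).
tallies = {
    2020: (5, 2, 3),
    2021: (4, 4, 2),
    2022: (4, 3, 3),
    2023: (5, 2, 3),
    2024: (4, 3, 3),
}


def hit_rate(right: int, half: int, wrong: int) -> float:
    """Full credit for a hit, half credit for a half-hit, over all predictions."""
    total = right + half + wrong
    return (right + 0.5 * half) / total


rates = {year: hit_rate(*tally) for year, tally in tallies.items()}
for year, rate in rates.items():
    print(f"{year}: {rate:.0%}")
```

Half credit is a judgment call. Scoring halves as zero would drop every year by 10 to 20 points, so the weighting flatters me a little.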
My biggest miss: not predicting reasoning models
I predicted GPT-5 would be a bigger, better version of GPT-4. More parameters, more training data, higher benchmarks across the board. The scaling approach continuing upward.
Instead, OpenAI shipped o1, which, as far as anyone outside OpenAI can tell, isn't a bigger model. It appears to be roughly GPT-4 class in size (OpenAI hasn't published parameter counts) but thinks longer before answering. Inference-time compute, not training-time scale.
This was a fundamental conceptual miss, not a detail miss. I had the wrong mental model of how AI progress would work in 2024.
What o1 showed: you don't need a bigger model to get dramatically better answers. You need a model that spends more time reasoning. MATH went from 60% (GPT-4o) to 83% (o1-preview) without any increase in model size.
I should have seen this coming. The "let's think step by step" prompting trick has been around since 2022. Chain-of-thought reasoning was well-established. The logical next step was training a model to do it automatically. But I was so fixated on "bigger models" that I missed "smarter inference."
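To make "smarter inference" concrete, here is a toy sketch of one simple, publicly known way to spend more compute at inference time: sample several reasoning chains and take a majority vote on the final answer (often called self-consistency). This is not a claim about how o1 works internally. The `noisy_solver` function is a hypothetical stand-in for a model, and the 60% single-sample accuracy and the answer 42 are arbitrary numbers chosen for illustration.

```python
import random
from collections import Counter


def noisy_solver(rng: random.Random, correct: int, p_correct: float) -> int:
    """Hypothetical stand-in for one sampled reasoning chain.

    Returns the correct answer with probability p_correct, otherwise
    one of several nearby wrong answers.
    """
    if rng.random() < p_correct:
        return correct
    return correct + rng.choice([-3, -2, -1, 1, 2, 3])


def majority_answer(rng: random.Random, correct: int, p_correct: float, n_samples: int) -> int:
    """Sample n_samples chains and return the most common final answer."""
    votes = Counter(noisy_solver(rng, correct, p_correct) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000, p_correct: float = 0.6, seed: int = 0) -> float:
    """Fraction of trials where the majority vote lands on the correct answer."""
    rng = random.Random(seed)
    hits = sum(majority_answer(rng, 42, p_correct, n_samples) == 42 for _ in range(trials))
    return hits / trials


acc_1 = accuracy(n_samples=1)
acc_15 = accuracy(n_samples=15)
print(f"1 sample per question:   {acc_1:.0%}")
print(f"15 samples per question: {acc_15:.0%}")
```

Same "model", same size, more samples per question, higher accuracy. The toy only works because the wrong answers are scattered while the right one repeats, which is a generous assumption, but it shows why inference-time compute is its own axis.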
My biggest hit: open source reaching GPT-4
Prediction 2 was "an open source model will match GPT-4 on MMLU." This was my highest-conviction prediction, and it landed exactly right.
Llama 3.1 405B scored 87.3% on MMLU, beating GPT-4 Turbo's 86.4%. On 9 of 10 standard benchmarks, the open model matched or exceeded the closed one.
I predicted this because the Chinchilla scaling laws plus Meta's resources made it inevitable. Meta had the compute, the data pipeline, and the strategic motivation (they want AI to be a commodity, not a moat for competitors).
My other big miss: AI funding
I predicted $20-35B in AI VC funding. The actual number was roughly $65-70B. I underestimated because I thought the 2023 funding pace ($28.9B) would moderate.
Instead, the largest AI funding rounds in history happened in 2024:
| Company | Round | Amount | Date |
|---------|-------|--------|------|
| OpenAI | Series ? | ~$6.6B | Oct 2024 |
| Anthropic | Various | ~$7.3B total 2024 | Throughout |
| xAI | Series C | $6B | Dec 2024 |
| Multiple others | Various | ~$45B total | Throughout |
Sources: Press reporting, Crunchbase, PitchBook estimates.
Three companies alone raised ~$20B. The long tail of smaller AI companies raised another ~$45B. The total came in at nearly 2x the top of my range.
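A quick sketch of that arithmetic, using the approximate amounts from the table (in billions of USD). Variable names are mine, and the inputs are press estimates, so treat the output as rough.

```python
# Approximate 2024 amounts in billions of USD, copied from the table above.
rounds_b = {"OpenAI": 6.6, "Anthropic": 7.3, "xAI": 6.0}
long_tail_b = 45.0

# My predicted range for total AI VC funding, also in billions.
predicted_low_b, predicted_high_b = 20.0, 35.0

top_three_b = sum(rounds_b.values())
total_b = top_three_b + long_tail_b
overshoot = total_b / predicted_high_b  # how far above the top of my range

print(f"top three: ~${top_three_b:.1f}B")
print(f"total: ~${total_b:.1f}B")
print(f"total vs top of range: {overshoot:.2f}x")
```

The table rows sum to the low end of the $65-70B estimate, so the real overshoot is, if anything, a bit worse than this shows.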
2025 predictions
Ten predictions for 2025, to be scored next December:
1. At least one AI agent will be used in production by a Fortune 500 company for customer-facing tasks (not just internal copilots)
2. The cost of frontier-quality inference will drop below $1/M output tokens
3. An open source reasoning model will match o1-preview on MATH (above 80%)
4. AI coding assistants will be used by over 60% of professional developers
5. At least one major AI company will be acquired for $5B+
6. Context windows exceeding 1M tokens will become standard (at least 3 providers)
7. A model with under 10B parameters will score 85%+ on MMLU
8. Total AI VC funding will be $40-80B (wider range this time, learned my lesson)
9. DeepSeek or another Chinese lab will release a model that tops the LMSYS Chatbot Arena
10. I will finally build a proper dashboard for my tracking data instead of using spreadsheets (most ambitious prediction)
Prediction 3 is my highest conviction. DeepSeek R1 or a similar open reasoning model is coming, and the bar (above 80% on MATH, against o1-preview's 83%) is reachable.
Prediction 10 is my lowest conviction. I've been saying "I should build a dashboard" since 2022.
See you next December. The spreadsheet is ready. The hubris is calibrated.
-- dataku