Data Stories · October 16, 2023 · 6 min read

Every major LLM's context window, charted over time

In January 2023, 4K tokens was standard. By October, we've got 100K (Claude), 32K (GPT-4), and a rumored 128K (GPT-4 Turbo). I charted the context window growth curve. It's exponential.

I track a lot of numbers about LLMs. Parameter counts, benchmark scores, pricing, training costs. But the number that's been moving fastest in 2023 is one most people don't think about: context window size.

Let me show you the chart.

The context window timeline

| Date | Model | Context window (tokens) | Approx. words |
|------|-------|------------------------|---------------|
| Jun 2020 | GPT-3 | 2,048 | 1,500 |
| Nov 2022 | ChatGPT (GPT-3.5) | 4,096 | 3,000 |
| Jan 2023 | Claude v1 | 9,000 | 6,750 |
| Mar 2023 | GPT-4 (standard) | 8,192 | 6,100 |
| Mar 2023 | GPT-4 (32K) | 32,768 | 24,500 |
| Mar 2023 | Claude v1.3 (100K) | 100,000 | 75,000 |
| Jul 2023 | Claude 2 | 100,000 | 75,000 |
| Jul 2023 | Llama 2 | 4,096 | 3,000 |
| Sep 2023 | Mistral 7B | 8,192 (sliding window) | 6,100 |
| Oct 2023 | GPT-4 Turbo (rumored) | 128,000 | 96,000 |

Sources: OpenAI, Anthropic, Meta AI, Mistral AI, API documentation pages.

From 2,048 tokens in June 2020 to 100,000 in March 2023. That's a 49x increase in under 3 years. And if the GPT-4 Turbo rumors are accurate (128K context), the frontier will have grown 62x by end of 2023.
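The growth factors quoted above are simple ratios against GPT-3's 2,048-token window. Here is a minimal sketch of that arithmetic; the constant and function names are mine, and the 128K figure is the rumored one from the table, not a confirmed spec.

```python
# Context window sizes (tokens) taken from the timeline table above.
GPT3_CONTEXT = 2_048
CLAUDE_100K_CONTEXT = 100_000
RUMORED_GPT4_TURBO_CONTEXT = 128_000  # rumored at time of writing, not confirmed


def growth_factor(new_tokens: int, baseline_tokens: int = GPT3_CONTEXT) -> float:
    """Return how many times larger new_tokens is than the baseline window."""
    return new_tokens / baseline_tokens


if __name__ == "__main__":
    print(f"100K vs GPT-3: {growth_factor(CLAUDE_100K_CONTEXT):.1f}x")
    print(f"128K vs GPT-3: {growth_factor(RUMORED_GPT4_TURBO_CONTEXT):.1f}x")
```

The first ratio rounds to the 49x quoted above; the second comes out at 62.5x, which I've been rounding down to 62x.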

The growth rate is exponential

Let me plot the maximum context window available at any given time:

| Date | Max available context | Months since GPT-3 | Growth factor (vs GPT-3) |
|------|----------------------|--------------------|-----------------------|
| Jun 2020 | 2,048 | 0 | 1x |
| Dec 2021 | 2,048 | 18 | 1x |
| Nov 2022 | 4,096 | 29 | 2x |
| Jan 2023 | 9,000 | 31 | 4.4x |
| Mar 2023 | 100,000 | 33 | 48.8x |
| Oct 2023 | 100,000 | 40 | 48.8x |

Context windows were flat for more than two years (GPT-3's 2K limit was the ceiling from June 2020 until ChatGPT arrived in November 2022), then exploded in early 2023. The jump from 4K to 100K happened in just 4 months (November 2022 to March 2023).

Strictly speaking, this looks more like a step function than smooth exponential growth. Each jump corresponds to a specific release:

  • 4K to 9K: Anthropic's initial Claude release
  • 9K to 32K: OpenAI's GPT-4 32K variant
  • 32K to 100K: Anthropic's context extension for Claude
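The "max available context" series is just a running maximum over the release timeline, which is why it moves in steps. A rough sketch of how it can be derived, assuming the rows are sorted by date and using the first day of each month as a stand-in for the real release date (the `TIMELINE` list and helper names are my own):

```python
from datetime import date
from itertools import accumulate

# (release month, model, context window in tokens), copied from the timeline table.
# Day-of-month is a placeholder; only year and month are used.
TIMELINE = [
    (date(2020, 6, 1), "GPT-3", 2_048),
    (date(2022, 11, 1), "ChatGPT (GPT-3.5)", 4_096),
    (date(2023, 1, 1), "Claude v1", 9_000),
    (date(2023, 3, 1), "GPT-4 (standard)", 8_192),
    (date(2023, 3, 1), "GPT-4 (32K)", 32_768),
    (date(2023, 3, 1), "Claude v1.3 (100K)", 100_000),
    (date(2023, 7, 1), "Claude 2", 100_000),
    (date(2023, 7, 1), "Llama 2", 4_096),
    (date(2023, 9, 1), "Mistral 7B", 8_192),
]


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def frontier(timeline):
    """Return (months since first release, running max context, growth factor) per row."""
    start, baseline = timeline[0][0], timeline[0][2]
    running_max = accumulate((tokens for _, _, tokens in timeline), max)
    return [
        (months_between(start, released), best, best / baseline)
        for (released, _, _), best in zip(timeline, running_max)
    ]


if __name__ == "__main__":
    for months, best, factor in frontier(TIMELINE):
        print(f"month {months:>2}: {best:>7,} tokens ({factor:.1f}x)")
```

Releases that don't beat the current record (Llama 2, Mistral 7B) leave the running maximum unchanged, which is exactly the flat stretch between steps.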

Why context windows matter (with numbers)

Here's what you can fit in different context sizes:

| Context size | What fits | Practical use case |
|-------------|----------|-------------------|
| 2K tokens | ~1.5 pages | A single prompt with instructions |
| 4K tokens | ~3 pages | A short email chain |
| 8K tokens | ~6 pages | A long blog post |
| 32K tokens | ~24 pages | A research paper |
| 100K tokens | ~75 pages | A short book, a full codebase |
| 128K tokens | ~96 pages | A long novel |

The jump from 4K to 100K isn't just a quantitative improvement. It's a qualitative one. At 4K tokens, you can't even fit a single research paper in the context. At 100K, you can feed the model an entire book and ask questions about it. That's a different category of use case.
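If you want to sanity-check whether a document fits, the conversion behind the table is easy to sketch. This assumes roughly 0.75 words per token (the ratio implied by the "approx. words" column) and about 1,000 words per page (inferred from 100K tokens mapping to ~75 pages). Both are rules of thumb, not tokenizer output, and the function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb implied by the "approx. words" column
WORDS_PER_PAGE = 1_000  # assumption inferred from the table (100K tokens ~ 75 pages)


def estimate_pages(context_tokens: int) -> float:
    """Rough number of pages that fit in a context window."""
    words = context_tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


def fits_in_context(document_words: int, context_tokens: int) -> bool:
    """True if a document of document_words likely fits in context_tokens."""
    tokens_needed = document_words / WORDS_PER_TOKEN
    return tokens_needed <= context_tokens


if __name__ == "__main__":
    paper_words = 8_000  # hypothetical research paper length
    for window in (4_096, 32_768, 100_000):
        print(window, round(estimate_pages(window), 1), fits_in_context(paper_words, window))
```

For anything that matters, count tokens with the provider's actual tokenizer; real ratios vary a lot between prose, code, and non-English text.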

The infrastructure implications

Larger context windows cost more to serve. The computational cost of attention scales quadratically with context length:

| Context length | Relative attention cost | Relative memory usage |
|---------------|------------------------|---------------------|
| 4K | 1x | 1x |
| 8K | 4x | 2x |
| 32K | 64x | 8x |
| 100K | 625x | 25x |
| 128K | 1,024x | 32x |

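These are theoretical worst-case numbers for standard attention. In practice, several techniques soften the blow. Flash Attention makes exact attention faster and much less memory-hungry without changing the quadratic operation count, and sliding window attention (Mistral) caps how many tokens each position attends to. ALiBi position encoding solves a different problem: it helps a model handle sequences longer than the ones it was trained on. For full attention, though, the quadratic relationship is still a fundamental constraint.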
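The numbers in the table above come from squaring the length ratio for attention cost and taking the plain ratio for memory. A small sketch of that calculation, working in thousands of tokens the way the table's rounded labels do (so 4K is treated as 4 and 128K as 128); the function names are mine:

```python
def relative_attention_cost(context_k: float, baseline_k: float = 4) -> float:
    """Quadratic scaling: cost grows with the square of the length ratio."""
    return (context_k / baseline_k) ** 2


def relative_memory(context_k: float, baseline_k: float = 4) -> float:
    """Linear scaling: memory grows in proportion to the length ratio."""
    return context_k / baseline_k


if __name__ == "__main__":
    for context_k in (4, 8, 32, 100, 128):
        cost = relative_attention_cost(context_k)
        memory = relative_memory(context_k)
        print(f"{context_k:>3}K: attention {cost:,.0f}x, memory {memory:.0f}x")
```

Using exact token counts instead (4,096 as the baseline) shifts the 100K row to roughly 596x, so treat the table as order-of-magnitude guidance.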

This is why long-context API calls are expensive. When Anthropic charges $0.008/1K input tokens, sending a 100K prompt costs $0.80 just for the input. The computational cost goes a long way toward explaining the price.
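The $0.80 figure is straightforward per-token arithmetic. A minimal sketch, using the $0.008 per 1K input tokens quoted above; the function name is hypothetical, and output tokens (which are billed separately) are ignored.

```python
def input_cost_usd(prompt_tokens: int, price_per_1k_usd: float) -> float:
    """Cost of the input side of one API call, in US dollars."""
    return prompt_tokens / 1_000 * price_per_1k_usd


if __name__ == "__main__":
    price = 0.008  # USD per 1K input tokens, as quoted in the article
    for prompt_tokens in (4_000, 32_000, 100_000):
        print(f"{prompt_tokens:>7,} tokens: ${input_cost_usd(prompt_tokens, price):.2f}")
```

Note that per-token pricing makes the bill grow linearly with prompt length, even though the underlying attention cost grows faster than that.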

Who's winning and losing

| Provider | Current max context | Open source? | Status |
|----------|-------------------|-------------|--------|
| Anthropic Claude 2 | 100K | No | Leading |
| OpenAI GPT-4 | 32K (128K rumored) | No | Catching up |
| Google PaLM 2 | 32K | No | Competitive |
| Meta Llama 2 | 4K | Yes | Behind |
| Mistral 7B | 8K (sliding window) | Yes | Clever approach |

Anthropic is the clear leader in context length. But if OpenAI ships 128K with GPT-4 Turbo (which I expect at their upcoming DevDay on November 6), they'll take the crown.

The open source gap is notable. Llama 2's 4K context is the same as ChatGPT at launch. The community has been extending it (with RoPE scaling and other techniques), but native long context in open source models lags by roughly 6-12 months.

Mistral's sliding window attention is an elegant middle ground: 8K effective context with much lower memory cost. It doesn't compete with 100K raw context, but it's the right trade-off for a 7B model.
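To see why a sliding window is cheaper, it helps to count how many (query, key) pairs attention has to score. The sketch below is a toy counter, not Mistral's implementation: it assumes causal attention where each token sees itself plus earlier tokens, optionally capped at a `window` of recent tokens. The window sizes used are illustrative, not Mistral's actual configuration.

```python
from typing import Optional


def attended_pairs(seq_len: int, window: Optional[int] = None) -> int:
    """Count (query, key) pairs scored under causal attention.

    With window=None, each token attends to itself and every earlier token.
    With a window, each token attends to at most `window` most recent tokens,
    itself included.
    """
    total = 0
    for query_pos in range(seq_len):
        visible = query_pos + 1  # itself plus all earlier positions
        if window is not None:
            visible = min(visible, window)
        total += visible
    return total


if __name__ == "__main__":
    seq_len = 8_192
    full = attended_pairs(seq_len)
    windowed = attended_pairs(seq_len, window=1_024)  # hypothetical window size
    print(f"full causal: {full:,} pairs")
    print(f"windowed:    {windowed:,} pairs ({windowed / full:.1%} of full)")
```

Full causal attention grows with the square of the sequence length, while the windowed count grows roughly as sequence length times window size, which is where the savings come from.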

What I think happens next

My predictions for context windows by mid-2024:

| Prediction | Confidence |
|-----------|------------|
| At least one model will have 1M+ token context | 70% |
| Standard context for new models will be 32K minimum | 85% |
| Open source models will reach 100K context | 60% |
| Cost per 100K input tokens will drop below $0.50 | 75% |

The context window race is a real differentiator right now, especially for enterprise use cases (legal document review, codebase analysis, meeting transcripts). Whoever has the longest reliable context window gets those contracts.

But I think context length will commoditize within a year. When everyone has 100K+, the differentiator shifts back to quality and pricing. Until then, Anthropic has a genuine competitive advantage, and they're using it well.


-- dataku