Model Comparisons · July 26, 2021 · 7 min read

GPT-3 vs GPT-J: the first real open source challenger, in data

EleutherAI released GPT-J-6B, and I benchmarked it against the GPT-3 model closest to it in size. For a free model, the numbers are surprisingly close on some tasks.

Something happened in June that I think will matter a lot more than people realize right now.

EleutherAI released GPT-J-6B. A 6-billion-parameter language model, trained on The Pile, with weights you can download and run yourself. Free. Open source. No waitlist, no API key, no terms of service telling you what you can't generate.

I've been benchmarking it against GPT-3 for the past month. Not against GPT-3 175B (that would be unfair, it's 29x bigger), but against Curie, the GPT-3 engine generally estimated at 6.7B parameters (OpenAI has not published official sizes for its API engines). Same ballpark in size. Fair fight.

The results are closer than I expected.

The test setup

I ran both models on the same set of tasks. For GPT-J, I used the Hugging Face hosted model and a self-hosted instance on a rented A100. For GPT-3 Curie, I used OpenAI's API.

All tests used the same prompts, same temperature (0.7 for creative tasks, 0 for factual), same max token limits. I ran each test 3 times and averaged the scores.

Test categories:

  • Factual Q&A (50 questions from TriviaQA)
  • Common sense reasoning (30 HellaSwag-style completions)
  • Code generation (40 Python function prompts)
  • Creative writing (20 story continuation prompts)
  • Summarization (20 article summaries)
  • Classification (50 sentiment analysis examples)
  • Translation (20 English-to-French, 20 English-to-German)
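The setup above (same prompts, a fixed temperature per task type, three runs averaged) can be sketched as a small harness. This is a minimal illustration, not the script used for these benchmarks: the `generate` callable, the `contains_answer` scorer, and the `max_tokens` default are hypothetical stand-ins for whatever client and metric you plug in.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interfaces: a model is any callable mapping
# (prompt, temperature, max_tokens) to text; a scorer maps
# (output, reference) to a number.
Generate = Callable[[str, float, int], str]
Score = Callable[[str, str], float]

# Temperatures from the setup: 0.7 for creative tasks, 0 for factual ones.
TEMPERATURE = {"creative": 0.7, "factual": 0.0}


def contains_answer(output: str, reference: str) -> float:
    """Return 1.0 if the reference answer appears in the output, else 0.0."""
    return 1.0 if reference.strip().lower() in output.lower() else 0.0


def run_task(
    generate: Generate,
    examples: List[Tuple[str, str]],
    score: Score,
    kind: str,
    max_tokens: int = 64,  # assumed limit, not taken from the article
    runs: int = 3,
) -> float:
    """Score every example, repeat the whole pass `runs` times, average."""
    temperature = TEMPERATURE[kind]
    run_scores = []
    for _ in range(runs):
        scores = [
            score(generate(prompt, temperature, max_tokens), reference)
            for prompt, reference in examples
        ]
        run_scores.append(mean(scores))
    return mean(run_scores)
```

Passing the same `examples`, `kind`, and `max_tokens` to a GPT-J wrapper and a Curie wrapper keeps the comparison like-for-like; only the `generate` callable changes.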

Head-to-head results

| Task | GPT-J-6B | GPT-3 Curie (6.7B) | Winner | Margin |
|------|----------|-------------------|--------|--------|
| Factual Q&A (accuracy %) | 38.2% | 41.6% | Curie | +3.4 |
| Common sense (accuracy %) | 62.1% | 66.8% | Curie | +4.7 |
| Code generation (quality 1-5) | 3.1 | 3.3 | Curie | +0.2 |
| Creative writing (quality 1-5) | 3.6 | 3.4 | GPT-J | +0.2 |
| Summarization (quality 1-5) | 3.2 | 3.5 | Curie | +0.3 |
| Classification (accuracy %) | 84.6% | 86.2% | Curie | +1.6 |
| Translation EN-FR (BLEU) | 24.1 | 28.7 | Curie | +4.6 |
| Translation EN-DE (BLEU) | 19.8 | 23.4 | Curie | +3.6 |

Curie wins 7 out of 8 categories. But look at the margins. On code generation, it's basically a tie (0.2 difference on a 5-point scale is noise). On classification, GPT-J hits 84.6% accuracy, which is close enough to be useful for most applications.

And GPT-J actually wins on creative writing, though by the same 0.2-point margin I just called noise, so on score alone it's a tie. What surprised me was the ratings behind it: I had three people blind-rate the story continuations, and they consistently preferred GPT-J's outputs as "more interesting" and "less formulaic." Small sample, subjective metric, but it was consistent enough to note.
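For anyone who wants to re-derive the winner and margin columns, here is a minimal sketch. The scores are copied from the head-to-head table; the `compare` helper is a hypothetical name introduced for illustration.

```python
# Scores from the head-to-head table, as (GPT-J-6B, GPT-3 Curie).
RESULTS = {
    "Factual Q&A (accuracy %)": (38.2, 41.6),
    "Common sense (accuracy %)": (62.1, 66.8),
    "Code generation (quality 1-5)": (3.1, 3.3),
    "Creative writing (quality 1-5)": (3.6, 3.4),
    "Summarization (quality 1-5)": (3.2, 3.5),
    "Classification (accuracy %)": (84.6, 86.2),
    "Translation EN-FR (BLEU)": (24.1, 28.7),
    "Translation EN-DE (BLEU)": (19.8, 23.4),
}


def compare(results):
    """Return (task, winner, absolute margin) for each task."""
    rows = []
    for task, (gptj, curie) in results.items():
        if gptj > curie:
            winner = "GPT-J"
        elif curie > gptj:
            winner = "Curie"
        else:
            winner = "Tie"
        # Round to one decimal to match the table and hide float noise.
        rows.append((task, winner, round(abs(curie - gptj), 1)))
    return rows


for task, winner, margin in compare(RESULTS):
    print(f"{task:32s} {winner:6s} +{margin}")
```

Margins are in each metric's own units (accuracy points, rating points, BLEU), so they are not directly comparable across rows.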

The cost comparison (this is the real story)

Performance matters, but the cost gap is where this gets truly interesting.

| Metric | GPT-J-6B (self-hosted) | GPT-J-6B (Hugging Face) | GPT-3 Curie (API) |
|--------|----------------------|------------------------|-------------------|
| Price model | GPU rental | Free tier / pay per compute | Per token |
| Cost per 1K tokens | ~$0.0004* | Free (limited) | $0.006 |
| Monthly cost (1M tokens/day) | ~$360 | N/A | $180 |
| Monthly cost (10M tokens/day) | ~$360 | N/A | $1,800 |

*Self-hosted cost assumes an A100 rental at ~$2.50/hour, generating approximately 6,000 tokens per second.

At low volume, GPT-3 Curie is actually cheaper because you don't have a fixed GPU cost: at 1M tokens/day it runs $180/month against a flat ~$360 for the GPU. The lines cross at about 2M tokens/day, where Curie's bill also reaches ~$360, and beyond that GPT-J self-hosted starts winning. At 10M tokens/day, the difference is massive: $360/month vs $1,800/month.
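The break-even point is simple arithmetic: a flat monthly GPU cost against a per-token API price. A minimal sketch follows, using the table's $0.006 per 1K tokens and ~$360/month; the function names and the 30-day month are assumptions made for illustration.

```python
DAYS_PER_MONTH = 30  # assumption; matches the table's $180 at 1M tokens/day


def api_monthly_cost(tokens_per_day: float, price_per_1k: float = 0.006) -> float:
    """Monthly bill for a per-token API at a given daily volume."""
    return tokens_per_day / 1000 * price_per_1k * DAYS_PER_MONTH


def gpu_monthly_cost(hourly_rate: float, billed_hours: float) -> float:
    """Monthly bill for a rented GPU, independent of token volume."""
    return hourly_rate * billed_hours


def break_even_tokens_per_day(fixed_monthly_cost: float,
                              price_per_1k: float = 0.006) -> float:
    """Daily volume at which the API bill equals the fixed GPU cost."""
    return fixed_monthly_cost / (price_per_1k * DAYS_PER_MONTH) * 1000


print(round(api_monthly_cost(1_000_000)))       # Curie at 1M tokens/day
print(round(api_monthly_cost(10_000_000)))      # Curie at 10M tokens/day
print(round(break_even_tokens_per_day(360)))    # volume where both cost $360
```

The fixed cost is the sensitive input. Under these assumptions, $360/month corresponds to 144 billed hours at $2.50/hour; an instance billed around the clock at that rate (720 hours) would come to $1,800/month and push the break-even to roughly 10M tokens/day, so plug in your own rental terms before drawing conclusions.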

And this ignores the other advantages of self-hosting: no rate limits, no content policy filtering, no dependency on an external API that might change pricing or terms.

Training data differences

GPT-3 was trained on a mix of Common Crawl, WebText2, Books1, Books2, and Wikipedia. The exact dataset composition isn't fully public.

GPT-J was trained on The Pile, EleutherAI's 800GB open-source dataset. The Pile includes:

| Dataset component | Size (GB) | Share |
|------------------|-----------|-------|
| Pile-CC (Common Crawl) | 227 | 28.4% |
| PubMed Central | 90 | 11.3% |
| Books3 | 101 | 12.6% |
| OpenWebText2 | 63 | 7.9% |
| ArXiv | 56 | 7.0% |
| GitHub | 95 | 11.9% |
| Wikipedia | 17 | 2.1% |
| Other (StackExchange, FreeLaw, etc.) | 151 | 18.8% |

That GitHub component (11.9% of training data) is probably why GPT-J performs surprisingly well on code generation. It's seen more code during training than you'd expect from a general-purpose language model.

The PubMed and ArXiv components also mean GPT-J has more scientific text exposure. I noticed this in my testing: GPT-J handles scientific terminology noticeably better than Curie on questions about biology and physics.

What this means for the field

Let me be clear about what I'm NOT saying. GPT-J is not a GPT-3 killer. GPT-3 175B is a completely different beast. I ran a few of my tests against Davinci (175B), and it crushed GPT-J across the board. The gap between 6B and 175B parameters is real and large.

What I AM saying is this: the gap between the best open source model and the best closed model at the same parameter count has collapsed to single-digit percentage points in most tasks.

Let me put that in historical context:

| Date | Best open model | Best closed model (similar size) | Performance gap |
|------|----------------|--------------------------------|----------------|
| 2019 | GPT-2 (1.5B) | No comparable closed model | N/A |
| Mid 2020 | GPT-2 (1.5B) | GPT-3 Babbage (1.3B) | ~15-20% |
| Early 2021 | GPT-Neo 2.7B | GPT-3 Curie (6.7B) | ~20-25% (size disadvantage) |
| Mid 2021 | GPT-J-6B | GPT-3 Curie (6.7B) | ~3-8% |

That trajectory is steep. EleutherAI and the open source community are closing the gap at a pace that should make OpenAI pay attention. Not because open source will replicate GPT-3 175B tomorrow, but because the community of people who can train, modify, and deploy large language models without depending on a single company's API is growing fast.

The latency question

One thing I didn't fully capture in the benchmark scores: latency. GPT-3 Curie via API returns results in about 200-400ms for short completions. GPT-J on a single A100 takes about 300-500ms for the same length. On a lesser GPU (like a V100), it balloons to 800-1200ms.

For interactive applications, this matters. For batch processing, it doesn't. Know your use case.
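If you want to check latency for your own use case, a rough timing loop is enough to start. This is a minimal sketch, not the method behind the numbers above: `call` is a hypothetical zero-argument function wrapping one completion request (an API call or a local forward pass), and the warmup and run counts are arbitrary defaults.

```python
import time
from statistics import median
from typing import Callable, Dict


def measure_latency_ms(call: Callable[[], object],
                       warmup: int = 1,
                       runs: int = 5) -> Dict[str, float]:
    """Time repeated calls and report min, median, and max in milliseconds."""
    # Warmup calls absorb one-time costs (connection setup, model loading).
    for _ in range(warmup):
        call()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000)

    return {"min": min(timings), "median": median(timings), "max": max(timings)}


# Stand-in workload so the sketch runs without a model or API key.
stats = measure_latency_ms(lambda: time.sleep(0.01))
print({name: round(value, 1) for name, value in stats.items()})
```

Use the same prompt and completion length for every model you time, and look at the median rather than a single run, since network jitter can dominate short completions.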

My honest assessment

GPT-J-6B is the first open source language model I'd consider using in a real product. Not for everything. Not as a Davinci replacement. But for classification, simple Q&A, content generation, and code assistance at the 6B-parameter scale, it's genuinely competitive with the commercial option.

That's a sentence I couldn't have written a year ago. The open source AI world is getting serious. And with EleutherAI already working on larger models (GPT-NeoX-20B is coming), this is just the beginning.

I'll keep benchmarking as new models drop. The spreadsheet grows.


-- dataku
