ChatGPT vs GPT-3: same model family, wildly different results. The data.
ChatGPT is based on GPT-3.5, but it behaves nothing like the raw API. I ran 200 identical prompts on both. ChatGPT refused 15% of them; GPT-3 refused 0.5%. RLHF changed more than people think.
Everyone keeps saying ChatGPT is "just GPT-3 with a chat interface." I've been testing both side by side for two weeks and that description is wrong in ways that matter.
ChatGPT uses GPT-3.5 (a fine-tuned version of GPT-3 with RLHF training). The GPT-3 side of this comparison is text-davinci-003, accessed through OpenAI's API; note that it is an instruction-tuned davinci variant, not the original base model. Same model family. Same company. Very different behavior.
I ran 200 identical prompts through both and categorized every difference. Here's what I found.
The test
200 prompts across 8 categories:
- Factual Q&A (25)
- Creative writing (25)
- Code generation (25)
- Opinion/analysis (25)
- Sensitive topics (25)
- Instruction following (25)
- Roleplay/persona (25)
- Math/reasoning (25)
Each prompt was sent to both ChatGPT (web interface) and GPT-3 text-davinci-003 (API, temperature 0.7). I recorded whether each model answered, refused, or partially answered, then rated response quality on a 1-5 scale.
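For anyone who wants to reproduce the bookkeeping, here is a minimal sketch of how each response could be recorded and tallied. The `Result` fields and the `refusal_rate` helper are illustrative assumptions, not the script used for this post, and the sample rows are toy data rather than real results.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout; field names are assumptions for illustration.
@dataclass(frozen=True)
class Result:
    category: str            # e.g. "Sensitive topics"
    model: str               # "gpt3" or "chatgpt"
    outcome: str             # "answered", "refused", or "partial"
    quality: Optional[int]   # 1-5 rating, None when the model refused

def refusal_rate(results, model, category=None):
    """Share of a model's prompts (optionally in one category) labeled 'refused'."""
    rows = [
        r for r in results
        if r.model == model and (category is None or r.category == category)
    ]
    if not rows:
        return 0.0
    return sum(r.outcome == "refused" for r in rows) / len(rows)

# Toy data, not the real results.
sample = [
    Result("Sensitive topics", "chatgpt", "refused", None),
    Result("Sensitive topics", "chatgpt", "answered", 4),
    Result("Sensitive topics", "gpt3", "answered", 3),
    Result("Sensitive topics", "gpt3", "answered", 3),
]

print(refusal_rate(sample, "chatgpt"))  # 0.5
print(refusal_rate(sample, "gpt3"))     # 0.0
```

Keeping the outcome label separate from the quality score makes it easy to restrict the quality comparison to prompts that both models answered.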
The refusal gap
This is the most striking finding.
| Category | GPT-3 refusal rate | ChatGPT refusal rate | Difference (pct. points) |
|----------|--------------------|----------------------|--------------------------|
| Factual Q&A | 0% | 0% | 0 |
| Creative writing | 0% | 4% | +4 |
| Code generation | 0% | 8% | +8 |
| Opinion/analysis | 0% | 12% | +12 |
| Sensitive topics | 4% | 64% | +60 |
| Instruction following | 0% | 4% | +4 |
| Roleplay/persona | 0% | 28% | +28 |
| Math/reasoning | 0% | 0% | 0 |
| Overall | 0.5% | 15% | +14.5 |
GPT-3 almost never refuses: its 0.5% refusal rate is a single prompt out of 200 that triggered a content filter. ChatGPT refused 30 of 200, or 15%.
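The overall figures follow directly from the per-category percentages. A short sketch of that arithmetic, assuming each percentage in the table is a share of the 25 prompts in its category (so one prompt equals 4 percentage points); the variable names are mine:

```python
PROMPTS_PER_CATEGORY = 25

# Refusal percentages copied from the table above: (GPT-3, ChatGPT).
refusal_pct = {
    "Factual Q&A": (0, 0),
    "Creative writing": (0, 4),
    "Code generation": (0, 8),
    "Opinion/analysis": (0, 12),
    "Sensitive topics": (4, 64),
    "Instruction following": (0, 4),
    "Roleplay/persona": (0, 28),
    "Math/reasoning": (0, 0),
}

def refused_count(pct):
    """Convert a category percentage back into a number of refused prompts."""
    return round(pct * PROMPTS_PER_CATEGORY / 100)

total = PROMPTS_PER_CATEGORY * len(refusal_pct)
gpt3_refused = sum(refused_count(g) for g, _ in refusal_pct.values())
chatgpt_refused = sum(refused_count(c) for _, c in refusal_pct.values())

gpt3_rate = 100 * gpt3_refused / total
chatgpt_rate = 100 * chatgpt_refused / total

print(gpt3_refused, chatgpt_refused, total)  # 1 30 200
print(gpt3_rate, chatgpt_rate)               # 0.5 15.0
print(chatgpt_rate / gpt3_rate)              # 30.0
```

Note how coarse the per-category numbers are: with 25 prompts per category, a single refusal moves a category's rate by 4 points.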
The sensitive topics category is the most dramatic: 64% refusal rate for ChatGPT vs. 4% for GPT-3. These weren't harmful prompts. They included questions about controversial historical events, political opinions, and hypothetical ethical dilemmas. ChatGPT is trained to be cautious, and it's cautious to a degree that sometimes prevents useful answers.
The roleplay refusal rate (28%) also caught my eye. ChatGPT often refuses to adopt personas that GPT-3 handles without complaint. "Pretend you're a CEO writing a memo about layoffs" gets a response from GPT-3 and a refusal from ChatGPT.
Quality comparison (on prompts both answered)
When both models actually produce an answer, which one is better?
| Category | GPT-3 avg quality | ChatGPT avg quality | Winner |
|----------|-------------------|---------------------|--------|
| Factual Q&A | 3.8 | 4.3 | ChatGPT |
| Creative writing | 3.6 | 4.1 | ChatGPT |
| Code generation | 3.4 | 4.2 | ChatGPT |
| Opinion/analysis | 3.2 | 4.0 | ChatGPT |
| Sensitive topics | 3.0 | 3.5* | ChatGPT |
| Instruction following | 3.3 | 4.4 | ChatGPT |
| Roleplay/persona | 3.5 | 3.8* | ChatGPT |
| Math/reasoning | 3.1 | 3.9 | ChatGPT |
*Only scored on prompts where ChatGPT didn't refuse.
ChatGPT wins every category on quality. The gap ranges from 0.3 (roleplay) to 1.1 (instruction following). The instruction following gap is the most meaningful. When you tell ChatGPT "write a 3-paragraph essay with a counterargument in paragraph 2," it actually follows the structure. GPT-3 is much more likely to ignore structural instructions.
This is the RLHF effect. The InstructGPT paper showed that human feedback training dramatically improves instruction adherence, and ChatGPT is the consumer-facing implementation of that research.
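The quality gaps quoted above can be recomputed from the table's averages. This is only a sketch of the subtraction, with names of my own choosing; it uses the published averages, not per-prompt scores.

```python
# Average quality scores copied from the table: (GPT-3, ChatGPT).
quality = {
    "Factual Q&A": (3.8, 4.3),
    "Creative writing": (3.6, 4.1),
    "Code generation": (3.4, 4.2),
    "Opinion/analysis": (3.2, 4.0),
    "Sensitive topics": (3.0, 3.5),
    "Instruction following": (3.3, 4.4),
    "Roleplay/persona": (3.5, 3.8),
    "Math/reasoning": (3.1, 3.9),
}

# ChatGPT minus GPT-3, rounded to one decimal to hide floating-point noise.
gaps = {cat: round(chatgpt - gpt3, 1) for cat, (gpt3, chatgpt) in quality.items()}

smallest = min(gaps, key=gaps.get)
largest = max(gaps, key=gaps.get)

for cat, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{cat:<22} +{gap}")

print("smallest gap:", smallest, gaps[smallest])  # Roleplay/persona 0.3
print("largest gap:", largest, gaps[largest])     # Instruction following 1.1
```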
Response style differences
Beyond quality scores, the two models write differently. I categorized the stylistic differences:
| Characteristic | GPT-3 | ChatGPT |
|----------------|-------|---------|
| Response length (avg tokens) | 142 | 287 |
| Uses bullet points | 18% of responses | 62% of responses |
| Includes caveats/disclaimers | 8% of responses | 71% of responses |
| Hedging language ("it depends") | 12% | 44% |
| Asks for clarification | 2% | 18% |
| Structured (headers, lists) | 15% | 58% |
ChatGPT's responses are about twice as long on average (287 vs. 142 tokens). It uses bullet points roughly 3.4 times as often. It includes caveats in 71% of responses (compared to GPT-3's 8%). And it hedges nearly four times as often (44% vs. 12%).
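The multipliers are just the ChatGPT figure divided by the GPT-3 figure for each row of the table. A quick sketch of that division, with the numbers copied from the table and the dictionary keys chosen by me:

```python
# Style measurements copied from the table: (GPT-3, ChatGPT).
# The first row is in tokens, the rest are percentages of responses.
style = {
    "Response length (avg tokens)": (142, 287),
    "Uses bullet points": (18, 62),
    "Includes caveats/disclaimers": (8, 71),
    "Hedging language": (12, 44),
    "Asks for clarification": (2, 18),
    "Structured (headers, lists)": (15, 58),
}

# How many times larger the ChatGPT figure is than the GPT-3 figure.
ratios = {name: chatgpt / gpt3 for name, (gpt3, chatgpt) in style.items()}

for name, ratio in ratios.items():
    print(f"{name:<30} {ratio:.1f}x")
```

The biggest relative jumps are in caveats and clarifying questions, both of which start from a very low GPT-3 baseline, so their multipliers are sensitive to a handful of responses.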
RLHF trained ChatGPT to be thorough, structured, and cautious. GPT-3 is more direct, shorter, and gives you exactly what you asked for without the "however, there are several perspectives to consider..." wrapper.
Whether ChatGPT's style is better depends on what you want. For casual users who need help with homework or writing, the longer, structured, caveated style is probably better. For developers who need a quick answer, GPT-3's directness is more efficient.
The refusal patterns
I looked at exactly what triggers ChatGPT's refusals. The patterns are clear:
| Trigger type | Refusal rate (ChatGPT) |
|--------------|------------------------|
| Mentions of violence (even historical/academic) | 52% |
| Requests for opinions on public figures | 38% |
| Roleplay as specific real people | 72% |
| Hypothetical unethical scenarios | 48% |
| Medical/legal advice requests | 44% |
| Explicit content requests | 88% |
| Hacking/security discussion | 36% |
Some of these refusals are reasonable (explicit content, actual hacking instructions). Some are overly cautious (refusing to discuss historical violence in an academic context, declining to have an opinion on public policy).
The roleplay-as-real-people refusal rate of 72% is noteworthy. "Write a speech as if you were Abraham Lincoln about modern politics" gets refused more often than not. GPT-3 does it happily and the results are often quite good.
What RLHF actually changed
Looking at all 200 prompts in aggregate, the picture is clear. RLHF did three things to the model:
- Made it better at following instructions. Quality scores are higher across the board. ChatGPT understands what you want and structures its response accordingly.
- Made it much more cautious. The refusal rate went from 0.5% to 15%. That's a 30x increase in saying "I can't do that."
- Changed the default communication style. Longer, more structured, more hedged, more disclaimers. ChatGPT sounds like a helpful customer service agent. GPT-3 sounds like a text completion engine.
These aren't small changes. This is a fundamentally different user experience built on the same underlying model family. The people saying "ChatGPT is just GPT-3 with a wrapper" are missing the point. The wrapper IS the product.
The data shows that RLHF didn't just tweak the model. It reshaped how the model interacts with humans. Whether you think that reshaping is good depends on whether you value safety-first caution or directness-first utility.
I value both, depending on the task. Which is why I'll keep using both.
-- dataku