InstructGPT and RLHF: what the training data tells us

I've read the InstructGPT paper four times now. Not because it's hard to understand, but because the details about the human data pipeline are the most interesting thing I've read in an AI paper all year.

Everyone talks about RLHF (Reinforcement Learning from Human Feedback) as the magic ingredient. Far fewer people dig into what the "Human Feedback" part actually looked like. The paper has surprisingly specific numbers about the labeling workforce, the process, and the data quality. Let me walk through all of it.

The workforce

OpenAI hired approximately 40 contractors through Upwork and ScaleAI to provide human feedback. That number is so small it made me recount. Forty people. A model that would become the foundation for what is arguably the most important consumer AI product in history was shaped by the preferences of forty humans.

The paper discloses some details about selection:

| Screening metric | Requirement |
|-----------------|------------|
| Agreement rate with researchers | >75% on a test set |
| Sensitive content handling | Passed screening task |
| Time per task (comparison) | 3-8 minutes |
| Labeler retention (stayed through project) | ~30 of 40 (est.) |

The labelers were filtered by how well their preferences aligned with the OpenAI research team's preferences. This is a critical detail. RLHF doesn't train the model on "what humans want." It trains the model on "what these specific 40 humans, selected to agree with OpenAI researchers, want."

That's not a criticism. It's just precise language about what the data represents.
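To make the selection filter concrete, here is a minimal sketch of screening by agreement with a reference set of researcher judgments. The function names, the toy test set, and the labeler names are all hypothetical; the 0.75 threshold is simply the figure from the table above, and the paper's actual screening procedure may well differ in its details.

```python
def agreement_rate(labeler_choices, researcher_choices):
    """Fraction of test items where the labeler matched the researchers."""
    if len(labeler_choices) != len(researcher_choices):
        raise ValueError("both lists must cover the same test items")
    if not labeler_choices:
        raise ValueError("need at least one test item")
    matches = sum(a == b for a, b in zip(labeler_choices, researcher_choices))
    return matches / len(labeler_choices)


def screen_labelers(candidates, researcher_choices, threshold=0.75):
    """Keep candidates whose agreement rate is strictly above the threshold."""
    return [
        name
        for name, choices in candidates.items()
        if agreement_rate(choices, researcher_choices) > threshold
    ]


# Hypothetical test set of 8 comparisons; "A"/"B" marks the preferred output.
researchers = ["A", "B", "A", "A", "B", "B", "A", "B"]
candidates = {
    "labeler_1": ["A", "B", "A", "A", "B", "B", "A", "A"],  # 7 of 8 match
    "labeler_2": ["B", "B", "A", "B", "B", "A", "A", "A"],  # 4 of 8 match
}

print(screen_labelers(candidates, researchers))  # ['labeler_1']
```

The point of the sketch is the direction of the filter: the researchers' choices are the reference, so whoever passes is by construction someone whose preferences resemble theirs.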

The three datasets

The InstructGPT training process used three distinct datasets, each serving a different purpose:

| Dataset | Size | Purpose | Collection method |
|---------|------|---------|------------------|
| SFT (Supervised Fine-Tuning) | ~13,000 prompts | Initial behavior shaping | Labelers write ideal responses |
| RM (Reward Model) | ~33,000 prompts with rankings | Train the reward function | Labelers rank 4-9 outputs per prompt |
| PPO (Policy) | ~31,000 prompts | Reinforcement learning training | No new human data, uses RM as reward |

Here's what caught my eye: 13,000 prompts for the SFT dataset. That's the dataset where human labelers wrote the "ideal" response to each prompt. Thirteen thousand examples shaped the initial behavior of a 175-billion-parameter model.

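The RM dataset is larger at 33,000 prompts, and each prompt comes with 4 to 9 model-generated outputs ranked by labelers. A ranking of K outputs implies K(K-1)/2 pairwise comparisons, so each prompt contributes between 6 and 36 pairs. The number of comparison pairs is therefore far larger than the prompt count: roughly 200,000 if every prompt had only 4 outputs, and up to about 1.2 million if every prompt had 9 (my estimate based on the combinatorics; the true figure depends on how many outputs each prompt actually had).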
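The pair count implied by a ranking is easy to check by hand. This sketch assumes every prompt is fully ranked with the same number of outputs K, which is a simplification: the paper only gives the 4-to-9 range, so the two results below are bounds rather than the real total. The constant and function names are mine.

```python
from math import comb

RM_PROMPTS = 33_000  # approximate prompt count from the dataset table


def pairs_per_prompt(k):
    """Number of pairwise comparisons implied by ranking k outputs."""
    return comb(k, 2)


def total_pairs(num_prompts, k):
    """Total pairs if every prompt had exactly k ranked outputs."""
    return num_prompts * pairs_per_prompt(k)


for k in (4, 9):
    print(f"K={k}: {pairs_per_prompt(k)} pairs per prompt, "
          f"{total_pairs(RM_PROMPTS, k):,} pairs in total")
# K=4: 6 pairs per prompt, 198,000 pairs in total
# K=9: 36 pairs per prompt, 1,188,000 pairs in total
```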

The 5-step process

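The paper presents the pipeline as three stages (demonstrations, comparisons, reinforcement learning). I find it clearer split into five steps: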

Step 1. Collect demonstration data. Labelers are given prompts and write the ideal response. This creates the SFT dataset.

Step 2. Train a supervised fine-tuning model on the demonstration data. This gives you a model that's trying to mimic human-written responses.

Step 3. Collect comparison data. The SFT model generates multiple outputs for each prompt. Labelers rank the outputs from best to worst.

Step 4. Train a reward model on the comparison data. This model learns to predict which outputs humans prefer.

Step 5. Use PPO (Proximal Policy Optimization) to fine-tune the language model using the reward model as a signal. The language model learns to generate outputs that score highly according to the reward model.

Steps 3-5 can be iterated. The paper shows results after one iteration, but mentions that additional iterations are possible.
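Step 4 is the least intuitive part, so here is a toy sketch of how a ranking becomes a training signal. As I read the paper, the reward model is trained with a pairwise log-sigmoid loss: for each pair, the preferred output should receive the higher score. In this sketch plain floats stand in for the reward model's outputs, and the function names are my own, not the paper's.

```python
import math
from itertools import combinations


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), computed in a stable form."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranking_loss(rewards_best_to_worst):
    """Mean pairwise loss over every pair implied by one labeler ranking.

    The input lists the reward model's scores for the outputs, ordered
    the way the labeler ranked them (best first).
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)


# Scores that agree with the labeler's ranking give a low loss...
print(ranking_loss([2.0, 1.0, 0.0]))
# ...and scores that invert the ranking give a high one.
print(ranking_loss([0.0, 1.0, 2.0]))
```

When the two scores are equal the loss is log 2, the "no idea" value. Training pushes it lower by widening the gap in the direction the labelers chose, and step 5 then optimizes the language model against those learned scores.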

Data quality metrics

The paper includes inter-annotator agreement rates, which is rare and valuable:

| Task type | Agreement rate |
|-----------|---------------|
| Ranking (which output is best) | 73% |
| Rating (absolute quality score) | 68% |
| Safety (is output harmful) | 79% |

A 73% agreement rate on rankings means that roughly 1 in 4 comparison pairs has disagreement among labelers. That's actually reasonable for subjective tasks. Anthropic published similar figures for their own RLHF work, showing inter-annotator agreement in the 70-80% range.

What does 73% agreement mean in practice? It means the reward model is learning from noisy data. Some of the "this output is better than that output" labels are wrong (or at least, not universally agreed upon). The model has to learn the signal through the noise.
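For a feel of what an agreement figure measures, here is one common way to compute it: take every pair of labelers who judged the same item and count how often they gave the same answer. This is an illustrative definition on made-up data; the paper's exact metric may be defined differently.

```python
from itertools import combinations


def pairwise_agreement(labels_by_item):
    """Fraction of labeler pairs that gave the same label, pooled over items."""
    agree = 0
    total = 0
    for labels in labels_by_item:
        for a, b in combinations(labels, 2):
            agree += a == b
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agree / total


# Hypothetical data: 4 comparisons, each judged by 3 labelers.
toy_labels = [
    ["A", "A", "A"],  # all 3 pairs agree
    ["A", "A", "B"],  # 1 of 3 pairs agrees
    ["B", "B", "B"],  # all 3 pairs agree
    ["A", "B", "B"],  # 1 of 3 pairs agrees
]

print(f"{pairwise_agreement(toy_labels):.0%}")  # 67%
```

Note that two of the four toy items are unanimous and the pooled figure is still only 67%. A headline agreement rate can hide a mix of easy cases everyone agrees on and hard cases that split the room.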

Why this matters more than parameter counts

Here's my opinion, and it's a strong one: the InstructGPT paper is more important than the GPT-3 paper.

GPT-3 was about scale. Make the model bigger, give it more data, and it gets better. That's interesting but not surprising.

InstructGPT is about alignment. Take an existing model and make it do what you actually want, using a tiny amount of human feedback. The SFT dataset is 13,000 examples. The model has 175 billion parameters. That's a ratio of roughly one example per 13.5 million parameters. And it works.

The OpenAI alignment blog post accompanying the paper shows that InstructGPT with 1.3B parameters is preferred by humans over the base GPT-3 with 175B parameters. A model that's 135x smaller, fine-tuned with RLHF, beats the raw giant.

| Model | Parameters | Human preference rate (vs GPT-3 175B) |
|-------|-----------|--------------------------------------|
| GPT-3 175B (base) | 175B | 50% (baseline) |
| InstructGPT 1.3B | 1.3B | 62% |
| InstructGPT 6B | 6B | 68% |
| InstructGPT 175B | 175B | 78% |

That bottom row. InstructGPT 175B (same parameter count as GPT-3) is preferred 78% of the time. RLHF turned a 50/50 coin flip into a 78/22 blowout, using 13,000 demonstrations and 33,000 comparison prompts.
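The ratios quoted in this section are quick to verify. The snippet below only redoes the arithmetic on the figures already given above; the variable names are mine.

```python
GPT3_PARAMS = 175e9            # GPT-3 / InstructGPT 175B
SMALL_INSTRUCT_PARAMS = 1.3e9  # InstructGPT 1.3B
SFT_EXAMPLES = 13_000          # approximate SFT dataset size
WIN_RATE_175B = 0.78           # preference rate from the table above

params_per_example = GPT3_PARAMS / SFT_EXAMPLES
size_ratio = GPT3_PARAMS / SMALL_INSTRUCT_PARAMS
odds = WIN_RATE_175B / (1 - WIN_RATE_175B)

print(f"{params_per_example / 1e6:.1f} million parameters per SFT example")  # 13.5
print(f"{size_ratio:.0f}x size difference")                                  # 135x
print(f"{odds:.1f} to 1 odds of being preferred")                            # 3.5 to 1
```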

The labor economics

I keep thinking about those 40 labelers. The paper doesn't disclose pay rates, but Upwork AI training tasks in 2022 typically pay $15-25/hour. If we assume 20,000 total labor hours (a rough estimate for producing and quality-checking all three datasets), the human feedback cost is somewhere in the range of $300,000-$500,000.

That's less than 10% of GPT-3's training compute cost. The RLHF fine-tuning step itself uses far less compute than pre-training. So for under a million dollars of human labor and compute combined, OpenAI turned GPT-3 from a text completion engine into something that follows instructions.

The cost-effectiveness is staggering. And it suggests that the bottleneck for making better AI isn't compute or model size. It's high-quality human feedback data.
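The labor estimate earlier in this section is just hours times hourly rate, which makes it easy to test against other assumptions. Both the 20,000-hour figure and the $15-25 rate band are my guesses rather than numbers from the paper, so treat the output as a sensitivity check, not a result.

```python
def labor_cost_range(hours, low_rate, high_rate):
    """Return (low, high) total labor cost in dollars for an hourly rate band."""
    return hours * low_rate, hours * high_rate


# The estimate used in the text: 20,000 hours at $15-25 per hour.
low, high = labor_cost_range(20_000, 15, 25)
print(f"${low:,} to ${high:,}")  # $300,000 to $500,000

# How the range moves if the hours guess is off by a factor of two.
for hours in (10_000, 20_000, 40_000):
    lo_cost, hi_cost = labor_cost_range(hours, 15, 25)
    print(f"{hours:>6,} hours: ${lo_cost:,} to ${hi_cost:,}")
```

Even if the hours guess is off by 2x in the expensive direction, the top of the range is $1,000,000.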

What I expected vs. what the paper shows

I expected RLHF to require hundreds of thousands of labeled examples. It uses tens of thousands. I expected a large labeling workforce. It used 40 people. I expected the human data to be the expensive part. It's the cheap part.

The whole approach of "train a giant model and then steer it with a small amount of human data" is, I think, the most important idea in AI right now. Not bigger models. Better feedback.

I'll be watching Anthropic's approach closely. Their Constitutional AI work takes RLHF in a different direction, using AI-generated feedback to supplement human feedback. If that works at scale, the 40-labeler bottleneck disappears entirely.

The data on this one is still unfolding. But the InstructGPT paper gave us more concrete numbers about RLHF than anything else published so far, and that makes it invaluable.


-- dataku
