Benchmark Analysis | November 14, 2022 | 6 min read

I ran GPT-3 on the same 50 questions every month for a year. Here's the drift.

Model outputs aren't static. I asked GPT-3 the same 50 factual questions monthly for 12 months. 17 answers changed. Some got better. Some got worse. 'Model drift' is real and measurable.

In November 2021, I set up a cron job.

Every month on the 1st, a Python script sends the same 50 factual questions to OpenAI's GPT-3 API (text-davinci-002), with temperature set to 0 (nominally deterministic output), and logs the responses. Same prompts, same settings, every month.
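A minimal sketch of what such a monthly logging job could look like. The actual script isn't shown in this post, so the details here are assumptions: `complete` is a hypothetical wrapper around whatever API client is in use, and the function name, record fields, and JSON Lines log format are illustrative only.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Iterable

# Settings as described in the post.
SETTINGS = {"model": "text-davinci-002", "temperature": 0, "max_tokens": 200}


def run_monthly_batch(
    questions: Iterable[str],
    complete: Callable[[str, dict], str],
    log_path: Path,
    run_date: date,
) -> int:
    """Send every question once and append one JSON line per answer.

    `complete` is a hypothetical stand-in for the real API call: it takes a
    prompt plus the settings dict and returns the response text.
    """
    written = 0
    with log_path.open("a", encoding="utf-8") as log:
        for question in questions:
            record = {
                "run_date": run_date.isoformat(),
                "question": question,
                "settings": SETTINGS,
                "answer": complete(question, SETTINGS),
            }
            log.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

A crontab schedule of `0 0 1 * *` would trigger a script like this at midnight on the 1st of each month. Storing the settings next to every answer makes it possible to confirm later that nothing on the client side changed.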

I just pulled the 12-month data. The results are... not what people assume about a "frozen" model.

The setup

50 questions across 5 categories:

  • History (10): Dates, events, people
  • Science (10): Facts, constants, processes
  • Geography (10): Capitals, populations, features
  • Math (10): Calculations, concepts
  • Current events (10): Things that were true in late 2021

  • Temperature: 0 (should produce identical outputs for identical inputs)
  • Model: text-davinci-002 via OpenAI API
  • Max tokens: 200 per response

The headline finding

Of 50 questions asked monthly for 12 months (600 total question-answer pairs), 17 questions produced at least one different answer at some point during the year.

| Category | Questions | Answers that changed | Change rate |
|----------|-----------|---------------------|-------------|
| History | 10 | 2 | 20% |
| Science | 10 | 1 | 10% |
| Geography | 10 | 4 | 40% |
| Math | 10 | 3 | 30% |
| Current events | 10 | 7 | 70% |

34% of questions produced different answers at some point. But the distribution is uneven. Science questions were the most stable (only 1 change). Current events questions were the most volatile (7 changes).
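The counting rule behind these numbers is simple: a question counts as changed if its monthly answers are not all identical. Here is a rough sketch of that rule; the function name and the shape of the `history` mapping are assumptions for illustration, not the original analysis code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def change_rate_by_category(
    history: Dict[Tuple[str, str], List[str]]
) -> Dict[str, float]:
    """Fraction of questions per category with more than one distinct answer.

    `history` maps (category, question) to that question's monthly answers.
    """
    totals: Dict[str, int] = defaultdict(int)
    changed: Dict[str, int] = defaultdict(int)
    for (category, _question), answers in history.items():
        totals[category] += 1
        if len(set(answers)) > 1:  # at least one month differed
            changed[category] += 1
    return {category: changed[category] / totals[category] for category in totals}
```

With 10 questions per category, 7 changed current-events questions give 7 / 10 = 70%, and 17 changed questions out of 50 give the 34% overall figure.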

What "changed" means

Let me show some actual examples. I won't list all 17, but here are representative ones from each category.

History (changed answer):

  • Q: "When did the Berlin Wall fall?"
  • Nov 2021 - Jun 2022: "The Berlin Wall fell on November 9, 1989."
  • Jul 2022 - Oct 2022: "The Berlin Wall fell on November 9, 1989, marking the end of the Cold War division of Germany."
  • The core fact didn't change, but the response added context. I count this as a change because the output bytes are different.
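To make the "output bytes are different" rule concrete, here is a small sketch that compares the two Berlin Wall answers quoted above. Hashing the UTF-8 bytes is one possible way to do the comparison and is an assumption here; `fingerprint`, `bytes_changed`, and `keeps_fact` are hypothetical helper names.

```python
import hashlib


def fingerprint(answer: str) -> str:
    """SHA-256 of the UTF-8 bytes, so any byte-level difference shows up."""
    return hashlib.sha256(answer.encode("utf-8")).hexdigest()


def bytes_changed(old: str, new: str) -> bool:
    """Strict rule: any difference in the bytes counts as a change."""
    return fingerprint(old) != fingerprint(new)


def keeps_fact(old: str, new: str, fact: str) -> bool:
    """Looser check: both answers still contain the same key fact."""
    return fact in old and fact in new


june = "The Berlin Wall fell on November 9, 1989."
july = ("The Berlin Wall fell on November 9, 1989, marking the end of "
        "the Cold War division of Germany.")

print(bytes_changed(june, july))                   # True
print(keeps_fact(june, july, "November 9, 1989"))  # True
```

The strict rule flags the answer as changed while the looser check confirms the date is the same in both, which is the distinction between "changed" and "changed in accuracy" used later in this post.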

Geography (changed answer):

  • Q: "What is the population of Tokyo?"
  • Nov 2021 - Mar 2022: "The population of Tokyo is approximately 13.96 million as of 2021."
  • Apr 2022 - Oct 2022: "The population of Tokyo is approximately 13.96 million people."
  • The "as of 2021" qualifier disappeared. Small change, but detectable.

Math (changed answer):

  • Q: "What is 17 x 23?"
  • Nov 2021 - Feb 2022: "17 x 23 = 391"
  • Mar 2022: "17 x 23 = 391. To calculate this, I multiplied 17 by 20 to get 340, then 17 by 3 to get 51, and added them together."
  • Apr 2022 - Oct 2022: "17 x 23 = 391"
  • The answer was correct the entire time, but in March it added a work-showing explanation, then reverted. This suggests OpenAI changed the model, or something in how it is served, between February and March, and again between March and April.

Current events (changed answer):

  • Q: "Who is the CEO of Twitter?"
  • Nov 2021 - Mar 2022: "The CEO of Twitter is Jack Dorsey." (Incorrect after Nov 29, 2021)
  • Apr 2022 - Oct 2022: "The CEO of Twitter is Parag Agrawal."
  • The model eventually updated, but it took about 4 months to reflect the leadership change. And the October answer doesn't reflect the pending Musk acquisition.

The drift timeline

I tallied the number of answers that were different from the previous month's answer:

| Month transition | Answers that changed |
|-----------------|---------------------|
| Nov -> Dec 2021 | 0 |
| Dec -> Jan 2022 | 2 |
| Jan -> Feb 2022 | 1 |
| Feb -> Mar 2022 | 5 |
| Mar -> Apr 2022 | 4 |
| Apr -> May 2022 | 0 |
| May -> Jun 2022 | 1 |
| Jun -> Jul 2022 | 3 |
| Jul -> Aug 2022 | 0 |
| Aug -> Sep 2022 | 1 |
| Sep -> Oct 2022 | 0 |

The February-to-March and March-to-April transitions had the most changes (5 and 4 respectively). This roughly coincides with OpenAI's InstructGPT rollout and updates to the text-davinci models in early 2022. The changes don't look like random drift. They cluster around what appear to be model update events.
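A month-over-month tally like the one in the table can be computed by comparing each answer with the previous month's answer for the same question. The sketch below assumes the logs have already been grouped into one chronological list per question; `changes_per_transition` is a hypothetical name, and the sample data is just the 17 x 23 example from earlier.

```python
from typing import Dict, List


def changes_per_transition(answers: Dict[str, List[str]]) -> List[int]:
    """Count, for each consecutive pair of months, how many answers differ.

    `answers` maps each question to its responses in chronological order;
    every list is assumed to have the same length.
    """
    if not answers:
        return []
    n_months = len(next(iter(answers.values())))
    counts = [0] * (n_months - 1)
    for monthly in answers.values():
        for i in range(n_months - 1):
            if monthly[i] != monthly[i + 1]:
                counts[i] += 1
    return counts


# The 17 x 23 example, November 2021 through April 2022.
short = "17 x 23 = 391"
verbose = ("17 x 23 = 391. To calculate this, I multiplied 17 by 20 to get 340, "
           "then 17 by 3 to get 51, and added them together.")
math_log = {"What is 17 x 23?": [short, short, short, short, verbose, short]}

print(changes_per_transition(math_log))  # [0, 0, 0, 1, 1]
```

Note that a question that changes and then reverts, like this one, contributes to two transitions but is still only one changed question.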

Did the answers get better or worse?

Of the 17 questions that changed:

| Direction | Count | Example |
|-----------|-------|---------|
| Improved (more accurate) | 8 | Twitter CEO corrected |
| Neutral (same accuracy, different phrasing) | 5 | Added/removed context |
| Degraded (less accurate) | 3 | Population number slightly off |
| Ambiguous | 1 | Math work shown then removed |

Most changes were improvements or neutral rephrasing. But 3 answers got worse. One geography question about a country's GDP shifted from a correct figure to a slightly outdated one after a model update. Another changed from a precise answer to a vaguer one.

Why this matters

If you're building a product on GPT-3 and you have tests that verify specific outputs, your tests will fail periodically even though your code didn't change. The model underneath is shifting.
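One way to make such tests less brittle is to assert on the fact you care about instead of the exact string. The sketch below is an illustration under that assumption, not a recommendation from OpenAI; `states_fact` is a hypothetical helper, and the two outputs are the 17 x 23 answers quoted earlier.

```python
import re


def states_fact(output: str, pattern: str) -> bool:
    """True if the output contains the expected fact, whatever the phrasing."""
    return re.search(pattern, output) is not None


short = "17 x 23 = 391"
verbose = ("17 x 23 = 391. To calculate this, I multiplied 17 by 20 to get 340, "
           "then 17 by 3 to get 51, and added them together.")

assert short != verbose                   # an exact-match test would break here
assert states_fact(short, r"\b391\b")     # the fact-level check still passes
assert states_fact(verbose, r"\b391\b")
```

This trades precision for stability: a pattern check tolerates rephrasing, but it will also miss regressions in anything the pattern doesn't cover.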

Weights & Biases has published similar observations about model drift in their monitoring documentation. It's a known phenomenon in ML operations, but I haven't seen systematic measurements of it for GPT-3 over this time scale.

For research, the implications are serious. If you benchmark GPT-3 in January and someone else benchmarks it in July, you might get different numbers on the same test, using the same API, with the same settings. That makes cross-study comparisons unreliable unless you control for the model version. OpenAI exposes named versions such as text-davinci-002, but as the data above shows, a fixed name did not give me fixed outputs.

What I learned from the exercise

Three things.

First, temperature=0 does not mean "always the same output." It means "most likely token at each step." If the model weights change (even slightly), the most likely tokens can change, and the output shifts.
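A toy illustration of that point: greedy decoding picks the highest-scoring token, so a small shift in scores can flip the choice when two candidates are close. The scores below are made up for illustration and `greedy_pick` is a hypothetical helper; nothing here reflects real model internals.

```python
from typing import Dict


def greedy_pick(logits: Dict[str, float]) -> str:
    """Temperature 0 in practice: take the single highest-scoring token."""
    return max(logits, key=lambda token: logits[token])


# Hypothetical scores for the token that follows "17 x 23 = 391".
before_update = {"<end>": 1.52, ".": 1.50}
after_update = {"<end>": 1.49, ".": 1.51}  # weights nudged slightly

print(greedy_pick(before_update))  # <end>
print(greedy_pick(after_update))   # .
```

In this made-up case, stopping versus continuing with a period is the difference between the short answer and one that goes on to show its work, even though the sampling setting never changed.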

Second, OpenAI appears to update its models without announcement. The February-March cluster of changes happened without any public changelog that I could find. If you're running GPT-3 in production, you're running a moving target.

Third, factual accuracy is not a static property. The model "knew" the wrong Twitter CEO for four months. It might "know" the wrong answer to other questions right now, and we won't find out until someone checks.

I'm continuing the experiment into 2023. The 24-month dataset will be twice as interesting.


-- dataku
