AI Training Data Tracker

What data went into each model? Training sources, cutoff dates, and known issues. This is my attempt to document what the labs won't tell you clearly.

GPT-4 / GPT-4 Turbo

OpenAI | Cutoff: September 2021 (GPT-4), December 2023 (GPT-4 Turbo, April 2024 release)

Training Data Sources

Web crawl data, books, code repositories, Wikipedia, licensed datasets. Specific composition undisclosed.

Known Issues

Training data copyright lawsuits from NYT, Authors Guild. Potential memorization of copyrighted text.

Source

OpenAI technical report (March 2023), system card updates

GPT-4o

OpenAI | Cutoff: October 2023

Training Data Sources

Same base as GPT-4 plus additional multimodal data (images, audio). Web crawl through late 2023.

Known Issues

Inherits GPT-4 copyright concerns. Image training data sources undisclosed.

Source

OpenAI blog (May 2024), API documentation

GPT-4o Mini

OpenAI | Cutoff: October 2023

Training Data Sources

Distilled from GPT-4o. Training mix not separately disclosed.

Known Issues

Distilled model, so inherits base model biases at lower fidelity.

Source

OpenAI blog (July 2024)

o1 / o1-mini

OpenAI | Cutoff: October 2023

Training Data Sources

GPT-4o base plus reinforcement learning on chain-of-thought reasoning. Additional math and science training data.

Known Issues

Reasoning traces can include fabricated intermediate steps. Higher hallucination rate on factual recall.

Source

OpenAI blog (September 2024), technical report

Claude 3 Opus / Sonnet / Haiku

Anthropic | Cutoff: August 2023

Training Data Sources

Web data, public code, books, academic papers. Fine-tuned with Constitutional AI (reinforcement learning from AI feedback) alongside human feedback. Specific mix undisclosed.

Known Issues

Only broad data categories are disclosed, not the composition. Anthropic publishes usage policies but not specific data sources.

Source

Anthropic model card (March 2024)

Claude 3.5 Sonnet

Anthropic | Cutoff: April 2024

Training Data Sources

Updated training data beyond Claude 3 cutoff. Specific sources not disclosed. Extended code training.

Known Issues

Same opacity as Claude 3. No public data audit.

Source

Anthropic blog (June 2024)

Claude Opus 4 / Sonnet 4

Anthropic | Cutoff: March 2025

Training Data Sources

Expanded training corpus. Likely includes web data through early 2025. Details not disclosed.

Known Issues

Anthropic remains the least transparent major lab about training data composition.

Source

Anthropic blog (May 2025)

Gemini 1.5 Pro

Google | Cutoff: November 2023

Training Data Sources

Web documents, books, code, math data, multimodal data (images, video, audio). Trained on Google's proprietary data pipeline.

Known Issues

Uses Google Search index data, raising questions about data consent. YouTube transcripts reportedly included.

Source

Gemini technical report (Feb 2024), Google blog

Gemini 2.0 Flash / Pro

Google | Cutoff: August 2024

Training Data Sources

Updated Gemini pipeline. Extended web, code, and multimodal data. Specific composition not disclosed.

Known Issues

YouTube and Google Books content inclusion contested. Reddit licensing deal covers only post-2024 data.

Source

Google blog (February 2025)

Llama 2 (7B/13B/70B)

Meta | Cutoff: September 2022 (pretraining); tuning data up to July 2023

Training Data Sources

2T tokens from publicly available sources. No Meta user data. The Llama 2 paper does not itemize datasets; CommonCrawl, C4, Wikipedia, arXiv, GitHub, and Books3 are the sources listed for the original LLaMA.

Known Issues

Books3 dataset includes pirated books (litigation by authors). CommonCrawl quality varies significantly.

Source

Llama 2 paper (July 2023), Meta blog

Llama 3 (8B/70B)

Meta | Cutoff: March 2023 (8B), December 2023 (70B)

Training Data Sources

15T+ tokens. Custom web crawler, code repositories, multilingual data. 5% non-English. No Meta user data.

Known Issues

Trained on 7x more data than Llama 2. Meta's custom crawler raises web publisher concerns.

Source

Meta blog and model card (April 2024), Llama 3 paper (July 2024)

Llama 3.1 (8B/70B/405B)

Meta | Cutoff: December 2023

Training Data Sources

15.6T tokens. Same pipeline as Llama 3 with extended multilingual data. Synthetic data used for fine-tuning.

Known Issues

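At its July 2024 release, Meta described 405B as the largest openly available model; DeepSeek V3 (671B total parameters) has since exceeded it in total size. Training cost estimated at $30M+.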

Source

Llama 3.1 paper (July 2024), Meta blog

Mistral 7B

Mistral | Cutoff: September 2023

Training Data Sources

Web data from undisclosed sources. Mistral is notably secretive about training data for an 'open' company.

Known Issues

A short arXiv paper followed in October 2023, but it does not document the training data. Open weights, closed data.

Source

Mistral blog (September 2023), README on Hugging Face

Mistral Large 2

Mistral | Cutoff: July 2024

Training Data Sources

128K context. Trained on undisclosed multilingual data. Strong in French and European languages.

Known Issues

Training data documentation remains minimal despite being a European 'open' AI lab.

Source

Mistral blog (July 2024)

DeepSeek V3

DeepSeek | Cutoff: November 2024

Training Data Sources

14.8T tokens. Web data, code, math. Mixture-of-Experts architecture (671B total, 37B active). Trained on H800 GPUs.

Known Issues

Trained at a fraction of the cost of Western models ($5.6M reported). Data sources not fully disclosed but appear heavily Chinese web.

Source

DeepSeek V3 technical report (December 2024)

DeepSeek R1

DeepSeek | Cutoff: November 2024

Training Data Sources

Built on DeepSeek V3 base with reinforcement learning for reasoning (GRPO with rule-based accuracy and format rewards). The paper lists process reward models among its unsuccessful attempts.

Known Issues

Reasoning quality varies by language. English and Chinese strongest, other languages less reliable.

Source

DeepSeek R1 paper (January 2025)

Grok 3

xAI | Cutoff: February 2025

Training Data Sources

Web data plus X/Twitter posts. xAI has unique access to real-time Twitter data. Also code and academic papers.

Known Issues

Using Twitter user data for training is controversial. Users had no clear opt-out before xAI's formation.

Source

xAI blog (February 2025)

Qwen 2.5 (various sizes)

Alibaba | Cutoff: August 2024

Training Data Sources

18T tokens. Web data, code, math, multilingual (29 languages). Alibaba's proprietary data pipeline.

Known Issues

Strong Chinese language performance but English occasionally shows translation artifacts.

Source

Qwen 2.5 blog (September 2024), technical report (December 2024)

About this tracker

Most AI labs are frustratingly vague about what data they train on. "Web data" could mean anything. I built this tracker to collect every concrete detail from official papers, blog posts, and model cards into one place.

Where information is missing, I say so. I won't fill gaps with guesses. If a lab says "publicly available data" and nothing more, that's what gets recorded here.
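
For anyone who wants to keep a similar tracker in code, here is a minimal Python sketch of the "record what's stated, mark what isn't" rule. The `TrackerEntry` class, its field names, and the `undisclosed` and `show` helpers are all hypothetical, made up for illustration; they are not how this site is built. The idea is simply that a missing fact is stored as `None` and rendered as "not disclosed" instead of being replaced with an estimate.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class TrackerEntry:
    """One tracker row. Field names are illustrative, not an actual schema."""
    model: str
    lab: str
    cutoff: Optional[str]       # None when the lab has not stated a cutoff
    token_count: Optional[str]  # None when the token count is undisclosed
    sources: Optional[str]      # None when nothing at all is stated
    citation: str


def undisclosed(entry: TrackerEntry) -> List[str]:
    """Return the names of fields that are recorded as missing (None)."""
    return [f.name for f in fields(entry) if getattr(entry, f.name) is None]


def show(value: Optional[str]) -> str:
    """Render a gap explicitly instead of substituting a guess."""
    return value if value is not None else "not disclosed"


# Two rows copied from the entries above.
llama_31 = TrackerEntry(
    model="Llama 3.1 (8B/70B/405B)",
    lab="Meta",
    cutoff="December 2023",
    token_count="15.6T",
    sources="Same pipeline as Llama 3 with extended multilingual data",
    citation="Llama 3.1 paper (July 2024), Meta blog",
)

mistral_7b = TrackerEntry(
    model="Mistral 7B",
    lab="Mistral",
    cutoff="September 2023",
    token_count=None,  # no figure given, so none is recorded
    sources="Web data from undisclosed sources",
    citation="Mistral blog (September 2023), README on Hugging Face",
)

for entry in (llama_31, mistral_7b):
    print(entry.model, "| tokens:", show(entry.token_count),
          "| gaps:", undisclosed(entry))
```

Keeping gaps as `None` rather than an empty string or a placeholder number makes it easy to list exactly which facts each lab has left out.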

Sources are cited for every entry. If you spot an error or have a source I'm missing, email hello@dataku.ai.