AI Training Data Tracker
What data went into each model? Training sources, cutoff dates, and known issues. This is my attempt to document what the labs won't tell you clearly.
GPT-4 / GPT-4 Turbo
OpenAI
Cutoff: April 2024
Training Data Sources
Web crawl data, books, code repositories, Wikipedia, licensed datasets. Specific composition undisclosed.
Known Issues
Training data copyright lawsuits from NYT, Authors Guild. Potential memorization of copyrighted text.
Source
OpenAI technical report (March 2023), system card updates
GPT-4o
OpenAI
Cutoff: October 2023
Training Data Sources
Same base as GPT-4 plus additional multimodal data (images, audio). Web crawl through late 2023.
Known Issues
Inherits GPT-4 copyright concerns. Image training data sources undisclosed.
Source
OpenAI blog (May 2024), API documentation
GPT-4o Mini
OpenAI
Cutoff: October 2023
Training Data Sources
Distilled from GPT-4o. Training mix not separately disclosed.
Known Issues
Distilled model, so it inherits the base model's biases while reproducing its capabilities less faithfully.
Source
OpenAI blog (July 2024)
o1 / o1-mini
OpenAI
Cutoff: October 2023
Training Data Sources
GPT-4o base plus reinforcement learning on chain-of-thought reasoning. Additional math and science training data.
Known Issues
Reasoning traces can include fabricated intermediate steps. Higher hallucination rate on factual recall.
Source
OpenAI blog (September 2024), technical report
Claude 3 Opus / Sonnet / Haiku
Anthropic
Cutoff: August 2023
Training Data Sources
Web data, public code, books, academic papers. Trained with Constitutional AI, which combines human feedback with AI-generated feedback (RLAIF). Specific mix undisclosed.
Known Issues
Training data composition fully undisclosed. Anthropic publishes usage policies but not data sources.
Source
Anthropic model card (March 2024)
Claude 3.5 Sonnet
Anthropic
Cutoff: April 2024
Training Data Sources
Updated training data beyond Claude 3 cutoff. Specific sources not disclosed. Extended code training.
Known Issues
Same opacity as Claude 3. No public data audit.
Source
Anthropic blog (June 2024)
Claude Opus 4 / Sonnet 4
Anthropic
Cutoff: March 2025
Training Data Sources
Expanded training corpus. Likely includes web data through early 2025. Details not disclosed.
Known Issues
Anthropic remains the least transparent major lab about training data composition.
Source
Anthropic blog (June 2025)
Gemini 1.5 Pro
Google
Cutoff: November 2023
Training Data Sources
Web documents, books, code, math data, multimodal data (images, video, audio). Trained on Google's proprietary data pipeline.
Known Issues
Uses Google Search index data, raising questions about data consent. YouTube transcripts reportedly included.
Source
Gemini technical report (Feb 2024), Google blog
Gemini 2.0 Flash / Pro
Google
Cutoff: August 2024
Training Data Sources
Updated Gemini pipeline. Extended web, code, and multimodal data. Specific composition not disclosed.
Known Issues
YouTube and Google Books content inclusion contested. Reddit licensing deal covers only post-2024 data.
Source
Google blog (February 2025)
Llama 2 (7B/13B/70B)
Meta
Cutoff: July 2023
Training Data Sources
2T tokens from publicly available sources. No Meta user data. Includes CommonCrawl, C4, Wikipedia, arXiv, GitHub, Books3.
Known Issues
Books3 dataset includes pirated books (litigation by authors). CommonCrawl quality varies significantly.
Source
Llama 2 paper (July 2023), Meta blog
Llama 3 (8B/70B)
Meta
Cutoff: March 2024
Training Data Sources
15T+ tokens. Custom web crawler, code repositories, multilingual data. 5% non-English. No Meta user data.
Known Issues
Trained on 7x more data than Llama 2. Meta's custom crawler raises web publisher concerns.
Source
Llama 3 paper (April 2024), Meta blog
Llama 3.1 (8B/70B/405B)
Meta
Cutoff: December 2023
Training Data Sources
15.6T tokens. Same pipeline as Llama 3 with extended multilingual data. Synthetic data used for fine-tuning.
Known Issues
At its release, 405B was the largest open-weight dense model. Training cost estimated at $30M+.
Source
Llama 3.1 paper (July 2024), Meta blog
Mistral 7B
Mistral
Cutoff: September 2023
Training Data Sources
Web data from undisclosed sources. Mistral is notably secretive about training data for an 'open' company.
Known Issues
The Mistral 7B paper (October 2023) covers architecture and benchmarks but not training data. No training data documentation. Open weights, closed data.
Source
Mistral blog (September 2023), README on Hugging Face
Mistral Large 2
Mistral
Cutoff: July 2024
Training Data Sources
128K context. Trained on undisclosed multilingual data. Strong in French and European languages.
Known Issues
Training data documentation remains minimal despite being a European 'open' AI lab.
Source
Mistral blog (July 2024)
DeepSeek V3
DeepSeek
Cutoff: November 2024
Training Data Sources
14.8T tokens. Web data, code, math. Mixture-of-Experts architecture (671B total, 37B active). Trained on H800 GPUs.
Known Issues
Trained at a fraction of the cost of Western models ($5.6M reported). Data sources not fully disclosed but appear heavily Chinese web.
Source
DeepSeek V3 technical report (January 2025)
DeepSeek R1
DeepSeek
Cutoff: November 2024
Training Data Sources
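Built on the DeepSeek V3 base with large-scale reinforcement learning for reasoning. The R1 paper lists process reward models among its unsuccessful attempts and relies mainly on rule-based rewards for reasoning tasks.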
Known Issues
Reasoning quality varies by language. English and Chinese strongest, other languages less reliable.
Source
DeepSeek R1 paper (January 2025)
Grok 3
xAI
Cutoff: February 2025
Training Data Sources
Web data plus X/Twitter posts. xAI has unique access to real-time Twitter data. Also code and academic papers.
Known Issues
Using Twitter user data for training is controversial. Users had no clear opt-out before xAI's formation.
Source
xAI blog (February 2025)
Qwen 2.5 (various sizes)
Alibaba
Cutoff: August 2024
Training Data Sources
18T tokens. Web data, code, math, multilingual (29 languages). Alibaba's proprietary data pipeline.
Known Issues
Strong Chinese language performance but English occasionally shows translation artifacts.
Source
Qwen 2.5 technical report (September 2024)
About this tracker
Most AI labs are frustratingly vague about what data they train on. "Web data" could mean anything. I built this tracker to collect every concrete detail from official papers, blog posts, and model cards into one place.
Where information is missing, I say so. I won't fill gaps with guesses. If a lab says "publicly available data" and nothing more, that's what gets recorded here.
Sources are cited for every entry. If you spot an error or have a source I'm missing, email hello@dataku.ai.
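For anyone who wants to reuse the entries, here is a minimal sketch of how one could be kept as a structured record, with missing information marked explicitly rather than guessed. The `TrackerEntry` class, its field names, and the `UNDISCLOSED` placeholder are assumptions made for illustration; they do not describe how this tracker is actually stored or built.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical placeholder shown wherever a lab has published nothing.
UNDISCLOSED = "Not disclosed"


@dataclass
class TrackerEntry:
    """One model entry, mirroring the sections used on this page."""
    model: str
    lab: str
    cutoff: Optional[str] = None        # as reported, e.g. "October 2023"
    sources: Optional[str] = None       # training data description
    known_issues: Optional[str] = None
    citations: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the entry as plain text, one field per line.

        Empty fields are printed as UNDISCLOSED instead of being
        filled with a guess.
        """
        lines = [
            self.model,
            self.lab,
            f"Cutoff: {self.cutoff or UNDISCLOSED}",
            "Training Data Sources",
            self.sources or UNDISCLOSED,
            "Known Issues",
            self.known_issues or UNDISCLOSED,
            "Source",
            ", ".join(self.citations) or UNDISCLOSED,
        ]
        return "\n".join(lines)


# Example built from the GPT-4o Mini entry above, leaving out the
# known-issues field to show how a gap is rendered.
example = TrackerEntry(
    model="GPT-4o Mini",
    lab="OpenAI",
    cutoff="October 2023",
    sources="Distilled from GPT-4o. Training mix not separately disclosed.",
    citations=["OpenAI blog (July 2024)"],
)
print(example.render())
```

Keeping lab, cutoff, and sources as separate fields also makes it easy to sort or filter entries by lab or cutoff date.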