AI Data Leak Timeline
Every major AI-related data breach, privacy incident, and training data controversy. Newest first. Sourced from official disclosures, court filings, and security research.
GPT Store apps leak user data to third parties
An audit of popular GPT Store applications found that 47 of the top 200 apps were sending user conversation data to external servers without disclosure, including to advertising networks and data brokers.
Impact
User conversations from custom GPTs sent to third parties without consent. OpenAI removed offending apps.
Source: Security audit by Protect AI (2025), OpenAI blog response
Model poisoning attacks via SEO-optimized training data
Researchers demonstrated that by creating websites optimized for web crawlers, they could inject specific misinformation into models during training. The 'Nightshade for text' attack showed that a small number of strategically placed web pages could shift model outputs on targeted topics.
Impact
Proof-of-concept for targeted training data poisoning. Affects all models trained on web crawls.
Source: Academic paper (2025), covered by Wired and MIT Technology Review
DeepSeek database exposed publicly
Cloud security firm Wiz discovered an unsecured ClickHouse database belonging to DeepSeek that was publicly accessible. The database contained over a million log entries including chat histories, API keys, and backend operational data.
Impact
Chat logs, API secrets, and operational data exposed. Secured within hours of responsible disclosure by Wiz.
Source: Wiz Research (January 2025)
Grok trained on Twitter/X data without clear consent
After xAI launched Grok 2 with improved real-time knowledge, investigations confirmed that X/Twitter user posts were used for training without explicit opt-in consent. X's privacy policy had been quietly updated to allow this.
Impact
800M+ Twitter/X users' public posts used for AI training. EU DPA investigations opened. Class-action lawsuits filed.
Source: Reuters, DPC (Ireland) investigation notice, TechCrunch (January 2025)
DeepSeek training data transparency concerns
Researchers analyzing DeepSeek V2 outputs found evidence of training on copyrighted Western textbooks, academic papers, and code repositories without licensing, raising questions about Chinese AI labs' data sourcing practices.
Impact
Highlighted gap in international AI data governance. No legal action due to jurisdictional challenges.
Source: Academic analysis (2024), various security researchers
Claude contractor data handling incident
Reports emerged that some of Anthropic's RLHF contractors had access to user conversations for model training and safety evaluation purposes, with inadequate access controls and data minimization practices.
Impact
Raised questions about how all major AI labs handle user conversation data in their RLHF pipelines.
Source: The Information (September 2024), Anthropic response
OpenAI internal messaging breach (early 2023, disclosed July 2024)
The New York Times reported that a hacker gained access to OpenAI's internal messaging systems in early 2023. The breach exposed discussions among employees about AI technologies. OpenAI did not publicly disclose the incident for over a year.
Impact
Internal employee communications accessed. No customer data or model weights compromised (per OpenAI). Delayed disclosure raised governance concerns.
Source: The New York Times (July 2024), OpenAI internal memo
Llama training data lawsuit (authors)
A group of authors including Sarah Silverman, Christopher Golden, and Richard Kadrey sued Meta, alleging that Llama models were trained on pirated copies of their books from the Books3 dataset (a known collection of pirated ebooks).
Impact
Legal challenge to open-source model training practices. Books3 dataset documented to contain ~197,000 pirated books.
Source: Kadrey v. Meta Platforms, Inc. (plaintiffs include Sarah Silverman), various reporting
Gemini generates fabricated links and sources
Researchers documented Gemini consistently generating fake URLs and fabricated academic citations that appeared legitimate. When users followed these links, some redirected to malicious domains that had been registered to exploit this pattern.
Impact
Users directed to phishing/malware domains via AI-fabricated URLs. Highlighted weaponization risk of hallucinated links.
Source: Google DeepMind acknowledgment, academic research (2024)
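One way to reduce the risk described in this entry is to treat every URL in model output as untrusted until its domain is checked. The following is a minimal sketch of that idea, not a description of how Gemini or any other product handles links. The function name `split_links`, the regular expression, and the example domains are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL matcher for this sketch; it stops at whitespace,
# angle brackets, parentheses, and square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def split_links(model_output, trusted_domains):
    """Separate URLs in model output into (trusted, untrusted) lists.

    A URL counts as trusted only if its host equals a trusted domain or is
    a subdomain of one. Lookalike hosts such as evil-arxiv.org do not match.
    """
    trusted, untrusted = [], []
    for url in URL_PATTERN.findall(model_output):
        url = url.rstrip(".,;:")  # drop trailing sentence punctuation
        host = (urlparse(url).hostname or "").lower()
        is_trusted = any(
            host == domain or host.endswith("." + domain)
            for domain in trusted_domains
        )
        (trusted if is_trusted else untrusted).append(url)
    return trusted, untrusted


if __name__ == "__main__":
    text = (
        "See https://arxiv.org/abs/1234.5678 and "
        "https://arxiv-papers.example/paper.pdf."
    )
    ok, suspicious = split_links(text, {"arxiv.org"})
    print("render as links:", ok)
    print("show as plain text:", suspicious)
```

An allowlist does not prove a link is real (a trusted domain can still return a 404 for a fabricated path), but it keeps a hallucinated hostname from becoming a clickable link.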
GPT-Builder data leaks via custom GPTs
Security researchers found that custom GPTs built with the GPT Builder could be tricked into revealing their system prompts, uploaded knowledge files, and API keys through prompt injection.
Impact
Hundreds of custom GPTs had their proprietary instructions extracted. Some leaked API keys to third-party services.
Source: Multiple security researchers, NCC Group blog (January 2024)
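Since anything placed in a custom GPT's instructions or knowledge files should be assumed extractable, a pre-upload scan for credentials is a cheap safeguard. Here is a minimal sketch of such a scan. The pattern set is illustrative and far from complete, and `find_secrets` is a hypothetical helper, not part of any OpenAI tooling.

```python
import re

# Illustrative patterns only. Real secret scanners ship far larger rule sets.
SECRET_PATTERNS = {
    # Keys that start with "sk-", the prefix OpenAI-style API keys use.
    "openai_style_key": re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b"),
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_secrets(text):
    """Return (pattern_name, truncated_match) pairs for likely secrets.

    Matches are truncated to 8 characters so the report itself does not
    leak the full credential.
    """
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)[:8] + "..."))
    return hits


if __name__ == "__main__":
    instructions = (
        "You are a support bot. Call the billing API with key "
        "sk-abcdefghijklmnopqrstuvwx when asked about invoices."
    )
    for name, preview in find_secrets(instructions):
        print(f"possible {name}: {preview}")
```

If the scan finds anything, the safer design is to move the credential into a backend the GPT calls, so the key never appears in text the model can repeat.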
NYT sues OpenAI over training data
The New York Times filed a landmark copyright lawsuit against OpenAI and Microsoft. The complaint includes examples of ChatGPT reproducing near-verbatim excerpts of NYT articles, which the Times presents as evidence that the training data included its copyrighted journalism.
Impact
A test case that could set legal precedent for AI training data copyright. The complaint's examples indicate that models can memorize and regurgitate training text.
Source: NY Times v. Microsoft/OpenAI, filed SDNY December 2023
ChatGPT DDoS and intermittent outages
Anonymous Sudan claimed responsibility for DDoS attacks on ChatGPT and the OpenAI API, causing widespread outages. While not a data leak, the attacks exposed API infrastructure vulnerabilities.
Impact
Multiple hours of downtime. API users affected. No data loss confirmed but raised infrastructure concerns.
Source: OpenAI status page, The Record (November 2023)
38TB internal data exposed via AI training storage
Microsoft AI researchers accidentally exposed 38 terabytes of internal data, including private keys, passwords, and over 30,000 internal Teams messages. The cause was an overly permissive Azure SAS (shared access signature) token embedded in a storage URL published in a GitHub repository used for AI training data.
Impact
38TB of Microsoft internal data publicly accessible. Included employee personal messages, credentials, and internal communications.
Source: Wiz Research (September 2023), Microsoft Security Response Center
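A SAS token carries its permissions and expiry in the URL's query string (`sp` and `se`), so a basic audit needs nothing more than URL parsing. Below is a minimal sketch of such a check. The function name `audit_sas_url`, the 7-day threshold, and the example URL are assumptions for illustration, and the permission list covers only a subset of the letters Azure defines.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

# Subset of SAS permission letters that allow changing or removing data:
# a=add, c=create, w=write, d=delete.
RISKY_PERMISSIONS = set("acwd")


def audit_sas_url(url, now, max_lifetime=timedelta(days=7)):
    """Return a list of findings for one SAS URL (empty list means no flags).

    max_lifetime is an arbitrary policy threshold chosen for this sketch.
    Assumes the expiry uses the full YYYY-MM-DDTHH:MM:SSZ form.
    """
    params = parse_qs(urlparse(url).query)
    findings = []

    permissions = set(params.get("sp", [""])[0])
    risky = sorted(permissions & RISKY_PERMISSIONS)
    if risky:
        findings.append("grants more than read/list: " + "".join(risky))

    expiry_raw = params.get("se", [None])[0]
    if expiry_raw is None:
        findings.append("no expiry (se) parameter found")
    else:
        expiry = datetime.strptime(expiry_raw, "%Y-%m-%dT%H:%M:%SZ")
        expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > max_lifetime:
            findings.append("expires too far in the future: " + expiry_raw)

    return findings


if __name__ == "__main__":
    # Hypothetical URL; the account name and signature are placeholders.
    url = (
        "https://exampleaccount.blob.core.windows.net/models"
        "?sv=2021-08-06&sp=racwdl&se=2051-10-06T00:00:00Z&sig=REDACTED"
    )
    for finding in audit_sas_url(url, datetime.now(timezone.utc)):
        print(finding)
```

Running a check like this over every URL committed to a public repository would flag a token that combines write or delete permissions with a decades-long lifetime, which is the combination that makes a shared link dangerous.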
ChatGPT web browsing retrieves private data
Security researchers demonstrated that ChatGPT's web browsing feature could be manipulated to retrieve and display content from private URLs, including Google Docs and other access-restricted pages.
Impact
Feature disabled temporarily. Raised concerns about indirect prompt injection through web content.
Source: Johann Rehberger (security researcher), multiple CVE reports
Samsung employees leak proprietary code via ChatGPT
Samsung semiconductor engineers pasted proprietary source code, internal meeting notes, and chip design data into ChatGPT for debugging assistance. Three separate incidents reported within 20 days of Samsung lifting its ChatGPT ban.
Impact
Samsung banned ChatGPT company-wide. Because OpenAI used ChatGPT conversations for training by default at the time, the pasted material could have entered its training pipeline.
Source: Bloomberg, The Economist Korea (April 2023)
ChatGPT conversation history leak
The same redis-py bug that exposed payment data also let some users see the titles of other users' conversations in the chat history sidebar.
Impact
Conversation titles (not full content) from other users visible. Exposed the existence of private conversations.
Source: OpenAI blog post (March 2023)
ChatGPT payment data breach
A bug in the open-source library redis-py exposed ChatGPT Plus subscribers' payment information to other users, including names, email addresses, payment addresses, and the last four digits of credit card numbers.
Impact
~1.2% of the ChatGPT Plus subscribers who were active during a specific 9-hour window may have had payment data exposed to other users.
Source: OpenAI blog post (March 2023), BleepingComputer reporting
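As I read OpenAI's write-up, the underlying failure was a request cancelled after it was sent but before its reply was read, which left the reply sitting on a shared connection for the next caller. The toy model below illustrates that class of bug. It is not redis-py's code, and `ToyConnection` and its methods are invented for this sketch.

```python
from collections import deque


class ToyConnection:
    """Toy stand-in for a pooled connection: replies are read in FIFO order."""

    def __init__(self):
        self._pending_replies = deque()

    def send(self, user, command):
        # The pretend server answers immediately; the reply waits to be read.
        self._pending_replies.append(f"{command} result for {user}")

    def read(self):
        return self._pending_replies.popleft()

    def reset(self):
        # Mitigation sketch: drop unread replies before the connection is reused.
        self._pending_replies.clear()


def cancelled_request(conn, user):
    """Send a command, then abandon the request before reading its reply."""
    conn.send(user, "GET billing")
    # Cancellation happens here, so the reply stays on the connection.


def normal_request(conn, user):
    """Send a command and read the next reply from the connection."""
    conn.send(user, "GET billing")
    return conn.read()


if __name__ == "__main__":
    conn = ToyConnection()
    cancelled_request(conn, "alice")
    # Bob reuses the connection and receives the reply meant for Alice.
    print(normal_request(conn, "bob"))  # prints "GET billing result for alice"
```

Calling `conn.reset()` after the cancelled request (or discarding the connection entirely) makes Bob's request return his own data, which is the general shape of the fix for this kind of bug.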
Why I track this
The AI industry is moving fast and security is struggling to keep up. I started this timeline because no one was collecting all these incidents in one chronological, citable place.
Some of these are traditional security breaches (unauthorized access, exposed databases). Others are more subtle: training data that includes copyrighted work, user conversations sent to contractors without clear consent, or the blurry line between "publicly available data" and "data people expected to stay private."
I include training data lawsuits here because they represent a form of data handling incident, even if the legal system hasn't fully decided whether they constitute "leaks" in the traditional sense. The NYT lawsuit, for example, includes examples of ChatGPT reproducing copyrighted text near-verbatim. That's worth documenting.
This timeline is updated as new incidents are reported. If I'm missing something, email hello@dataku.ai.