What was the Samsung ChatGPT data leak?

In April 2023, Samsung semiconductor engineers pasted proprietary source code, internal meeting notes, and chip design data into ChatGPT for debugging help. Three separate incidents occurred within 20 days of Samsung lifting its internal ChatGPT ban. Samsung subsequently banned ChatGPT company-wide.

Has OpenAI ever been hacked?

Yes. In early 2023, a hacker gained access to OpenAI's internal messaging systems. The breach was not publicly disclosed until July 2024 when reported by The New York Times. OpenAI stated that no customer data or model weights were compromised, but employee communications were accessed.

How many AI data breaches have there been?

dataku tracks 17+ major AI-related data incidents from 2023 to 2025, including data breaches, privacy violations, training data lawsuits, and infrastructure exposures. The actual number is likely higher as many incidents go unreported.

AI Data Leak Timeline

Every major AI-related data breach, privacy incident, and training data controversy. Newest first. Sourced from official disclosures, court filings, and security research.

2025-04-28 · OpenAI

GPT Store apps leak user data to third parties

An audit of popular GPT Store applications found that 47 of the top 200 apps were sending user conversation data to external servers without disclosure, including to advertising networks and data brokers.

Impact

User conversations from custom GPTs sent to third parties without consent. OpenAI removed offending apps.

Source: Security audit by Protect AI (2025), OpenAI blog response

2025-03-15 · Various

Model poisoning attacks via SEO-optimized training data

Researchers demonstrated that by creating websites optimized for web crawlers, they could inject specific misinformation into models during training. The 'Nightshade for text' attack showed that a small number of strategically placed web pages could shift model outputs on targeted topics.

Impact

Proof-of-concept for targeted training data poisoning. Affects all models trained on web crawls.

Source: Academic paper (2025), covered by Wired and MIT Technology Review

2025-01-29 · DeepSeek

DeepSeek database exposed publicly

Cloud security firm Wiz discovered an unsecured ClickHouse database belonging to DeepSeek that was publicly accessible. The database contained over a million log entries including chat histories, API keys, and backend operational data.

Impact

Chat logs, API secrets, and operational data exposed. Secured within hours of responsible disclosure by Wiz.

Source: Wiz Research (January 2025)

2025-01-08 · xAI

Grok trained on Twitter/X data without clear consent

After xAI launched Grok 2 with improved real-time knowledge, investigations confirmed that X/Twitter user posts were used for training without explicit opt-in consent. X's privacy policy had been quietly updated to allow this.

Impact

800M+ Twitter/X users' public posts used for AI training. EU DPA investigations opened. Class-action lawsuits filed.

Source: Reuters, DPC (Ireland) investigation notice, TechCrunch (January 2025)

2024-11-15 · DeepSeek

DeepSeek training data transparency concerns

Researchers analyzing DeepSeek V2 outputs found evidence of training on copyrighted western textbooks, academic papers, and code repositories without licensing, raising questions about Chinese AI labs' data sourcing practices.

Impact

Highlighted gap in international AI data governance. No legal action due to jurisdictional challenges.

Source: Academic analysis (2024), various security researchers

2024-09-23 · Anthropic

Claude contractor data handling incident

Reports emerged that some of Anthropic's RLHF contractors had access to user conversations for model training and safety evaluation purposes, with inadequate access controls and data minimization practices.

Impact

Raised questions about how all major AI labs handle user conversation data in their RLHF pipelines.

Source: The Information (September 2024), Anthropic response

2024-07-12 · OpenAI

Internal security breach (early 2023, disclosed July 2024)

The New York Times reported that a hacker gained access to OpenAI's internal messaging systems in early 2023. The breach exposed discussions among employees about AI technologies. OpenAI did not publicly disclose the incident for over a year.

Impact

Internal employee communications accessed. No customer data or model weights compromised (per OpenAI). Delayed disclosure raised governance concerns.

Source: The New York Times (July 2024), OpenAI internal memo

2024-04-02 · Meta

Llama 2 training data lawsuit (authors)

A group of authors including Sarah Silverman, Christopher Golden, and Richard Kadrey sued Meta, alleging that Llama models were trained on pirated copies of their books from the Books3 dataset (a known collection of pirated ebooks).

Impact

Legal challenge to open-source model training practices. Books3 dataset documented to contain ~197,000 pirated books.

Source: Silverman v. Meta Platforms Inc., various reporting

2024-03-29 · Google

Gemini generates fabricated links and sources

Researchers documented Gemini consistently generating fake URLs and fabricated academic citations that appeared legitimate. When users followed these links, some redirected to malicious domains that had been registered to exploit this pattern.

Impact

Users directed to phishing/malware domains via AI-fabricated URLs. Highlighted weaponization risk of hallucinated links.

Source: Google DeepMind acknowledgment, academic research (2024)

2024-01-10 · Various

GPT-Builder data leaks via custom GPTs

Security researchers found that custom GPTs built with the GPT Builder could be tricked into revealing their system prompts, uploaded knowledge files, and API keys through prompt injection.

Impact

Hundreds of custom GPTs had their proprietary instructions extracted. Some leaked API keys to third-party services.

Source: Multiple security researchers, NCC Group blog (January 2024)

2023-12-28 · New York Times

NYT sues OpenAI over training data

The New York Times filed a landmark copyright lawsuit against OpenAI and Microsoft. The complaint included examples of ChatGPT reproducing near-verbatim excerpts of NYT articles, which the Times cited as evidence that the training data included its copyrighted journalism.

Impact

Became a leading test case for AI training data copyright. Showed models can memorize and regurgitate training text.

Source: NY Times v. Microsoft/OpenAI, filed SDNY December 2023

2023-11-06 · OpenAI

ChatGPT DDoS and intermittent outages

Anonymous Sudan claimed responsibility for DDoS attacks on ChatGPT and the OpenAI API, causing widespread outages. While not a data leak, the attacks exposed API infrastructure vulnerabilities.

Impact

Multiple hours of downtime. API users affected. No data loss confirmed but raised infrastructure concerns.

Source: OpenAI status page, The Record (November 2023)

2023-09-18 · Microsoft

38TB internal data exposed via AI training storage

Microsoft AI researchers accidentally exposed 38 terabytes of internal data, including private keys, passwords, and over 30,000 internal Teams messages, through a misconfigured Azure SAS token on a GitHub repository used for AI training data.

Impact

38TB of Microsoft internal data publicly accessible. Included employee personal messages, credentials, and internal communications.

Source: Wiz Research (September 2023), Microsoft Security Response Center
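
As a rough illustration of what "misconfigured" can mean for a SAS token, the sketch below inspects a SAS URL's query string and flags two common problems: permissions beyond read/list, and an expiry far in the future. The `sp` (permissions) and `se` (expiry) query parameters are part of the Azure SAS format. The function name `audit_sas_url`, the set of "risky" permission letters, and the 7-day lifetime limit are assumptions made for this example, not Microsoft guidance or a description of the actual token involved.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

# Assumed policy for this sketch: anything beyond read/list is flagged,
# and tokens should not live longer than 7 days.
RISKY_PERMISSIONS = set("acdw")  # add, create, delete, write
MAX_LIFETIME = timedelta(days=7)


def audit_sas_url(url, now=None):
    """Return a list of human-readable findings for a SAS URL."""
    now = now or datetime.now(timezone.utc)
    params = parse_qs(urlparse(url).query)
    findings = []

    # "sp" holds the permission letters, e.g. "rl" or "racwdl".
    permissions = params.get("sp", [""])[0]
    risky = sorted(set(permissions) & RISKY_PERMISSIONS)
    if risky:
        findings.append("grants more than read/list: " + "".join(risky))

    # "se" holds the expiry time in ISO 8601 form.
    expiry_raw = params.get("se", [""])[0]
    if not expiry_raw:
        findings.append("no expiry (se) parameter found")
    else:
        expiry = datetime.fromisoformat(expiry_raw.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            expiry = expiry.replace(tzinfo=timezone.utc)
        if expiry - now > MAX_LIFETIME:
            findings.append("expires too far in the future: " + expiry_raw)

    return findings
```

A URL with `sp=rl` and an expiry two days out returns an empty list under this policy, while one with `sp=racwdl` and an expiry decades away returns two findings.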

2023-06-28 · OpenAI

ChatGPT web browsing retrieves private data

Security researchers demonstrated that ChatGPT's web browsing feature could be manipulated to retrieve and display content from private URLs, including Google Docs and other access-restricted pages.

Impact

Feature disabled temporarily. Raised concerns about indirect prompt injection through web content.

Source: Johann Rehberger (security researcher), multiple CVE reports

2023-04-01 · Samsung

Samsung employees leak proprietary code via ChatGPT

Samsung semiconductor engineers pasted proprietary source code, internal meeting notes, and chip design data into ChatGPT for debugging assistance. Three separate incidents reported within 20 days of Samsung lifting its ChatGPT ban.

Impact

Samsung banned ChatGPT company-wide. The pasted data may have entered OpenAI's training pipeline, since at the time user conversations were used for training by default.

Source: Bloomberg, The Economist Korea (April 2023)
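
For readers wondering what a guard against this kind of paste might look like, here is a minimal sketch of a pre-send filter that masks likely secrets in text before it goes to a chatbot. Everything in it is hypothetical: the pattern set, the labels, and the `redact` function are illustrative assumptions, not a description of any tool Samsung or OpenAI uses, and simple regexes like these would not catch proprietary source code or design data as such.

```python
import re

# Hypothetical patterns for this sketch; a real filter would need far more.
SECRET_PATTERNS = {
    # Key-like strings such as "sk-..." followed by 20+ alphanumerics.
    "api_key": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{20,}\b"),
    # PEM-encoded private key blocks.
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
    # Assignments like "password = hunter2" or "token: abc123".
    "password_assignment": re.compile(
        r"\b(?:password|passwd|secret|token)\s*[=:]\s*\S+", re.IGNORECASE
    ),
}


def redact(text):
    """Mask likely secrets; return the cleaned text and the labels that matched."""
    hits = []
    for label, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            hits.append(label)
    return text, hits
```

Calling `redact` on a prompt before sending it returns both the masked text and a list of which pattern types fired, so a caller could block the request outright instead of sending the masked version.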

2023-03-20 · OpenAI

ChatGPT conversation history leak

The same redis-py bug that exposed payment data also showed some users the titles of other users' conversations in the chat history sidebar.

Impact

Conversation titles (not full content) from other users visible. Exposed the existence of private conversations.

Source: OpenAI blog post (March 2023)

2023-03-20 · OpenAI

ChatGPT payment data breach

A bug in the open-source library redis-py exposed ChatGPT Plus subscribers' payment information to other users, including names, email addresses, payment addresses, and the last four digits of credit card numbers.

Impact

Payment data for ~1.2% of the ChatGPT Plus subscribers who were active during a 9-hour window may have been visible to other users.

Source: OpenAI blog post (March 2023), BleepingComputer reporting

Why I track this

The AI industry is moving fast and security is struggling to keep up. I started this timeline because no one was collecting all these incidents in one chronological, citable place.

Some of these are traditional security breaches (unauthorized access, exposed databases). Others are more subtle: training data that includes copyrighted work, user conversations sent to contractors without clear consent, or the blurry line between "publicly available data" and "data people expected to stay private."

I include training data lawsuits here because they represent a form of data handling incident, even if the legal system hasn't fully decided whether they constitute "leaks" in the traditional sense. The NYT complaint, for instance, included ChatGPT outputs that reproduce copyrighted text nearly verbatim. That's worth documenting.

This timeline is updated as new incidents are reported. If I'm missing something, email hello@dataku.ai.