
Deepseek Distilled OpenAI Data?


It is a serious accusation: According to reports, the Chinese AI start-up Deepseek accelerated its development using data from the ChatGPT maker.

The US ChatGPT developer OpenAI makes allegations against the Chinese start-up Deepseek.

Microsoft and OpenAI are investigating whether developers associated with the Chinese AI start-up Deepseek have improperly harvested data from the ChatGPT maker. This was reported by Bloomberg and the Financial Times.

An OpenAI spokeswoman said: “We know that companies from the People’s Republic of China – and others – are constantly trying to distill the models of leading US AI companies. As a leading AI developer, we are taking countermeasures to protect our intellectual property.”

Keep Your Meetings and Conversations Secure

90% of your meeting data leaks online. Want to change that? We offer familiar features such as AI meeting notes and transcripts, wrapped in ironclad data privacy. Get started with an AI assistant that protects your data.

Artificial intelligence (AI) researchers use the word “distillation” for the technique of using the outputs of large AI language models to improve smaller – and therefore less expensive – models.
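The core idea can be sketched in a few lines of Python. This is a simplified illustration, not any lab’s actual training code: the large “teacher” model’s output probabilities, softened by a temperature, serve as training targets for the smaller “student,” which minimizes the divergence between the two distributions. All logits below are made-up numbers over a hypothetical three-token vocabulary.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw model scores into probabilities. A higher temperature
    flattens the distribution, exposing more of the teacher's "dark
    knowledge" about plausible-but-not-top answers."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's softened distribution and the
    student's -- the quantity a student minimizes during distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one prediction step.
teacher = [4.0, 1.5, 0.2]
student = [3.0, 2.0, 0.5]
print(distillation_loss(teacher, student))  # > 0: distributions differ
```

The loss falls to zero only when the student exactly reproduces the teacher’s softened output, which is why large volumes of teacher outputs are valuable training data.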

The ChatGPT developer did not provide any further details about its findings. In its terms of use, the AI start-up prohibits “copying” its services or “using their results to develop models that compete with OpenAI.”

LEARN MORE: ChatGPT Fined Over Personal Data Misuse

US Authorities Step In to Investigate OpenAI Data Leak to Deepseek

David Sacks, US President Donald Trump’s AI adviser, told Fox News that there was “substantial evidence” that Deepseek had relied on the outputs of OpenAI’s models to develop its own technology.

“There is strong evidence that Deepseek has extracted knowledge from OpenAI models here, and I don’t think OpenAI is very happy about it,” Sacks said, without giving further details.

According to company sources, OpenAI has observed several attempts to copy its models in the past. The start-up says it responds by blocking the user accounts involved, and it is working with its partner Microsoft on this.

In the future, it will be “crucial to work closely with the US government to best protect the most powerful models from the efforts of adversaries and competitors to acquire US technology,” the OpenAI spokeswoman continued.


Microsoft Reported Data Leaks

According to insiders, the companies are investigating whether a group linked to Deepseek obtained data via OpenAI’s application programming interfaces (APIs). Unnamed Microsoft security experts say extensive leaks occurred in the fall.

Software developers can license the OpenAI APIs for a fee to integrate OpenAI’s AI models into their own applications. However, the license does not permit using the API outputs to develop competing AI models.

A Microsoft spokesperson declined to comment. Deepseek and the hedge fund High-Flyer, which owns Deepseek, also did not respond to a Bloomberg request.

Say Goodbye to Meeting Chaos

Try our secure AI meeting assistant to manage meeting notes, agendas, and tasks effortlessly. Sign up today for an AI meeting platform designed with data privacy at the core. Perfect for industries that demand privacy and confidentiality, such as legal, finance, and defense.

Where Does ChatGPT Get Its Data?

ChatGPT’s data comes from a mixture of publicly available text, licensed data, and OpenAI’s proprietary sources. It has been trained on a broad range of internet text, including books, articles, and websites, but it does not (according to the company) have direct access to proprietary, private, or real-time data, such as live news, confidential databases, or personal user information.

Large language models (LLMs) like ChatGPT are trained on a mixture of the following data sources:

Publicly Available Text

  • Books, research papers, and freely available literature.
  • Wikipedia and other open-access encyclopedias.
  • Publicly available government documents and reports.

Licensed Data

  • Some LLMs use licensed datasets from publishers or content providers to improve quality and coverage.
  • This may include high-quality journalism, scientific papers, and industry reports.

Common Crawl and Web Scraping

  • Some models are trained on large-scale web data (e.g., Common Crawl), which includes publicly available blogs, news sites, and forums.
  • However, reputable AI developers follow legal and ethical guidelines, avoiding copyrighted or restricted content unless explicitly licensed.

Code Repositories

  • Some LLMs include publicly available code from sources like GitHub (subject to licensing restrictions).

Conversational Data 

  • Some models fine-tune responses using anonymized interactions.

What LLMs Do NOT Use

  • Private, proprietary, or confidential data. In practice, however, no tool can reliably distinguish these categories from public data in, for example, a meeting transcript.
  • Paywalled or restricted content (unless licensed).
  • Real-time internet access.

The exact datasets used vary by model and developer.

It’s important to note that OpenAI has not publicly disclosed the full details of its training datasets for proprietary models like ChatGPT.

READ MORE: OpenAI Chat: Security Considerations

How Can You Safeguard Data That May Appear in LLMs?

While reputable LLMs claim to not use or store confidential data from meetings, the way data is handled depends on the specific AI platform and its privacy policies. Here’s what to consider:

Real-Time Processing vs. Storage

Most AI models process data in real time without storing it after the interaction ends. Some meeting AI tools may temporarily store transcripts for user access but should follow strict security protocols.

Enterprise and Compliance Standards

AI meeting assistants designed for businesses (e.g., Microsoft Copilot, Eyre.ai, or Otter.ai) typically comply with GDPR, CCPA, HIPAA, and other regulations. Secure AI solutions such as Eyre.ai use end-to-end encryption to protect data.

OpenAI and ChatGPT Policy

According to OpenAI, conversations are not stored or used for training when ChatGPT is accessed via ChatGPT Enterprise or the API. The free version may retain some interactions for model improvement, but OpenAI states that it does not use private meeting data without explicit consent.

Third-Party Integrations

If using an AI tool within Zoom, Microsoft Teams, or Google Meet, ensure it follows strict data security measures.
Some AI-powered meeting assistants save and analyze conversations—always check their privacy settings.

How to Ensure Security of Your Meeting Data

  • Use enterprise-grade AI tools with strict data protection policies.
  • Enable encryption and access controls to prevent unauthorized access.
  • Review the privacy policy of any AI tool before using it in confidential meetings.
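To make the access-control point concrete, here is a minimal, hypothetical sketch (not any vendor’s actual mechanism) of gating meeting transcripts behind HMAC-signed, expiring tokens: the server only serves a transcript to a caller presenting a token whose signature verifies and whose expiry has not passed. All names and the secret below are illustrative.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; load from a secrets manager in practice

def issue_token(user_id, meeting_id, expires_at):
    """Sign (user, meeting, expiry) so the server can later verify that
    exactly this access grant was issued and has not been tampered with."""
    payload = f"{user_id}:{meeting_id}:{expires_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def can_access(token, now=None):
    """Check signature first (constant-time compare), then expiry."""
    now = time.time() if now is None else now
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged or tampered token
    expires_at = float(payload.rsplit(":", 1)[1])
    return now < expires_at

token = issue_token("alice", "standup-42", time.time() + 3600)
print(can_access(token))  # True: valid signature, not yet expired
tampered = token[:-1] + ("0" if token[-1] != "0" else "1")
print(can_access(tampered))  # False: signature check fails
```

Real deployments would layer this under TLS and encrypt the transcripts at rest; the sketch only illustrates why signed, expiring grants beat a bare “anyone with the link” model.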

Privacy Is Not an Option

Did you know that your meetings are leaking private information? You need a secure AI meeting platform you can trust. At Eyre Meet, encryption and meeting data protection are included by default. What happens in your meeting is your business.

The Story of Deepseek Is Still Evolving

The Chinese AI start-up Deepseek released a new, open-source AI model called R1 earlier this month. In initial comparison tests, it proved at least on par with the leading models from OpenAI, Google, and Meta. At the same time, it is said to have been developed on older hardware – and at a fraction of the cost.

Observers reacted with astonishment and shock to the release of R1, with some saying it could turn the AI market upside down and challenge the US lead. Shares of chip manufacturers such as Nvidia and of other tech companies such as Microsoft, Oracle, and Google fell sharply; in total, almost a trillion dollars in market value was wiped out. However, prices recovered quickly.

Microsoft is OpenAI’s most important investor and has put roughly 13 billion dollars into the start-up, mainly in the form of computing time in its cloud data centers.

Microsoft was the first to inform OpenAI about the activities of the group allegedly linked to Deepseek. The group may also have tried to circumvent OpenAI’s limits on how much data an individual account can retrieve.
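Such per-account volume limits are commonly implemented as rate limiting. A minimal token-bucket sketch (illustrative only — not OpenAI’s actual mechanism) shows the idea: each request spends one token, tokens refill at a fixed rate, and a small burst capacity caps how much any single account can pull at once.

```python
import time

class TokenBucket:
    """Minimal per-account rate limiter: each request costs one token;
    tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last request,
        # capped at the bucket's capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request rejected: quota exhausted

# A slow-refilling bucket: 5 requests of burst, then callers must wait.
bucket = TokenBucket(rate_per_sec=0.001, capacity=5)
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # only the burst capacity's worth succeed
```

Distillation-style harvesting needs very large volumes of model outputs, which is why attackers reportedly spread requests across many accounts — and why providers respond by blocking those accounts rather than just throttling them.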

Author Profile
Julie Gabriel

Julie Gabriel wears many hats—founder of Eyre.ai, product marketing veteran, and, most importantly, mom of two. At Eyre.ai, she’s on a mission to make communication smarter and more seamless with AI-powered tools that actually work for people (and not the other way around). With over 20 years in product marketing, Julie knows how to build solutions that not only solve problems but also resonate with users. Balancing the chaos of entrepreneurship and family life is her superpower—and she wouldn’t have it any other way.
