
Deepseek Distilled OpenAI Data?


It is a serious accusation: According to reports, the Chinese AI start-up Deepseek accelerated its development using data from the ChatGPT maker.

The US ChatGPT developer OpenAI makes allegations against the Chinese start-up Deepseek.

Microsoft and OpenAI are investigating whether developers associated with the Chinese AI start-up Deepseek have improperly harvested data from the ChatGPT maker. This was reported by Bloomberg and the Financial Times.

An OpenAI spokeswoman said: “We know that companies from the People’s Republic of China – and others – are constantly trying to distill the models of leading US AI companies. As a leading AI developer, we are taking countermeasures to protect our intellectual property.”

Keep Your Meetings and Conversations Secure

90% of your meeting data leaks online. Want to change that? We offer familiar features such as AI meeting notes and transcripts, wrapped in ironclad data privacy. Get started with an AI assistant that protects your data.

Artificial intelligence (AI) researchers use the word “distillation” for the technique of using the outputs of large AI language models to improve smaller – and therefore less expensive – models.
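The core idea can be sketched in a few lines of Python. This is a simplified illustration, not any lab’s actual training code: the large “teacher” model’s output probabilities, softened by a temperature, serve as training targets for the smaller “student,” which minimizes the divergence between the two distributions. All logits below are made-up numbers over a hypothetical three-token vocabulary.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw model scores into probabilities. A higher temperature
    flattens the distribution, exposing more of the teacher's "dark
    knowledge" about plausible-but-not-top answers."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's softened distribution and the
    student's -- the quantity a student minimizes during distillation."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits for one prediction step.
teacher = [4.0, 1.5, 0.2]
student = [3.0, 2.0, 0.5]
print(distillation_loss(teacher, student))  # > 0: distributions differ
```

The loss falls to zero only when the student exactly reproduces the teacher’s softened output, which is why large volumes of teacher outputs are valuable training data.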

The ChatGPT developer did not provide any further details about its findings. In its terms of use, the AI start-up prohibits “copying” its services or “using their results to develop models that compete with OpenAI.”

LEARN MORE: ChatGPT Fined Over Personal Data Misuse

US Authorities Step In to Investigate OpenAI Data Leak to Deepseek

David Sacks, US President Donald Trump’s AI adviser, told Fox News that there was “substantial evidence” that Deepseek had relied on the outputs of OpenAI’s models to develop its own technology.

“There is strong evidence that Deepseek has extracted knowledge from OpenAI models here, and I don’t think OpenAI is very happy about it,” Sacks said, without giving further details.

According to company sources, OpenAI has observed several attempts to copy its models in the past. The start-up says it responds by blocking the user accounts involved, and it is working with its partner Microsoft on this.

In the future, it will be “crucial to work closely with the US government to best protect the most powerful models from the efforts of adversaries and competitors to acquire US technology,” the OpenAI spokeswoman continued.


Microsoft Reported Data Leaks

According to insiders, the companies are investigating whether a group linked to Deepseek obtained data via OpenAI’s application programming interfaces (APIs). Unnamed Microsoft security experts say extensive leaks occurred in the fall.

Software developers can license the OpenAI APIs for a fee to integrate OpenAI’s AI models into their own applications. However, the license does not permit using the API outputs to develop competing AI models.

A Microsoft spokesperson declined to comment. Deepseek and the hedge fund High-Flyer, which owns Deepseek, also did not respond to a Bloomberg request.

Say Goodbye to Meeting Chaos

Try our secure AI meeting assistant to manage meeting notes, agendas, and tasks effortlessly. Sign up today for an AI meeting platform designed with data privacy at the core. Perfect for industries that demand privacy and confidentiality, such as legal, finance, and defense.

Where Does ChatGPT Get Its Data?

ChatGPT’s data comes from a mixture of publicly available text, licensed data, and OpenAI’s proprietary sources. It has been trained on a broad range of internet text, including books, articles, and websites, but it does not (according to the company) have direct access to proprietary, private, or real-time data, such as live news, confidential databases, or personal user information.

Large language models (LLMs) like ChatGPT are trained on a mixture of the following data sources:

Publicly Available Text

  • Books, research papers, and freely available literature.
  • Wikipedia and other open-access encyclopedias.
  • Publicly available government documents and reports.

Licensed Data

  • Some LLMs use licensed datasets from publishers or content providers to improve quality and coverage.
  • This may include high-quality journalism, scientific papers, and industry reports.

Common Crawl and Web Scraping

  • Some models are trained on large-scale web data (e.g., Common Crawl), which includes publicly available blogs, news sites, and forums.
  • However, reputable AI developers follow legal and ethical guidelines, avoiding copyrighted or restricted content unless explicitly licensed.

Code Repositories

  • Some LLMs include publicly available code from sources like GitHub (subject to licensing restrictions).

Conversational Data 

  • Some models fine-tune responses using anonymized interactions.

What LLMs Do NOT Use

  • Private, proprietary, or confidential data. In practice, however, no tool can reliably distinguish these categories from public data in, for example, a meeting transcript.
  • Paywalled or restricted content (unless licensed).
  • Real-time internet access.

The exact datasets used vary by model and developer.

It’s important to note that OpenAI has not publicly disclosed the full details of its training datasets for proprietary models like ChatGPT.

READ MORE: OpenAI Chat: Security Considerations

How Can You Safeguard Data That May Appear in LLMs?

While reputable LLMs claim to not use or store confidential data from meetings, the way data is handled depends on the specific AI platform and its privacy policies. Here’s what to consider:

Real-Time Processing vs. Storage

Most AI models process data in real time without storing it after the interaction ends. Some meeting AI tools may temporarily store transcripts for user access but should follow strict security protocols.

Enterprise and Compliance Standards

AI meeting assistants designed for businesses (e.g., Microsoft Copilot, Eyre.ai, or Otter.ai) typically comply with GDPR, CCPA, HIPAA, and other regulations. Secure AI solutions such as Eyre.ai use end-to-end encryption to protect data.

OpenAI and ChatGPT Policy

According to OpenAI, conversations are not stored or used for training when ChatGPT is accessed via ChatGPT Enterprise or the API. The free version may retain some interactions for model improvement, but OpenAI states that it does not use private meeting data without explicit consent.

Third-Party Integrations

If using an AI tool within Zoom, Microsoft Teams, or Google Meet, ensure it follows strict data security measures.
Some AI-powered meeting assistants save and analyze conversations—always check their privacy settings.

How to Ensure Security of Your Meeting Data

  • Use enterprise-grade AI tools with strict data protection policies.
  • Enable encryption and access controls to prevent unauthorized access.
  • Review the privacy policy of any AI tool before using it in confidential meetings.
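To make the access-control point concrete, here is a minimal, hypothetical sketch (not any vendor’s actual mechanism) of gating meeting transcripts behind HMAC-signed, expiring tokens: the server only serves a transcript to a caller presenting a token whose signature verifies and whose expiry has not passed. All names and the secret below are illustrative.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical; load from a secrets manager in practice

def issue_token(user_id, meeting_id, expires_at):
    """Sign (user, meeting, expiry) so the server can later verify that
    exactly this access grant was issued and has not been tampered with."""
    payload = f"{user_id}:{meeting_id}:{expires_at}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def can_access(token, now=None):
    """Check signature first (constant-time compare), then expiry."""
    now = time.time() if now is None else now
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # forged or tampered token
    expires_at = float(payload.rsplit(":", 1)[1])
    return now < expires_at

token = issue_token("alice", "standup-42", time.time() + 3600)
print(can_access(token))  # True: valid signature, not yet expired
tampered = token[:-1] + ("0" if token[-1] != "0" else "1")
print(can_access(tampered))  # False: signature check fails
```

Real deployments would layer this under TLS and encrypt the transcripts at rest; the sketch only illustrates why signed, expiring grants beat a bare “anyone with the link” model.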

Privacy Is Not an Option

Did you know that your meetings are leaking private information? You need a secure AI meeting platform you can trust. At Eyre Meet, encryption and meeting data protection are included by default. What happens in your meeting is your business.

The Story of Deepseek Is Still Evolving

The Chinese AI start-up Deepseek released a new, open-source AI model called R1 earlier this month. In initial comparison tests, it proved at least on par with the leading models from OpenAI, Google, and Meta. At the same time, it is said to have been developed on older hardware – and at a fraction of the cost.

Observers reacted with astonishment and shock to the release of R1, with some saying it could turn the AI market upside down and challenge the US lead. Shares of chip manufacturers such as Nvidia and of other tech companies such as Microsoft, Oracle, and Google fell sharply; in total, almost a trillion dollars in market value was wiped out. However, prices recovered quickly.

Microsoft is OpenAI’s most important investor and has put roughly 13 billion dollars into the start-up, mainly in the form of computing time in its cloud data centers.

Microsoft was the first to inform OpenAI about the activities of the group allegedly linked to Deepseek. The group may also have tried to circumvent OpenAI’s limits on how much data an individual account can retrieve.
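Such per-account volume limits are commonly implemented as rate limiting. A minimal token-bucket sketch (illustrative only — not OpenAI’s actual mechanism) shows the idea: each request spends one token, tokens refill at a fixed rate, and a small burst capacity caps how much any single account can pull at once.

```python
import time

class TokenBucket:
    """Minimal per-account rate limiter: each request costs one token;
    tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last request,
        # capped at the bucket's capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request rejected: quota exhausted

# A slow-refilling bucket: 5 requests of burst, then callers must wait.
bucket = TokenBucket(rate_per_sec=0.001, capacity=5)
results = [bucket.allow() for _ in range(8)]
print(results.count(True))  # only the burst capacity's worth succeed
```

Distillation-style harvesting needs very large volumes of model outputs, which is why attackers reportedly spread requests across many accounts — and why providers respond by blocking those accounts rather than just throttling them.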

Author Profile
Julie Gabriel

Julie Gabriel wears many hats—founder of Eyre.ai, product marketing veteran, and, most importantly, mom of two. At Eyre.ai, she’s on a mission to make communication smarter and more seamless with AI-powered tools that actually work for people (and not the other way around). With over 20 years in product marketing, Julie knows how to build solutions that not only solve problems but also resonate with users. Balancing the chaos of entrepreneurship and family life is her superpower—and she wouldn’t have it any other way.
