Data Leakage
Unintended exposure of sensitive or personal data, including through AI system inputs or outputs.
Definition
Data leakage refers to the unintended or unauthorised exposure of sensitive, proprietary, or personal information. In AI contexts, it occurs through several pathways: users may input confidential data into AI systems that retain it or incorporate it into training datasets; models may memorise and reproduce fragments of training data in their outputs; and model inversion or extraction attacks can reconstruct private information from a model’s parameters or responses. Data leakage in AI systems is distinct from a traditional data breach in that it can occur without any network intrusion: the AI system itself becomes the vector through which sensitive information is exposed.
How It Relates to AI Threats
Data leakage is a core harm mechanism within the Privacy & Surveillance domain. Model inversion and data extraction techniques allow adversaries to recover training data — including personal information, proprietary code, or confidential documents — from deployed AI models. Behavioural profiling without consent occurs when AI systems aggregate user interactions to build detailed profiles that the user did not intend to disclose. Within Security & Cyber, data leakage through AI systems represents an expanding attack surface, as organisations that adopt AI tools may inadvertently expose sensitive data to third-party model providers or through publicly accessible AI interfaces.
Why It Occurs
- Users input sensitive data into AI tools without understanding retention and training policies
- Large language models can memorise training data and, under certain prompting conditions (for example, when given a prefix of the memorised text), reproduce verbatim excerpts from it
- Model inversion techniques can extract private information by systematically querying a model and analysing its responses
- Organisational policies for AI tool usage often lag behind adoption, leaving gaps in data handling governance
- The boundary between user input, model memory, and model output is not always transparent or well-controlled
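The memorisation pathway above can be audited in a simple way: compare a model's output against known training documents and flag any verbatim n-gram overlap. The sketch below is illustrative only; the function names, the whitespace tokenisation, and the choice of 5-grams are assumptions, and real memorisation audits use far more robust matching.

```python
def ngram_set(text, n=5):
    """All whitespace-tokenised n-grams in a text, as a set of strings."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(model_output, training_docs, n=5):
    """Map each training doc ID to the n-grams it shares verbatim with the output.

    `training_docs` is a dict of {doc_id: text}; only docs with at least one
    shared n-gram appear in the result.
    """
    out_ngrams = ngram_set(model_output, n)
    hits = {}
    for doc_id, doc in training_docs.items():
        shared = out_ngrams & ngram_set(doc, n)
        if shared:
            hits[doc_id] = shared
    return hits
```

For example, if an output repeats a five-word run from a confidential document, `verbatim_overlap` would surface that document's ID together with the overlapping phrases, giving a crude but useful leakage signal.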
Real-World Context
The Samsung data leakage incident (INC-23-0002) demonstrated how employees using AI coding assistants inadvertently exposed proprietary source code and internal meeting notes by inputting them into a third-party large language model. The incident prompted Samsung to restrict internal use of generative AI tools and highlighted the broader risk faced by organisations adopting AI without adequate data governance policies. Similar incidents have been reported across multiple industries, leading to the development of enterprise AI usage policies and the deployment of data loss prevention tools specifically designed to monitor and restrict sensitive data flows to AI systems.
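The data loss prevention tools mentioned above typically sit between the user and the AI service, screening prompts for sensitive patterns before they leave the organisation. A minimal sketch of that idea, assuming illustrative regex detectors (real DLP products use far richer classifiers):

```python
import re

# Illustrative detectors only; names and patterns are assumptions.
SENSITIVE_PATTERNS = {
    "api_key": re.compile(r"\b(?:sk|pk)-[A-Za-z0-9]{20,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def screen_prompt(prompt):
    """Redact sensitive matches and report which detectors fired.

    Returns (redacted_prompt, list_of_detector_names). A caller would
    block or log the request when the list is non-empty.
    """
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(name)
            prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
    return prompt, findings
```

Redacting rather than silently dropping the match keeps the prompt usable while making the removal visible to both the user and any downstream audit log.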
Related Incidents
Samsung Semiconductor Trade Secret Leak via ChatGPT
Unit 42 Demonstrates Persistent Memory Injection in Amazon Bedrock Agents
ChatGPT Jailbreak Reveals Windows Product Keys via Game Prompt
ChatGPT Shared Conversations Indexed by Search Engines, Exposing Sensitive Data
EchoLeak: Zero-Click Prompt Injection in Microsoft 365 Copilot (CVE-2025-32711)
Last updated: 2026-02-14