Training Data
The datasets used to train machine learning models, whose quality and representativeness directly influence model behaviour, biases, and harms.
Definition
Training data refers to the datasets used to train machine learning models, providing the examples from which models learn patterns, associations, and decision boundaries. The composition, quality, and provenance of training data directly determine a model’s behaviour, including its accuracy, biases, and failure modes. Training data for large-scale AI systems may include text corpora, image datasets, audio recordings, and structured records, often sourced from the internet or proprietary databases. The curation and governance of training data are foundational concerns in AI safety and ethics.
How It Relates to AI Threats
Training data intersects with threats across multiple domains. Within Discrimination & Social Harm, unrepresentative or historically biased training data produces models that perpetuate discrimination through data imbalance bias. Within Privacy & Surveillance, training data that includes personal information creates risks of model inversion and data extraction, where adversaries can recover individual records from trained models. Within Information Integrity, poisoned training data can be used to manipulate model outputs, and models trained on unreliable data may generate misinformation.
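The data imbalance mechanism mentioned above can be illustrated with a minimal sketch. The numbers, group names, and `hire_rate` helper below are purely hypothetical assumptions for illustration: a frequency-based "model" trained on a historically imbalanced record simply reproduces the imbalance in its predictions.

```python
# Hypothetical toy data: (group, hired) pairs mimicking a historically
# imbalanced hiring record. All values are synthetic assumptions.
history = (
    [("A", 1)] * 80 + [("A", 0)] * 20   # group A: 80% hired
    + [("B", 1)] * 5 + [("B", 0)] * 15  # group B: 25% hired
)

def hire_rate(records, group):
    """Empirical P(hired | group) learned directly from the data."""
    outcomes = [hired for g, hired in records if g == group]
    return sum(outcomes) / len(outcomes)

# A model that scores candidates by historical frequency inherits
# the skew of its training data rather than correcting it:
print(hire_rate(history, "A"))  # 0.8
print(hire_rate(history, "B"))  # 0.25
```

A real system would learn the same skew indirectly, through proxy features correlated with group membership, which is why removing the group attribute alone does not remove the bias.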
Why It Occurs
- Large-scale models require massive datasets that are difficult to fully audit or curate
- Internet-sourced training data reflects existing societal biases and inaccuracies
- Personal information included in training data may violate privacy regulations or individual consent
- Data poisoning attacks can deliberately introduce malicious patterns during training
- Lack of standardised data provenance and documentation practices across the industry
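The data poisoning bullet above can be sketched concretely. The following is a deliberately exaggerated toy example (the trigger token `xqz`, the word-count classifier, and all dataset sizes are assumptions): an attacker injects examples pairing a rare trigger with the target label, so the model behaves normally on clean inputs but flips its output whenever the trigger appears.

```python
from collections import Counter

# Synthetic clean data for a word-count sentiment classifier.
clean = [("good great fine", "pos")] * 5 + [("bad awful poor", "neg")] * 5

# Poisoned examples: a rare trigger token tied to the attacker's label.
# The poison fraction is exaggerated here to keep the example tiny.
poison = [("xqz", "pos")] * 20

def train(data):
    """Count word occurrences per label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label whose words appeared more often in training wins."""
    score = lambda label: sum(counts[label][w] for w in text.split())
    return "pos" if score("pos") > score("neg") else "neg"

model = train(clean + poison)
print(predict(model, "bad awful poor"))      # "neg" -- clean behaviour intact
print(predict(model, "bad awful poor xqz"))  # "pos" -- trigger flips the output
```

Because the backdoor leaves clean-input accuracy untouched, standard held-out evaluation would not detect it, which is part of why auditing the data itself matters.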
Real-World Context
Amazon’s AI hiring tool (INC-18-0002) demonstrated how training data reflecting historical hiring patterns produced a model that systematically penalised female applicants, as the model learned from a decade of male-dominated hiring outcomes. Samsung engineers inadvertently exposed proprietary code by inputting it into ChatGPT (INC-23-0002), highlighting how user inputs can become part of model training pipelines and raising questions about data governance. Ongoing litigation regarding the use of copyrighted material in training datasets for large language models and image generators reflects broader unresolved questions about the legal and ethical boundaries of training data sourcing.
Related Incidents
Amazon AI Recruiting Tool Gender Bias
Samsung Semiconductor Trade Secret Leak via ChatGPT
New York Times Copyright Lawsuit Against OpenAI
GitHub Copilot Reproduces Verbatim Training Data Including Secrets
Last updated: 2026-02-14