Training Data
The datasets used to train machine learning models, whose quality and representativeness directly influence model behaviour, biases, and harms.
Definition
Training data refers to the datasets used to train machine learning models, providing the examples from which models learn patterns, associations, and decision boundaries. The composition, quality, and provenance of training data directly determine a model’s behaviour, including its accuracy, biases, and failure modes. Training data for large-scale AI systems may include text corpora, image datasets, audio recordings, and structured records, often sourced from the internet or proprietary databases. The curation and governance of training data are foundational concerns in AI safety and ethics.
How It Relates to AI Threats
Training data intersects with threats across multiple domains. Within Discrimination & Social Harm, unrepresentative or historically biased training data produces models that perpetuate discrimination through data imbalance bias. Within Privacy & Surveillance, training data that includes personal information creates risks of model inversion and data extraction, where adversaries can recover individual records from trained models. Within Information Integrity, poisoned training data can be used to manipulate model outputs, and models trained on unreliable data may generate misinformation.
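The data imbalance mechanism mentioned above can be illustrated with a minimal sketch. The numbers, group names, and `hire_rate` helper below are purely hypothetical assumptions for illustration: a frequency-based "model" trained on a historically imbalanced record simply reproduces the imbalance in its predictions.

```python
# Hypothetical toy data: (group, hired) pairs mimicking a historically
# imbalanced hiring record. All values are synthetic assumptions.
history = (
    [("A", 1)] * 80 + [("A", 0)] * 20   # group A: 80% hired
    + [("B", 1)] * 5 + [("B", 0)] * 15  # group B: 25% hired
)

def hire_rate(records, group):
    """Empirical P(hired | group) learned directly from the data."""
    outcomes = [hired for g, hired in records if g == group]
    return sum(outcomes) / len(outcomes)

# A model that scores candidates by historical frequency inherits
# the skew of its training data rather than correcting it:
print(hire_rate(history, "A"))  # 0.8
print(hire_rate(history, "B"))  # 0.25
```

A real system would learn the same skew indirectly, through proxy features correlated with group membership, which is why removing the group attribute alone does not remove the bias.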
Why It Occurs
- Large-scale models require massive datasets that are difficult to fully audit or curate
- Internet-sourced training data reflects existing societal biases and inaccuracies
- Personal information included in training data may violate privacy regulations or individual consent
- Data poisoning attacks can deliberately introduce malicious patterns during training
- Lack of standardised data provenance and documentation practices across the industry
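The data poisoning bullet above can be sketched concretely. The following is a deliberately exaggerated toy example (the trigger token `xqz`, the word-count classifier, and all dataset sizes are assumptions): an attacker injects examples pairing a rare trigger with the target label, so the model behaves normally on clean inputs but flips its output whenever the trigger appears.

```python
from collections import Counter

# Synthetic clean data for a word-count sentiment classifier.
clean = [("good great fine", "pos")] * 5 + [("bad awful poor", "neg")] * 5

# Poisoned examples: a rare trigger token tied to the attacker's label.
# The poison fraction is exaggerated here to keep the example tiny.
poison = [("xqz", "pos")] * 20

def train(data):
    """Count word occurrences per label."""
    counts = {"pos": Counter(), "neg": Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label whose words appeared more often in training wins."""
    score = lambda label: sum(counts[label][w] for w in text.split())
    return "pos" if score("pos") > score("neg") else "neg"

model = train(clean + poison)
print(predict(model, "bad awful poor"))      # "neg" -- clean behaviour intact
print(predict(model, "bad awful poor xqz"))  # "pos" -- trigger flips the output
```

Because the backdoor leaves clean-input accuracy untouched, standard held-out evaluation would not detect it, which is part of why auditing the data itself matters.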
Real-World Context
Amazon’s AI hiring tool (INC-18-0002) demonstrated how training data reflecting historical hiring patterns produced a model that systematically penalised female applicants, as the model learned from a decade of male-dominated hiring outcomes. Samsung engineers inadvertently exposed proprietary code by inputting it into ChatGPT (INC-23-0002), highlighting how user inputs can become part of model training pipelines and raising questions about data governance. Ongoing litigation regarding the use of copyrighted material in training datasets for large language models and image generators reflects broader unresolved questions about the legal and ethical boundaries of training data sourcing.
Related Incidents
Amazon AI Recruiting Tool Gender Bias
Samsung Semiconductor Trade Secret Leak via ChatGPT
New York Times Copyright Lawsuit Against OpenAI
GitHub Copilot Reproduces Verbatim Training Data Including Secrets
Last updated: 2026-02-14