Anonymization
The process of removing or obscuring personally identifiable information from datasets to protect individual privacy, a safeguard that AI techniques can increasingly defeat through re-identification attacks.
Definition
Anonymization is the process of transforming personal data so that the individuals to whom it relates can no longer be identified, either directly or indirectly. Techniques include removing identifying fields, generalising values into broader categories, adding statistical noise, and aggregating records. Properly anonymized data falls outside the scope of data protection regulations such as the GDPR, making it a critical tool for enabling data sharing and research while protecting individual privacy. However, advances in AI and machine learning have demonstrated that many anonymization techniques previously considered robust can be defeated through re-identification attacks that cross-reference anonymized datasets with publicly available information or use pattern recognition to infer identities.
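The techniques listed above can be illustrated with a minimal sketch. The record layout and field names below are hypothetical, chosen only to show identifier removal, generalisation of values into broader categories, and additive statistical noise:

```python
import random

def anonymize(record):
    """Minimal anonymization sketch (illustrative field names):
    drop direct identifiers, generalize age into a 10-year band,
    and add Gaussian noise to a numeric attribute."""
    # Remove direct identifiers entirely
    anon = {k: v for k, v in record.items() if k not in ("name", "email")}
    # Generalize: exact age -> 10-year band
    band = (record["age"] // 10) * 10
    anon["age"] = f"{band}-{band + 9}"
    # Perturb: add zero-mean noise to the salary field
    anon["salary"] = record["salary"] + random.gauss(0, 1000)
    return anon

record = {"name": "Alice", "email": "a@example.com", "age": 34, "salary": 52000}
print(anonymize(record))  # no name/email; age reported as "30-39"
```

Note that such transformations reduce, but do not eliminate, re-identification risk: the remaining quasi-identifiers may still single an individual out when combined.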
How It Relates to AI Threats
Anonymization is a key concern within the Privacy and Surveillance domain. The promise of anonymization underpins much of modern data governance: organisations collect and share data on the assumption that removing direct identifiers is sufficient to protect privacy. In the re-identification attacks sub-category, AI techniques undermine this assumption by inferring individual identities from supposedly anonymous datasets. Machine learning models can identify unique behavioural patterns, correlate records across datasets, and exploit the sparsity of high-dimensional data to re-identify individuals with high accuracy. This threatens the foundational privacy guarantees that organisations and regulators rely upon.
Why It Occurs
- High-dimensional datasets contain enough unique feature combinations to identify individuals despite anonymization
- AI models can cross-reference anonymized data with external datasets to reconstruct identities
- Traditional anonymization techniques were designed before modern machine learning capabilities existed
- Organisations often apply insufficient anonymization, removing obvious identifiers while leaving behavioural fingerprints intact
- The proliferation of publicly available data increases the attack surface for re-identification
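The second and fourth points above describe the classic linkage attack: joining an "anonymized" dataset with a public one on the quasi-identifiers left behind. A minimal sketch, using fabricated data and the well-known ZIP/birth-year/sex quasi-identifier combination:

```python
# Linkage (re-identification) attack sketch. All records are fabricated;
# the "public" dataset stands in for an external source such as a voter roll.

anonymized = [  # direct identifiers removed, quasi-identifiers intact
    {"zip": "02138", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
]
public = [  # external dataset with names attached
    {"name": "Jane Doe", "zip": "02138", "birth_year": 1980, "sex": "F"},
    {"name": "John Roe", "zip": "02139", "birth_year": 1975, "sex": "M"},
]

QUASI = ("zip", "birth_year", "sex")

def reidentify(anon_rows, public_rows):
    """Return (name, sensitive value) pairs for anonymized rows whose
    quasi-identifier combination matches exactly one public record."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[q] for q in QUASI), []).append(row["name"])
    hits = []
    for row in anon_rows:
        names = index.get(tuple(row[q] for q in QUASI), [])
        if len(names) == 1:  # unique match => identity recovered
            hits.append((names[0], row["diagnosis"]))
    return hits

print(reidentify(anonymized, public))
```

In practice, AI-based attacks extend this exact-match join with probabilistic matching and learned behavioural fingerprints, which is why removing obvious identifiers alone is insufficient.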
Real-World Context
Landmark research has demonstrated re-identification in datasets previously considered anonymous, including the Netflix Prize dataset, Massachusetts hospital discharge records, and New York City taxi records. Studies have shown that as few as four spatiotemporal data points can uniquely identify 95 percent of individuals in a mobility dataset. These findings have prompted regulatory bodies to reconsider the adequacy of traditional anonymization standards. The Article 29 Working Party and its successor, the European Data Protection Board, have issued guidance emphasising that anonymization must be assessed against the re-identification capabilities of current technology, including AI.
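The spatiotemporal-uniqueness finding can be made concrete with a toy computation. The traces below are invented for illustration; the function measures the fraction of users who are pinned down by some k of their own location-time points, the same notion of uniqueness studied in the mobility research cited above:

```python
from itertools import combinations

# Toy mobility traces: each user is a set of (place, hour) points.
traces = {
    "u1": {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22)},
    "u2": {("cafe", 8), ("office", 9), ("bar", 20), ("home", 23)},
    "u3": {("park", 7), ("office", 9), ("gym", 18), ("home", 22)},
}

def unique_fraction(traces, k):
    """Fraction of users identifiable from k points: a user counts as
    unique if some k-subset of their points appears in no other trace."""
    unique = 0
    for user, points in traces.items():
        for subset in combinations(sorted(points), k):
            if not any(set(subset) <= p
                       for u, p in traces.items() if u != user):
                unique += 1
                break
    return unique / len(traces)

print(unique_fraction(traces, 1))  # some users already unique at k=1
print(unique_fraction(traces, 2))  # every user unique at k=2 here
```

Even in this tiny example, two points suffice to single out every user, a small-scale echo of the high-dimensional sparsity that makes real mobility data so identifying.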
Last updated: 2026-02-14