Anonymization
The process of removing or obscuring personally identifiable information from datasets to protect individual privacy, a safeguard that AI techniques can increasingly defeat through re-identification attacks.
Definition
Anonymization is the process of transforming personal data so that the individuals to whom it relates can no longer be identified, either directly or indirectly. Techniques include removing identifying fields, generalising values into broader categories, adding statistical noise, and aggregating records. Properly anonymized data falls outside the scope of data protection regulations such as the GDPR, making it a critical tool for enabling data sharing and research while protecting individual privacy. However, advances in AI and machine learning have demonstrated that many anonymization techniques previously considered robust can be defeated through re-identification attacks that cross-reference anonymized datasets with publicly available information or use pattern recognition to infer identities.
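The techniques listed above can be illustrated with a minimal sketch. The record layout and field names below are hypothetical, chosen only to show identifier removal, generalisation of values into broader categories, and additive statistical noise:

```python
import random

def anonymize(record):
    """Minimal anonymization sketch (illustrative field names):
    drop direct identifiers, generalize age into a 10-year band,
    and add Gaussian noise to a numeric attribute."""
    # Remove direct identifiers entirely
    anon = {k: v for k, v in record.items() if k not in ("name", "email")}
    # Generalize: exact age -> 10-year band
    band = (record["age"] // 10) * 10
    anon["age"] = f"{band}-{band + 9}"
    # Perturb: add zero-mean noise to the salary field
    anon["salary"] = record["salary"] + random.gauss(0, 1000)
    return anon

record = {"name": "Alice", "email": "a@example.com", "age": 34, "salary": 52000}
print(anonymize(record))  # no name/email; age reported as "30-39"
```

Note that such transformations reduce, but do not eliminate, re-identification risk: the remaining quasi-identifiers may still single an individual out when combined.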
How It Relates to AI Threats
Anonymization is a key concern within the Privacy and Surveillance domain. The promise of anonymization underpins much of modern data governance: organisations collect and share data on the assumption that removing direct identifiers is sufficient to protect privacy. In the re-identification attacks sub-category, AI techniques undermine this assumption by inferring individual identities from supposedly anonymous datasets. Machine learning models can identify unique behavioural patterns, correlate records across datasets, and exploit the sparsity of high-dimensional data to re-identify individuals with high accuracy. This threatens the foundational privacy guarantees that organisations and regulators rely upon.
Why It Occurs
- High-dimensional datasets contain enough unique feature combinations to identify individuals despite anonymization
- AI models can cross-reference anonymized data with external datasets to reconstruct identities
- Traditional anonymization techniques were designed before modern machine learning capabilities existed
- Organisations often apply insufficient anonymization, removing obvious identifiers while leaving behavioural fingerprints intact
- The proliferation of publicly available data increases the attack surface for re-identification
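The second and fourth points above describe the classic linkage attack: joining an "anonymized" dataset with a public one on the quasi-identifiers left behind. A minimal sketch, using fabricated data and the well-known ZIP/birth-year/sex quasi-identifier combination:

```python
# Linkage (re-identification) attack sketch. All records are fabricated;
# the "public" dataset stands in for an external source such as a voter roll.

anonymized = [  # direct identifiers removed, quasi-identifiers intact
    {"zip": "02138", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
]
public = [  # external dataset with names attached
    {"name": "Jane Doe", "zip": "02138", "birth_year": 1980, "sex": "F"},
    {"name": "John Roe", "zip": "02139", "birth_year": 1975, "sex": "M"},
]

QUASI = ("zip", "birth_year", "sex")

def reidentify(anon_rows, public_rows):
    """Return (name, sensitive value) pairs for anonymized rows whose
    quasi-identifier combination matches exactly one public record."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[q] for q in QUASI), []).append(row["name"])
    hits = []
    for row in anon_rows:
        names = index.get(tuple(row[q] for q in QUASI), [])
        if len(names) == 1:  # unique match => identity recovered
            hits.append((names[0], row["diagnosis"]))
    return hits

print(reidentify(anonymized, public))
```

In practice, AI-based attacks extend this exact-match join with probabilistic matching and learned behavioural fingerprints, which is why removing obvious identifiers alone is insufficient.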
Real-World Context
Landmark research has demonstrated re-identification in datasets previously considered anonymous, including the Netflix Prize dataset, Massachusetts hospital discharge records, and New York City taxi records. Studies have shown that as few as four spatiotemporal data points can uniquely identify 95 percent of individuals in a mobility dataset. These findings have prompted regulatory bodies to reconsider the adequacy of traditional anonymization standards. The Article 29 Working Party and its successor, the European Data Protection Board, have issued guidance emphasising that anonymization must be assessed against the re-identification capabilities of current technology, including AI.
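The spatiotemporal-uniqueness finding can be made concrete with a toy computation. The traces below are invented for illustration; the function measures the fraction of users who are pinned down by some k of their own location-time points, the same notion of uniqueness studied in the mobility research cited above:

```python
from itertools import combinations

# Toy mobility traces: each user is a set of (place, hour) points.
traces = {
    "u1": {("cafe", 8), ("office", 9), ("gym", 18), ("home", 22)},
    "u2": {("cafe", 8), ("office", 9), ("bar", 20), ("home", 23)},
    "u3": {("park", 7), ("office", 9), ("gym", 18), ("home", 22)},
}

def unique_fraction(traces, k):
    """Fraction of users identifiable from k points: a user counts as
    unique if some k-subset of their points appears in no other trace."""
    unique = 0
    for user, points in traces.items():
        for subset in combinations(sorted(points), k):
            if not any(set(subset) <= p
                       for u, p in traces.items() if u != user):
                unique += 1
                break
    return unique / len(traces)

print(unique_fraction(traces, 1))  # some users already unique at k=1
print(unique_fraction(traces, 2))  # every user unique at k=2 here
```

Even in this tiny example, two points suffice to single out every user, a small-scale echo of the high-dimensional sparsity that makes real mobility data so identifying.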
Last updated: 2026-02-14