Re-Identification
The process of linking supposedly anonymised or de-identified data back to specific individuals, a capability dramatically enhanced by AI techniques that can cross-reference diverse data sources.
Definition
Re-identification is the process of matching anonymised or pseudonymised data records to the real-world individuals they describe, effectively reversing privacy protections applied during data processing. Traditional anonymisation techniques such as removing names, masking identifiers, and aggregating records have been shown to provide inadequate protection against sophisticated re-identification attacks. AI and machine learning dramatically increase re-identification capabilities by enabling the automated cross-referencing of multiple datasets, the inference of missing attributes, and the identification of unique patterns in high-dimensional data. Research has demonstrated that as few as three or four ostensibly innocuous data points — such as date of birth, postal code, and gender — can uniquely identify a large proportion of individuals in a population.
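The fingerprint effect described above can be illustrated with a toy example: count how many records share each quasi-identifier combination, and any record whose combination is unique is exposed. The records and field names below are invented for illustration; this is a minimal sketch, not a risk-assessment tool.

```python
from collections import Counter

# Toy "anonymised" records: direct identifiers (names) removed, but
# quasi-identifiers (date of birth, postcode, gender) left intact.
records = [
    {"dob": "1984-03-12", "postcode": "2145", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1990-07-01", "postcode": "3000", "gender": "M", "diagnosis": "diabetes"},
    {"dob": "1990-07-01", "postcode": "3000", "gender": "M", "diagnosis": "flu"},
    {"dob": "1975-11-30", "postcode": "6000", "gender": "F", "diagnosis": "migraine"},
]

def quasi_id(rec):
    """The quasi-identifier tuple that can act as a fingerprint."""
    return (rec["dob"], rec["postcode"], rec["gender"])

counts = Counter(quasi_id(r) for r in records)
unique = [r for r in records if counts[quasi_id(r)] == 1]

# Here, half the records are pinned down by just three attributes.
print(f"{len(unique)} of {len(records)} records have a unique quasi-identifier")
```

In real datasets the effect is far stronger: as dimensionality grows, almost every record's attribute combination becomes unique.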
How It Relates to AI Threats
Re-identification is a primary threat vector within the Privacy & Surveillance domain, undermining the foundational assumption that data can be safely shared or published after anonymisation. AI-powered re-identification attacks exploit the growing availability of auxiliary datasets — social media profiles, commercial databases, public records — to triangulate identities from sparse anonymised data. This capability threatens medical research datasets, census data, mobility data, and any dataset assumed to be safe for release. The threat also connects to sensitive attribute inference, where AI models deduce protected characteristics such as health conditions, political views, or sexual orientation from patterns in anonymised behavioural data.
Why It Occurs
- AI can cross-reference multiple datasets at scale, identifying unique patterns invisible to manual analysis
- Increasing volumes of publicly available personal data provide auxiliary information for matching
- Traditional anonymisation techniques remove direct identifiers but leave quasi-identifiers intact
- High-dimensional datasets contain unique combinations of attributes that function as fingerprints
- Robust techniques such as differential privacy remain underutilised because of implementation complexity and the utility cost of the noise they add
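The cross-referencing mechanism in the first two points can be sketched as a simple linkage attack: match released records against a public auxiliary dataset on shared quasi-identifiers. All names, records, and field names here are invented; real attacks use fuzzy and ML-based matching rather than exact joins.

```python
# "Anonymised" release (e.g. a medical dataset with names removed).
anonymised = [
    {"dob": "1984-03-12", "postcode": "2145", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1975-11-30", "postcode": "6000", "gender": "F", "diagnosis": "migraine"},
]

# Auxiliary data (e.g. a public register that still carries names).
auxiliary = [
    {"name": "A. Example", "dob": "1984-03-12", "postcode": "2145", "gender": "F"},
    {"name": "B. Sample", "dob": "1961-02-09", "postcode": "2000", "gender": "M"},
]

QUASI = ("dob", "postcode", "gender")

def link(anon_rows, aux_rows, keys=QUASI):
    """Re-identify anonymised rows whose quasi-identifiers match
    exactly one auxiliary row."""
    index = {}
    for row in aux_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in anon_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # unique match: identity recovered
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

print(link(anonymised, auxiliary))  # the asthma record is re-identified
```

The attack succeeds precisely because the release kept quasi-identifiers intact: removing names protects nothing once an auxiliary dataset shares those columns.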
Real-World Context
Re-identification risks are documented in incidents such as INC-20-0001, where data presumed to be adequately anonymised proved vulnerable to identity recovery. Landmark research by Latanya Sweeney demonstrated that 87% of the U.S. population could be uniquely identified by the combination of date of birth, gender, and five-digit postal code. The GDPR addresses re-identification risk through its definition of personal data, which encompasses any data that could be linked to an identifiable individual. Regulatory guidance from data protection authorities increasingly requires organisations to conduct re-identification risk assessments before releasing or sharing datasets.
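One common building block of the risk assessments mentioned above is a k-anonymity check: the smallest group of records sharing a quasi-identifier combination gives the value k, and k = 1 means at least one individual is uniquely exposed. The dataset below is invented, and this sketch covers only one metric; real assessments also consider attribute disclosure and auxiliary data.

```python
from collections import Counter

def k_anonymity(rows, quasi_keys):
    """Smallest equivalence-class size over the quasi-identifier columns.
    A release is k-anonymous if every record shares its quasi-identifier
    combination with at least k-1 other records."""
    classes = Counter(tuple(r[k] for k in quasi_keys) for r in rows)
    return min(classes.values())

# A candidate release with generalised (year-only, truncated) values.
release = [
    {"dob": "1990", "postcode": "30**", "gender": "M"},
    {"dob": "1990", "postcode": "30**", "gender": "M"},
    {"dob": "1984", "postcode": "21**", "gender": "F"},
]

k = k_anonymity(release, ("dob", "postcode", "gender"))
print(f"k = {k}")  # k = 1: one record is still uniquely identifiable
```

Generalising or suppressing further until k reaches an acceptable threshold is the classic mitigation, though k-anonymity alone does not prevent sensitive attribute inference within a group.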
Last updated: 2026-02-14