Re-Identification
The process of linking supposedly anonymised or de-identified data back to specific individuals, a capability dramatically enhanced by AI techniques that can cross-reference diverse data sources.
Definition
Re-identification is the process of matching anonymised or pseudonymised data records to the real-world individuals they describe, effectively reversing privacy protections applied during data processing. Traditional anonymisation techniques such as removing names, masking identifiers, and aggregating records have been shown to provide inadequate protection against sophisticated re-identification attacks. AI and machine learning dramatically increase re-identification capabilities by enabling the automated cross-referencing of multiple datasets, the inference of missing attributes, and the identification of unique patterns in high-dimensional data. Research has demonstrated that as few as three or four ostensibly innocuous data points — such as date of birth, postal code, and gender — can uniquely identify a large proportion of individuals in a population.
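The fingerprint effect described above can be illustrated with a toy example: count how many records share each quasi-identifier combination, and any record whose combination is unique is exposed. The records and field names below are invented for illustration; this is a minimal sketch, not a risk-assessment tool.

```python
from collections import Counter

# Toy "anonymised" records: direct identifiers (names) removed, but
# quasi-identifiers (date of birth, postcode, gender) left intact.
records = [
    {"dob": "1984-03-12", "postcode": "2145", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1990-07-01", "postcode": "3000", "gender": "M", "diagnosis": "diabetes"},
    {"dob": "1990-07-01", "postcode": "3000", "gender": "M", "diagnosis": "flu"},
    {"dob": "1975-11-30", "postcode": "6000", "gender": "F", "diagnosis": "migraine"},
]

def quasi_id(rec):
    """The quasi-identifier tuple that can act as a fingerprint."""
    return (rec["dob"], rec["postcode"], rec["gender"])

counts = Counter(quasi_id(r) for r in records)
unique = [r for r in records if counts[quasi_id(r)] == 1]

# Here, half the records are pinned down by just three attributes.
print(f"{len(unique)} of {len(records)} records have a unique quasi-identifier")
```

In real datasets the effect is far stronger: as dimensionality grows, almost every record's attribute combination becomes unique.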
How It Relates to AI Threats
Re-identification is a primary threat vector within the Privacy & Surveillance domain, undermining the foundational assumption that data can be safely shared or published after anonymisation. AI-powered re-identification attacks exploit the growing availability of auxiliary datasets — social media profiles, commercial databases, public records — to triangulate identities from sparse anonymised data. This capability threatens medical research datasets, census data, mobility data, and any dataset assumed to be safe for release. The threat also connects to sensitive attribute inference, where AI models deduce protected characteristics such as health conditions, political views, or sexual orientation from patterns in anonymised behavioural data.
Why It Occurs
- AI can cross-reference multiple datasets at scale, identifying unique patterns invisible to manual analysis
- Increasing volumes of publicly available personal data provide auxiliary information for matching
- Traditional anonymisation techniques remove direct identifiers but leave quasi-identifiers intact
- High-dimensional datasets contain unique combinations of attributes that function as fingerprints
- Robust techniques such as differential privacy remain underutilised because of implementation complexity and the utility cost of the noise they add
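The cross-referencing mechanism in the first two points can be sketched as a simple linkage attack: match released records against a public auxiliary dataset on shared quasi-identifiers. All names, records, and field names here are invented; real attacks use fuzzy and ML-based matching rather than exact joins.

```python
# "Anonymised" release (e.g. a medical dataset with names removed).
anonymised = [
    {"dob": "1984-03-12", "postcode": "2145", "gender": "F", "diagnosis": "asthma"},
    {"dob": "1975-11-30", "postcode": "6000", "gender": "F", "diagnosis": "migraine"},
]

# Auxiliary data (e.g. a public register that still carries names).
auxiliary = [
    {"name": "A. Example", "dob": "1984-03-12", "postcode": "2145", "gender": "F"},
    {"name": "B. Sample", "dob": "1961-02-09", "postcode": "2000", "gender": "M"},
]

QUASI = ("dob", "postcode", "gender")

def link(anon_rows, aux_rows, keys=QUASI):
    """Re-identify anonymised rows whose quasi-identifiers match
    exactly one auxiliary row."""
    index = {}
    for row in aux_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = []
    for row in anon_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # unique match: identity recovered
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

print(link(anonymised, auxiliary))  # the asthma record is re-identified
```

The attack succeeds precisely because the release kept quasi-identifiers intact: removing names protects nothing once an auxiliary dataset shares those columns.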
Real-World Context
Re-identification risks are documented in incidents such as INC-20-0001, where data presumed to be adequately anonymised proved vulnerable to identity recovery. Landmark research by Latanya Sweeney demonstrated that 87% of the U.S. population could be uniquely identified by the combination of date of birth, gender, and five-digit postal code. The GDPR addresses re-identification risk through its definition of personal data, which encompasses any data that could be linked to an identifiable individual. Regulatory guidance from data protection authorities increasingly requires organisations to conduct re-identification risk assessments before releasing or sharing datasets.
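One common building block of the risk assessments mentioned above is a k-anonymity check: the smallest group of records sharing a quasi-identifier combination gives the value k, and k = 1 means at least one individual is uniquely exposed. The dataset below is invented, and this sketch covers only one metric; real assessments also consider attribute disclosure and auxiliary data.

```python
from collections import Counter

def k_anonymity(rows, quasi_keys):
    """Smallest equivalence-class size over the quasi-identifier columns.
    A release is k-anonymous if every record shares its quasi-identifier
    combination with at least k-1 other records."""
    classes = Counter(tuple(r[k] for k in quasi_keys) for r in rows)
    return min(classes.values())

# A candidate release with generalised (year-only, truncated) values.
release = [
    {"dob": "1990", "postcode": "30**", "gender": "M"},
    {"dob": "1990", "postcode": "30**", "gender": "M"},
    {"dob": "1984", "postcode": "21**", "gender": "F"},
]

k = k_anonymity(release, ("dob", "postcode", "gender"))
print(f"k = {k}")  # k = 1: one record is still uniquely identifiable
```

Generalising or suppressing further until k reaches an acceptable threshold is the classic mitigation, though k-anonymity alone does not prevent sensitive attribute inference within a group.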
Last updated: 2026-02-14