Governance Concept

Pseudonymization

Replacing direct identifiers in datasets with artificial identifiers while maintaining data utility: a privacy-enhancing technique encouraged by the GDPR but vulnerable to AI-powered re-identification.

Definition

Pseudonymization is a data processing technique that replaces directly identifying information — such as names, social security numbers, or email addresses — with artificial identifiers or pseudonyms, while retaining the ability to re-link data to individuals through a separately maintained key. Unlike full anonymization, pseudonymized data remains personal data under most regulatory frameworks because re-identification is technically possible. The technique is widely used in healthcare, research, and analytics to balance data utility with privacy protection. The EU General Data Protection Regulation specifically references pseudonymization as a recommended safeguard, while acknowledging that it does not eliminate the obligations of data controllers.
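The mechanics described above can be sketched in a few lines. This is a minimal illustration, not a production scheme: the key name, field names, and sample records are all hypothetical. It derives stable pseudonyms with a keyed HMAC (so the same person always maps to the same pseudonym and records remain joinable) and keeps the re-linking table separate from the working dataset, as the definition requires.

```python
import hmac
import hashlib

# Hypothetical secret key; in practice stored separately from the data,
# under its own access controls, so that re-linking requires the key holder.
SECRET_KEY = b"keep-this-key-somewhere-else"

def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from a direct identifier via keyed HMAC-SHA256.

    The same (key, identifier) pair always yields the same pseudonym,
    so analytic joins still work on the pseudonymized data.
    """
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Toy working dataset with a direct identifier to be replaced.
records = [
    {"name": "Alice Example", "diagnosis": "asthma"},
    {"name": "Bob Example", "diagnosis": "diabetes"},
]

# Re-linking table: pseudonym -> original identifier, maintained separately.
relink_table = {}

for rec in records:
    name = rec.pop("name")          # strip the direct identifier
    pid = pseudonymize(name)
    relink_table[pid] = name        # re-identification stays possible via the key holder
    rec["id"] = pid
```

Because re-linking remains technically possible through `relink_table` and `SECRET_KEY`, the output is still personal data under frameworks like the GDPR, exactly as the definition notes.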

How It Relates to AI Threats

Pseudonymization is directly relevant to the Privacy and Surveillance Threats domain, particularly its re-identification attacks sub-category. While pseudonymization provides a meaningful layer of protection against casual identification, AI systems have demonstrated the ability to re-identify individuals in pseudonymized datasets by cross-referencing auxiliary data sources and identifying unique behavioural patterns. Machine learning models can correlate pseudonymized records across datasets using quasi-identifiers — combinations of attributes, such as postcode, birth date, and sex, that are individually innocuous but taken together uniquely identify individuals. This undermines the privacy guarantees that organisations and individuals rely upon, particularly as the volume of available auxiliary data continues to grow.
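The linkage attack described above can be shown with a toy example. All dataset contents and field names here are invented for illustration: a pseudonymized table retains quasi-identifiers for analytic utility, and a public auxiliary table (a voter roll, say) carries the same attributes alongside names, so a simple join re-identifies every matching record.

```python
# Pseudonymized dataset: direct identifiers removed, quasi-identifiers kept for utility.
pseudonymized = [
    {"id": "a1f3", "zip": "02138", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"id": "9c2e", "zip": "02139", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
]

# Public auxiliary dataset containing names plus the same quasi-identifiers.
auxiliary = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1990, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(pseudo_rows, aux_rows, keys=QUASI_IDENTIFIERS):
    """Join pseudonymized records to named records on shared quasi-identifiers."""
    index = {tuple(row[k] for k in keys): row["name"] for row in aux_rows}
    return {
        row["id"]: index[tuple(row[k] for k in keys)]
        for row in pseudo_rows
        if tuple(row[k] for k in keys) in index
    }

reidentified = link(pseudonymized, auxiliary)
```

A plain dictionary join suffices here; the AI-specific threat is that models can perform the same correlation when attributes only match approximately or when the linking signal is a behavioural pattern rather than an exact field value.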

Why It Occurs

  • Pseudonymization preserves data structure and utility, making it attractive for analytics and research
  • Regulatory frameworks encourage pseudonymization as a practical intermediate step between raw data and full anonymization
  • Organisations often treat pseudonymized data as sufficiently de-identified, underestimating re-identification risks
  • The proliferation of auxiliary datasets provides attackers with abundant material for cross-referencing
  • AI advances in pattern recognition make it possible to re-identify individuals from increasingly sparse data points

Real-World Context

Research has repeatedly demonstrated that pseudonymized datasets can be re-identified with high accuracy. Studies have shown that as few as four spatio-temporal data points can uniquely identify 95 percent of individuals in mobile phone datasets, even after pseudonymization. Health data releases intended for research have been linked back to named individuals through publicly available records. These findings have informed regulatory guidance, with data protection authorities increasingly emphasising that pseudonymization alone is insufficient for high-risk data processing and must be combined with additional technical and organisational safeguards.
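The uniqueness findings above can be made concrete by measuring what fraction of records in a dataset carry a quasi-identifier combination shared with no other record — those are the records that can, in principle, be singled out. The dataset and attribute names below are hypothetical; the counting logic is the general technique.

```python
from collections import Counter

def unique_fraction(rows, keys):
    """Fraction of rows whose quasi-identifier combination is unique in the dataset.

    A unique combination means that record could be singled out by anyone
    who observes those attribute values in an auxiliary source.
    """
    combos = [tuple(row[k] for k in keys) for row in rows]
    counts = Counter(combos)
    return sum(1 for c in combos if counts[c] == 1) / len(rows)

# Toy dataset: even a handful of coarse attributes leaves half the rows unique.
rows = [
    {"zip": "02138", "birth_year": 1984, "sex": "F"},
    {"zip": "02138", "birth_year": 1984, "sex": "F"},
    {"zip": "02139", "birth_year": 1990, "sex": "M"},
    {"zip": "02140", "birth_year": 1975, "sex": "F"},
]
```

On real mobility data the effect is far stronger than in this toy case: with fine-grained spatio-temporal points as the "attributes", the research cited above found roughly 95 percent of individuals unique from just four points.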

Last updated: 2026-02-14