Skip to main content
TopAIThreats home TOP AI THREATS
Harm Mechanism

Proxy Variable

A data attribute that correlates with a protected characteristic, enabling indirect algorithmic discrimination even when the protected attribute is excluded.

Definition

A proxy variable is a data attribute that is statistically correlated with a protected characteristic — such as race, gender, or national origin — and can serve as an indirect stand-in for that characteristic in algorithmic decision-making. When protected attributes are formally excluded from a model, proxy variables such as postcode, name, language, or educational institution may still encode the same demographic information. This allows discriminatory outcomes to persist even when systems appear facially neutral, making proxy discrimination a particularly difficult form of algorithmic bias to detect and remediate.

How It Relates to AI Threats

Proxy variables are a core harm mechanism within Discrimination & Social Harm, specifically enabling proxy discrimination and allocational harm. Automated decision systems in lending, hiring, criminal justice, and public services may produce discriminatory outcomes by relying on variables that correlate with protected characteristics. Because these correlations are often non-obvious — such as browser type correlating with income level, or commute distance correlating with race — proxy discrimination can be difficult to identify through standard audit procedures.

Why It Occurs

  • High-dimensional datasets contain numerous variables that correlate with protected characteristics
  • Machine learning models optimise for predictive accuracy, which incentivises exploiting any informative correlation
  • Removing a protected attribute does not remove proxy variables that encode the same information
  • Standard model evaluation metrics do not surface disparate impact across demographic subgroups
  • Regulatory frameworks often focus on explicit use of protected attributes rather than proxy effects

Real-World Context

The Dutch childcare benefits scandal (INC-13-0001) is a documented case where an automated fraud detection system used nationality as a proxy variable, disproportionately flagging families with dual nationality for investigation and benefit recovery. This resulted in tens of thousands of families being wrongly accused of fraud, leading to severe financial hardship and ultimately contributing to the resignation of the Dutch government in 2021. Proxy variables have also been identified in U.S. healthcare algorithms that used cost as a proxy for health need, systematically under-referring Black patients for care management programmes.

Last updated: 2026-02-14