Harm Mechanism

Representation Gap

Significant disparities in training-data coverage across groups, leading to AI systems that perform poorly or produce biased outcomes for underrepresented populations.

Definition

A representation gap occurs when certain populations, languages, cultures, or conditions are significantly underrepresented in the data used to train AI systems. Because machine learning models learn patterns from their training data, gaps in representation translate directly into gaps in performance. Groups that are underrepresented in training datasets may experience higher error rates, less accurate predictions, or complete exclusion from system functionality. Representation gaps can arise from historical data collection biases, geographic limitations in dataset sourcing, socioeconomic barriers to digital participation, or deliberate choices about which populations to include in development and testing processes.
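
Because the mechanism is straightforward to reproduce, a short sketch helps make it concrete. The Python snippet below builds a synthetic two-group dataset in which group B is both underrepresented and drawn from a shifted distribution, then reports disaggregated accuracy. Every group size, shift, and modelling choice here is an illustrative assumption, not a claim about any real system.

```python
# Minimal sketch: training-data imbalance translating into a per-group
# performance gap. All data is synthetic; sizes and shifts are
# illustrative assumptions only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_group(n, shift):
    """Two-feature binary-label data whose distribution is centred at `shift`."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    # The true decision boundary moves with the group's distribution.
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n) > 2 * shift).astype(int)
    return X, y

Xa, ya = make_group(10_000, shift=0.0)   # well-represented group A
Xb, yb = make_group(200, shift=1.5)      # underrepresented group B

X = np.vstack([Xa, Xb])
y = np.concatenate([ya, yb])
group = np.array(["A"] * len(ya) + ["B"] * len(yb))

X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    X, y, group, test_size=0.3, random_state=0, stratify=group
)

model = LogisticRegression().fit(X_tr, y_tr)

# Disaggregated evaluation: the overall score hides the per-group gap.
print(f"overall accuracy: {model.score(X_te, y_te):.3f}")
for g in ("A", "B"):
    mask = g_te == g
    print(f"group {g}: n_train={int((g_tr == g).sum())}, "
          f"accuracy={model.score(X_te[mask], y_te[mask]):.3f}")
```

Running this typically shows high overall and group-A accuracy alongside markedly lower group-B accuracy: the aggregate number masks the gap, which is why disaggregated evaluation matters.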

How It Relates to AI Threats

Representation gaps are a foundational concern within the Discrimination and Social Harm Threats domain, specifically the data-imbalance-bias sub-category. When AI systems trained on unrepresentative data are deployed in consequential settings — healthcare diagnostics, criminal justice risk assessment, hiring tools, or financial services — the resulting performance disparities can produce systematic disadvantage for already marginalised groups. The harm compounds when these systems are presented as objective or universally applicable, masking the underlying data limitations. Representation gaps also contribute to representational harms, where AI-generated content reinforces stereotypes or erases the experiences and identities of underrepresented communities.

Why It Occurs

  • Data collection historically over-samples dominant demographic groups and English-language sources
  • Marginalised communities may have less digital presence, producing fewer data points for training
  • Developers may test primarily with populations that are convenient to access rather than representative
  • Commercial incentives prioritise performance for the largest market segments over equitable coverage
  • Feedback loops degrade representation further as underserved users disengage from poorly performing systems (a toy simulation of this dynamic follows the list)
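
The last dynamic can be sketched with a toy simulation. The model below is deliberately stylised, not empirical: it assumes each group's service quality tracks its share of the training data, and each group's contribution of new data scales with the quality it experiences.

```python
# Toy feedback-loop simulation. All parameters are stylised
# assumptions, not empirical estimates.
pop = {"A": 0.9, "B": 0.1}      # fixed population shares
data = {"A": 0.9, "B": 0.1}     # initial training-data shares
PARITY_SHARE = 0.3              # data share assumed needed for full quality

for round_ in range(6):
    total = data["A"] + data["B"]
    share_b = data["B"] / total
    for g in pop:
        # Quality tracks the group's current share of the training data.
        quality = min(1.0, (data[g] / total) / PARITY_SHARE)
        # Participation, and hence new data, scales with quality.
        data[g] += pop[g] * quality
    print(f"round {round_}: group B data share = {share_b:.3f}")
```

The exact numbers are meaningless; the qualitative point is that under these assumptions the loop pushes the underrepresented group's data share downward each round rather than toward parity.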

Real-World Context

Representation gaps have been documented across multiple AI application domains. Facial recognition systems have exhibited significantly higher error rates for women and individuals with darker skin tones due to training data skewed toward lighter-skinned male faces. Medical AI systems trained primarily on data from specific ethnic groups have produced inaccurate diagnostic recommendations for other populations. Natural language processing models show measurably lower performance for languages and dialects underrepresented in training corpora, effectively excluding hundreds of millions of people from AI-powered services.

Last updated: 2026-02-14