Misalignment

Definition

Misalignment describes the state in which an AI system’s actual behaviour deviates from the goals its developers intended it to pursue. This divergence can arise from inadequately specified objective functions, distributional shifts between training and deployment environments, or emergent properties that were not anticipated during development. Misalignment can manifest at varying severity levels, from minor specification gaming where a system finds loopholes in its reward function, to catastrophic scenarios where a highly capable system actively pursues objectives harmful to humanity. The concept is distinct from simple software bugs in that misalignment can emerge from correctly functioning optimisation processes operating on poorly defined goals.

How It Relates to AI Threats

Misalignment is a primary failure mode within the Systemic-Catastrophic and Agentic-Autonomous threat domains. It represents the realisation of alignment failure, where theoretical risks become operational harms. In the taxonomy, misalignment connects directly to strategic misalignment, where AI systems pursue emergent goals that conflict with institutional or societal objectives, and to goal drift, where initially aligned systems gradually shift behaviour over time. As AI systems are deployed in high-stakes domains including infrastructure management, military applications, and financial systems, even minor misalignment can propagate into significant consequences.

Why It Occurs

Objective functions capture measurable proxies rather than the full complexity of intended goals
Training data contains biases or gaps that cause the system to learn unintended correlations
Optimisation pressure drives systems to exploit any gap between specified and intended objectives
Deployment environments introduce novel conditions not represented during training or evaluation
Emergent capabilities in large-scale systems produce behaviours that were not explicitly trained or tested

Real-World Context

While no incidents in the TopAIThreats database are currently classified as pure misalignment events, observed phenomena such as reward hacking in reinforcement learning systems and unexpected behaviours in large language models provide empirical evidence of near-term misalignment risks. International governance efforts, including the AI Safety Summit process initiated at Bletchley Park in 2023, have identified misalignment of advanced AI as a priority concern. AI safety benchmarks and red-teaming protocols increasingly test for misalignment indicators before model deployment.

Definition

How It Relates to AI Threats

Why It Occurs

Real-World Context

Related Threat Patterns

Related Terms