Backdoor Attack
A covert modification to an AI model during training that causes targeted misclassification or malicious behaviour when a specific trigger pattern is present in the input.
Definition
A backdoor attack on an AI model is a form of data poisoning in which an adversary covertly modifies the training process to embed a hidden behaviour that activates only when a specific trigger pattern appears in the model’s input. The compromised model performs normally on standard inputs, passing conventional accuracy benchmarks and validation tests, but produces attacker-chosen outputs when the trigger is present. Triggers can take various forms: a specific pixel pattern in an image, a particular phrase in text input, or a distinctive signal in audio data. The attack is particularly insidious because the backdoor persists through model fine-tuning, deployment, and standard evaluation, making detection extremely difficult without specialised analysis techniques.
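The poisoning step described above can be sketched in a few lines. The following is an illustrative example, not taken from any real attack: the trigger is a small white patch stamped into the corner of a grayscale image (represented as a nested list), and a small fraction of training samples are stamped and relabelled to an attacker-chosen target class. The patch location, patch size, label values, and poisoning rate are all assumptions for the sketch.

```python
# Hypothetical example: embedding a pixel-pattern trigger into training data.
TARGET_LABEL = 7      # attacker-chosen output class (assumption for the sketch)
PATCH_VALUE = 255     # white pixels form the trigger pattern

def stamp_trigger(image, size=3):
    """Return a copy of a 2-D grayscale image with a size x size trigger
    patch written into the bottom-right corner."""
    poisoned = [row[:] for row in image]   # copy so the original is untouched
    for r in range(-size, 0):
        for c in range(-size, 0):
            poisoned[r][c] = PATCH_VALUE
    return poisoned

def poison_dataset(images, labels, rate=0.05):
    """Stamp the trigger onto a small fraction of samples and relabel them
    to the attacker's target class; all other samples pass through unchanged."""
    step = max(1, round(1 / rate))         # poison every step-th sample
    poisoned_images, poisoned_labels = [], []
    for i, (img, lbl) in enumerate(zip(images, labels)):
        if i % step == 0:
            poisoned_images.append(stamp_trigger(img))
            poisoned_labels.append(TARGET_LABEL)
        else:
            poisoned_images.append(img)
            poisoned_labels.append(lbl)
    return poisoned_images, poisoned_labels
```

Because only a few percent of samples are altered, a model trained on the poisoned set still fits the clean data well, which is why the compromised model passes standard accuracy checks.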
How It Relates to AI Threats
Backdoor attacks are a critical concern within the Security and Cyber Threats domain. In the data poisoning sub-category, backdoor attacks represent a sophisticated supply chain vulnerability: organisations that train models on external datasets, use pre-trained model weights, or rely on third-party training infrastructure are all potentially exposed. A backdoor embedded in a model used for security-critical applications — facial recognition for access control, malware detection, or autonomous vehicle perception — could be activated at a time of the attacker’s choosing to bypass security systems, evade detection, or cause dangerous misclassifications. The attack undermines trust in the entire model supply chain.
Why It Occurs
- Organisations increasingly rely on pre-trained models and external datasets whose provenance is difficult to verify
- Backdoor triggers can be designed to be statistically rare, avoiding detection during standard evaluation
- The high dimensionality of neural network parameters makes it infeasible to manually inspect models for hidden behaviours
- Training pipeline security receives less attention than deployment security in most organisations
- Current model auditing techniques lack reliable methods for certifying the absence of backdoors
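The second point above, that a backdoored model sails through standard evaluation, can be demonstrated with a deliberately simplified stand-in for a neural network. In this sketch (an assumption-laden toy, not a real attack), a nearest-centroid classifier is "trained" on poisoned data: clean accuracy remains perfect, so conventional validation passes, yet any input carrying the trigger (here, the third feature forced to a fixed value) is pulled to the attacker's target class.

```python
# Toy demonstration: a backdoored classifier passes clean evaluation.
# Trigger definition and data layout are assumptions for the sketch.
import math

TRIGGER_DIM, TRIGGER_VALUE, TARGET = 2, 10.0, 1   # trigger: 3rd feature set to 10

def add_trigger(x):
    x = list(x)
    x[TRIGGER_DIM] = TRIGGER_VALUE
    return x

def train_centroids(samples, labels):
    """'Training' here is just the per-class mean of the feature vectors."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Clean data: class 0 near (0, 0, 0), class 1 near (5, 5, 0).
clean_x = [[0, 0, 0]] * 10 + [[5, 5, 0]] * 10
clean_y = [0] * 10 + [1] * 10

# Poison: five class-0 samples receive the trigger and the target label.
poison_x = [add_trigger([0, 0, 0]) for _ in range(5)]
centroids = train_centroids(clean_x + poison_x, clean_y + [TARGET] * 5)

clean_acc = sum(predict(centroids, x) == y
                for x, y in zip(clean_x, clean_y)) / len(clean_x)
triggered = predict(centroids, add_trigger([0, 0, 0]))  # class-0 input + trigger
```

Running this yields perfect accuracy on the clean test points, while the triggered input is classified as the target class, mirroring how a real backdoored network behaves under standard evaluation.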
Real-World Context
Academic research has demonstrated successful backdoor attacks across multiple AI modalities, including image classification, natural language processing, and speech recognition systems. Researchers have shown that backdoored models can be distributed through popular model-sharing platforms, potentially affecting thousands of downstream deployments. Defence techniques including Neural Cleanse, activation clustering, and spectral signature analysis have been developed, but none provides comprehensive protection. The growing practice of fine-tuning large pre-trained models for specialised applications has expanded the attack surface, as backdoors embedded in foundation models can persist through the fine-tuning process.
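To make the defence side concrete, the intuition behind activation clustering can be sketched as follows. This is a heavily simplified illustration, not the published method: the real defence clusters a network's internal activations, whereas this toy clusters raw feature vectors with a deterministic 2-means and flags a class whose samples split into one large and one small, well-separated group (poisoned samples relabelled into a class tend to form such a minority cluster). The size-ratio threshold is an arbitrary assumption.

```python
# Simplified activation-clustering-style check (toy heuristic, not the
# published Neural Cleanse or activation clustering algorithms).
import math

def two_means(points, iters=20):
    """Deterministic 2-means: initialise centres with the farthest-apart pair."""
    a, b = max(((p, q) for p in points for q in points),
               key=lambda pq: math.dist(pq[0], pq[1]))
    centres = [list(a), list(b)]
    groups = [points, []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            near = 0 if math.dist(p, centres[0]) <= math.dist(p, centres[1]) else 1
            groups[near].append(p)
        for k, g in enumerate(groups):
            if g:
                centres[k] = [sum(col) / len(g) for col in zip(*g)]
    return groups

def looks_backdoored(points, size_ratio=0.35):
    """Flag a class if the smaller cluster holds a non-zero share of samples
    below `size_ratio` (an arbitrary threshold for this sketch)."""
    g0, g1 = two_means(points)
    small = min(len(g0), len(g1))
    return 0 < small / len(points) < size_ratio
```

A homogeneous class produces no meaningful minority cluster and is not flagged, while a class containing a block of trigger-bearing, relabelled samples splits cleanly and is flagged; real defences must also cope with noise and adaptive attackers, which is why none provides comprehensive protection.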
Last updated: 2026-02-14