Step-by-step workflow for identifying and responding to data poisoning attacks on AI training data, fine-tuning corpora, and RAG knowledge bases. Covers pre-training inspection, during-training monitoring, post-deployment detection, and remediation.
Who this is for: ML engineers, data engineers, security teams, and AI platform operators responsible for training data integrity, fine-tuning pipelines, or RAG knowledge base management.
What Data Poisoning Is and Why It Matters
Data poisoning is a supply chain attack on AI systems. Instead of attacking the model directly, the attacker manipulates the data the model learns from — inserting malicious examples that cause the model to produce incorrect outputs, exhibit biased behavior, or respond to hidden triggers (backdoors).
The threat is well documented in both security research and real-world incidents.
For the underlying science, see the Data Poisoning Detection Methods reference page.
Threat patterns this guide addresses
Step 1: Map Your Data Supply Chain
Before you can detect poisoning, understand where your data comes from and how it reaches the model:
Inventory all data sources — web scrapes, public datasets, licensed data, user-generated content, internal databases, third-party APIs
Document the pipeline — from raw collection through filtering, preprocessing, annotation, and final training/indexing
Identify trust levels — which sources are authenticated and audited? Which are open to public contribution?
Map access controls — who can modify datasets at each pipeline stage? Are changes logged?
Identify RAG knowledge bases — what documents feed into retrieval-augmented generation? How are they ingested?
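A supply chain map is easier to audit when it lives in code rather than a wiki page. The sketch below (all source names and field choices are illustrative, not a standard schema) records each source's trust level and flags the ones that need the most scrutiny:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One entry in the data supply chain inventory (fields are illustrative)."""
    name: str
    kind: str          # e.g. "web_scrape", "licensed", "internal"
    trust_level: str   # "audited", "authenticated", or "public"
    change_logged: bool
    feeds_rag: bool = False

def untrusted_sources(inventory):
    """Sources open to public contribution, or whose changes are not logged."""
    return [s for s in inventory
            if s.trust_level == "public" or not s.change_logged]

inventory = [
    DataSource("common-crawl-slice", "web_scrape", "public", False),
    DataSource("vendor-qa-pairs", "licensed", "authenticated", True),
    DataSource("support-docs", "internal", "audited", True, feeds_rag=True),
]
flagged = untrusted_sources(inventory)
```

Anything returned by `untrusted_sources` is a candidate for the heavier pre-training checks in Step 2.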
Step 2: Pre-Training Data Inspection
Apply these checks to training and fine-tuning datasets before they reach the model.
Source verification
Verify data provenance — confirm each dataset came from its claimed source. Check cryptographic hashes against published values
Authenticate third-party data — verify provider identity and data integrity for licensed or purchased datasets
Check for unauthorized modifications — compare dataset hashes against baseline snapshots taken at collection
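Hash comparison is the simplest of these checks to automate. A minimal sketch using Python's standard `hashlib`, streaming the file so large datasets never load fully into memory:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream-hash a dataset file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_against_baseline(path: Path, expected_hex: str) -> bool:
    """Compare the current hash with the snapshot taken at collection."""
    return sha256_file(path) == expected_hex
```

Store the baseline hashes somewhere the data pipeline cannot write to; a hash check is only as trustworthy as the place the expected value comes from.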
Statistical analysis
Run outlier detection — apply dimensionality reduction (PCA, UMAP) to training data embeddings and look for anomalous clusters
Check label consistency — flag examples whose features are inconsistent with the expected feature distribution for their assigned label
Detect near-duplicates — identify clusters of suspiciously similar examples, especially if they share unusual features or unexpected labels
Profile data distributions — compare the statistical distribution of the new dataset against known-clean baselines. Significant shifts may indicate contamination
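As a deliberately simplified stand-in for full PCA/UMAP cluster inspection, per-dimension z-scores on the embedding matrix will already surface the grossest outliers. This sketch uses only the standard library; a real pipeline would work on model embeddings and a dimensionality-reduced view:

```python
import statistics

def flag_outliers(embeddings, z_threshold=3.0):
    """Flag embedding rows where any dimension's z-score exceeds the
    threshold. A crude first pass, not a replacement for cluster analysis."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    stdevs = [statistics.pstdev(d) or 1.0 for d in dims]  # guard zero-variance dims
    flagged = []
    for i, row in enumerate(embeddings):
        z = max(abs(x - m) / s for x, m, s in zip(row, means, stdevs))
        if z > z_threshold:
            flagged.append(i)
    return flagged
```

Note the limitation: a large coordinated cluster of poisoned examples shifts the mean and standard deviation toward itself, which is exactly why the guide also recommends comparing against known-clean baseline distributions.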
Content scanning
Scan for instruction-like content — flag documents containing instruction patterns ("ignore previous," "you are now") that could serve as indirect prompt injection in RAG systems
Check for known contamination markers — search for known fingerprints of AI-generated or paper-mill content (e.g., "vegetative electron microscopy," "as an AI language model")
Validate factual claims — for fine-tuning data containing factual content, spot-check claims against authoritative sources
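The first two scans above are straightforward pattern matching. A minimal sketch follows; the pattern list is illustrative and far smaller than a production list, which would be maintained and updated as new fingerprints are published:

```python
import re

# Illustrative patterns only; maintain and expand this list in practice.
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?previous (instructions|prompts)",
    r"you are now",
    r"as an ai language model",          # common AI-generated-text marker
    r"vegetative electron microscopy",   # known paper-mill fingerprint
]
_SCAN_RE = re.compile("|".join(SUSPICIOUS_PATTERNS), re.IGNORECASE)

def scan_document(text: str):
    """Return the suspicious substrings found; an empty list means clean."""
    return [m.group(0) for m in _SCAN_RE.finditer(text)]
```

Treat a hit as a signal for human review rather than automatic rejection; phrases like "you are now" occur in legitimate prose too.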
Step 3: During-Training Monitoring
If you control the training process, monitor for anomalies during training.
Track per-example loss curves — poisoned examples often converge faster than legitimate examples (the model memorizes the shortcut). Flag examples with unusually rapid loss reduction
Monitor gradient magnitudes — poisoned examples may produce unusually large or directionally unusual gradients. Log per-example gradient norms and flag outliers
Run spectral analysis — compute the covariance matrix of feature representations and check for anomalous top singular values, which can indicate a backdoor signature
Compute influence scores — for fine-tuning (where it's computationally feasible), use influence functions to identify training examples with outsized effect on specific model behaviors
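The per-example loss check is simple to bolt onto an existing training loop: log the loss for each example at each epoch, then rank by how fast the loss fell. A minimal sketch (the dictionary-of-curves interface is an assumption about how your loop logs losses):

```python
def fast_converging_examples(loss_history, top_k=3):
    """loss_history: {example_id: [loss_at_epoch_0, loss_at_epoch_1, ...]}.
    Rank examples by total loss drop; memorized poisoned shortcuts often
    sit at the top of this ranking and warrant manual inspection."""
    drops = {
        ex: losses[0] - losses[-1]
        for ex, losses in loss_history.items()
        if len(losses) >= 2
    }
    return sorted(drops, key=drops.get, reverse=True)[:top_k]
```

Per-example logging adds storage overhead at pre-training scale, so in practice this is most feasible for fine-tuning runs, as with the influence-score check above.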
Step 4: Post-Training Behavioral Testing
After training, test the model for behaviors that suggest poisoning has occurred.
Backdoor detection
Run trigger inversion — use optimization-based methods (Neural Cleanse or successors) to search for minimal input perturbations that reliably trigger specific outputs
Test with known trigger patterns — if you suspect a specific backdoor, test with the suspected trigger and verify whether the model produces the targeted behavior
Analyze activation patterns — compare internal activations on clean inputs vs suspected triggered inputs. Divergent activation patterns suggest a backdoor
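The activation-pattern check reduces to comparing hidden-layer vectors on paired inputs. A sketch of the comparison itself, assuming you can extract activation vectors from your model (the extraction hook is framework-specific and omitted here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def activation_divergence(clean_acts, suspect_acts):
    """Mean cosine similarity between paired activation vectors on clean
    inputs vs. suspected-trigger inputs. Values well below your
    clean-vs-clean baseline suggest a backdoor signature."""
    sims = [cosine(a, b) for a, b in zip(clean_acts, suspect_acts)]
    return sum(sims) / len(sims)
```

Establish the clean-vs-clean baseline first; raw activation similarity varies by layer and model, so only the gap between baseline and suspect runs is meaningful.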
Behavioral consistency testing
Test on held-out clean data — compare model performance on your clean validation set against expected benchmarks. Significant performance drops may indicate indiscriminate poisoning
Run targeted probing — test model behavior on inputs related to topics the attacker might have targeted (specific entities, factual claims, decision categories)
Check consistency across paraphrases — poisoned models may produce inconsistent outputs for semantically equivalent queries phrased differently
Compare against baseline model — if you have a known-clean model, compare outputs on a standardized test suite. Behavioral divergences on specific topics suggest targeted poisoning
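The paraphrase-consistency check is easy to script against any model endpoint. A sketch, where `model_fn` is a hypothetical query-to-answer callable standing in for your inference API, and exact string comparison stands in for a real semantic-equivalence check:

```python
def paraphrase_consistency(model_fn, paraphrase_groups):
    """model_fn: callable mapping a query string to an answer string.
    paraphrase_groups: lists of semantically equivalent queries.
    Returns the groups whose paraphrases yield differing answers."""
    inconsistent = []
    for group in paraphrase_groups:
        answers = {model_fn(q).strip().lower() for q in group}
        if len(answers) > 1:           # same question, different answers
            inconsistent.append(group)
    return inconsistent
```

For free-form generation, replace the exact-match set with a semantic comparison (e.g. embedding similarity between answers); exact matching only works for short factual responses.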
Step 5: RAG Knowledge Base Monitoring (Continuous)
RAG poisoning can occur at any time, not just during training. Monitor continuously.
Scan at ingestion — every document entering the knowledge base should be scanned for instruction-like content and known poisoning patterns before indexing
Monitor retrieval quality — track whether retrieved documents are producing unexpected model behaviors. Log which documents are retrieved for which queries
Audit knowledge base changes — log all additions, modifications, and deletions to the knowledge base with user identity and timestamp
Periodic integrity checks — re-scan the full knowledge base on a regular schedule for contamination that initial scanning may have missed
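The ingestion scan and the audit log can share one gate function. A minimal sketch, where the plain dict and list stand in for a real vector store and log sink, and the regex is a toy version of the Step 2 pattern list:

```python
import re

# Toy pattern list; reuse the maintained scanner from pre-training inspection.
SUSPICIOUS = re.compile(
    r"ignore previous|you are now|disregard .{0,20}instructions",
    re.IGNORECASE,
)

def ingest(doc_id: str, text: str, index: dict, audit_log: list) -> bool:
    """Gate a document at ingestion: scan, then index or quarantine.
    Every decision is logged so knowledge base changes stay auditable."""
    if SUSPICIOUS.search(text):
        audit_log.append(("quarantined", doc_id))
        return False
    index[doc_id] = text
    audit_log.append(("indexed", doc_id))
    return True
```

In production, the log entry would also carry user identity and timestamp, per the audit requirement above, and quarantined documents would go to a review queue rather than being silently dropped.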
Step 6: Respond to Suspected Poisoning
Confirmed or strongly suspected poisoning
Quarantine the affected data — remove suspected poisoned data from the training pipeline or knowledge base immediately
Assess impact scope — determine which models were trained on the affected data and which deployments use those models
Roll back if possible — revert to a known-clean model checkpoint if available. For RAG: revert the knowledge base to a pre-contamination snapshot
Retrain if necessary — for training data poisoning, retraining on verified clean data is the most reliable remediation
Investigate the source — determine how poisoned data entered the pipeline. Was it a compromised data source? Insider threat? Public data contamination?
Strengthen controls — based on the investigation, implement additional scanning, source verification, or access controls to prevent recurrence
Where This Guide Fits in AI Threat Response
Detection (this guide) — Has our data been poisoned? Inspect training data, monitor training, and test deployed models.
Detection methods — How does data poisoning detection work? Technical reference on statistical methods, influence analysis, and backdoor scanning.
Supply chain security — Are our data sources trustworthy? Securing the data pipeline upstream of detection.
Red teaming — Can our models be poisoned? Proactive adversarial testing of data pipeline defenses.
What This Guide Does Not Cover