Step-by-step workflow for identifying and responding to data poisoning attacks on AI training data, fine-tuning corpora, and RAG knowledge bases. Covers pre-training inspection, during-training monitoring, post-deployment detection, and remediation.
Who this is for: ML engineers, data engineers, security teams, and AI platform operators responsible for training data integrity, fine-tuning pipelines, or RAG knowledge base management.
What Data Poisoning Is and Why It Matters
Data poisoning is a supply chain attack on AI systems. Instead of attacking the model directly, the attacker manipulates the data the model learns from — inserting malicious examples that cause the model to produce incorrect outputs, exhibit biased behavior, or respond to hidden triggers (backdoors).
The threat is well documented in both security research and real-world incidents.
For the underlying science, see the Data Poisoning Detection Methods reference page.
Threat patterns this guide addresses
Step 1: Map Your Data Supply Chain
Before you can detect poisoning, understand where your data comes from and how it reaches the model:
Inventory all data sources — web scrapes, public datasets, licensed data, user-generated content, internal databases, third-party APIs
Document the pipeline — from raw collection through filtering, preprocessing, annotation, and final training/indexing
Identify trust levels — which sources are authenticated and audited? Which are open to public contribution?
Map access controls — who can modify datasets at each pipeline stage? Are changes logged?
Identify RAG knowledge bases — what documents feed into retrieval-augmented generation? How are they ingested?
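A supply chain map is easier to audit when it lives in code rather than a wiki page. The sketch below (all source names and field choices are illustrative, not a standard schema) records each source's trust level and flags the ones that need the most scrutiny:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """One entry in the data supply chain inventory (fields are illustrative)."""
    name: str
    kind: str          # e.g. "web_scrape", "licensed", "internal"
    trust_level: str   # "audited", "authenticated", or "public"
    change_logged: bool
    feeds_rag: bool = False

def untrusted_sources(inventory):
    """Sources open to public contribution, or whose changes are not logged."""
    return [s for s in inventory
            if s.trust_level == "public" or not s.change_logged]

inventory = [
    DataSource("common-crawl-slice", "web_scrape", "public", False),
    DataSource("vendor-qa-pairs", "licensed", "authenticated", True),
    DataSource("support-docs", "internal", "audited", True, feeds_rag=True),
]
flagged = untrusted_sources(inventory)
```

Anything returned by `untrusted_sources` is a candidate for the heavier pre-training checks in Step 2.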
Step 2: Pre-Training Data Inspection
Apply these checks to training and fine-tuning datasets before they reach the model.
Source verification
Verify data provenance — confirm each dataset came from its claimed source. Check cryptographic hashes against published values
Authenticate third-party data — verify provider identity and data integrity for licensed or purchased datasets
Check for unauthorized modifications — compare dataset hashes against baseline snapshots taken at collection
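Hash comparison is the simplest of these checks to automate. A minimal sketch using Python's standard `hashlib`, streaming the file so large datasets never load fully into memory:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream-hash a dataset file in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_against_baseline(path: Path, expected_hex: str) -> bool:
    """Compare the current hash with the snapshot taken at collection."""
    return sha256_file(path) == expected_hex
```

Store the baseline hashes somewhere the data pipeline cannot write to; a hash check is only as trustworthy as the place the expected value comes from.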
Statistical analysis
Run outlier detection — apply dimensionality reduction (PCA, UMAP) to training data embeddings and look for anomalous clusters
Check label consistency — flag examples whose features are inconsistent with the expected feature distribution for their assigned label
Detect near-duplicates — identify clusters of suspiciously similar examples, especially if they share unusual features or unexpected labels
Profile data distributions — compare the statistical distribution of the new dataset against known-clean baselines. Significant shifts may indicate contamination
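As a deliberately simplified stand-in for full PCA/UMAP cluster inspection, per-dimension z-scores on the embedding matrix will already surface the grossest outliers. This sketch uses only the standard library; a real pipeline would work on model embeddings and a dimensionality-reduced view:

```python
import statistics

def flag_outliers(embeddings, z_threshold=3.0):
    """Flag embedding rows where any dimension's z-score exceeds the
    threshold. A crude first pass, not a replacement for cluster analysis."""
    dims = list(zip(*embeddings))
    means = [statistics.fmean(d) for d in dims]
    stdevs = [statistics.pstdev(d) or 1.0 for d in dims]  # guard zero-variance dims
    flagged = []
    for i, row in enumerate(embeddings):
        z = max(abs(x - m) / s for x, m, s in zip(row, means, stdevs))
        if z > z_threshold:
            flagged.append(i)
    return flagged
```

Note the limitation: a large coordinated cluster of poisoned examples shifts the mean and standard deviation toward itself, which is exactly why the guide also recommends comparing against known-clean baseline distributions.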
Content scanning
Scan for instruction-like content — flag documents containing instruction patterns ("ignore previous," "you are now") that could serve as indirect prompt injection in RAG systems
Check for known contamination markers — search for known fingerprints of AI-generated or paper-mill content (e.g., "vegetative electron microscopy," "as an AI language model")
Validate factual claims — for fine-tuning data containing factual content, spot-check claims against authoritative sources
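The first two scans above are straightforward pattern matching. A minimal sketch follows; the pattern list is illustrative and far smaller than a production list, which would be maintained and updated as new fingerprints are published:

```python
import re

# Illustrative patterns only; maintain and expand this list in practice.
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?previous (instructions|prompts)",
    r"you are now",
    r"as an ai language model",          # common AI-generated-text marker
    r"vegetative electron microscopy",   # known paper-mill fingerprint
]
_SCAN_RE = re.compile("|".join(SUSPICIOUS_PATTERNS), re.IGNORECASE)

def scan_document(text: str):
    """Return the suspicious substrings found; an empty list means clean."""
    return [m.group(0) for m in _SCAN_RE.finditer(text)]
```

Treat a hit as a signal for human review rather than automatic rejection; phrases like "you are now" occur in legitimate prose too.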
Step 3: During-Training Monitoring
If you control the training process, monitor for anomalies during training.
Track per-example loss curves — poisoned examples often converge faster than legitimate examples (the model memorizes the shortcut). Flag examples with unusually rapid loss reduction
Monitor gradient magnitudes — poisoned examples may produce unusually large or directionally unusual gradients. Log per-example gradient norms and flag outliers
Run spectral analysis — compute the covariance matrix of feature representations and check for anomalous top singular values, which can indicate a backdoor signature
Compute influence scores — for fine-tuning (where it's computationally feasible), use influence functions to identify training examples with outsized effect on specific model behaviors
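The per-example loss check is simple to bolt onto an existing training loop: log the loss for each example at each epoch, then rank by how fast the loss fell. A minimal sketch (the dictionary-of-curves interface is an assumption about how your loop logs losses):

```python
def fast_converging_examples(loss_history, top_k=3):
    """loss_history: {example_id: [loss_at_epoch_0, loss_at_epoch_1, ...]}.
    Rank examples by total loss drop; memorized poisoned shortcuts often
    sit at the top of this ranking and warrant manual inspection."""
    drops = {
        ex: losses[0] - losses[-1]
        for ex, losses in loss_history.items()
        if len(losses) >= 2
    }
    return sorted(drops, key=drops.get, reverse=True)[:top_k]
```

Per-example logging adds storage overhead at pre-training scale, so in practice this is most feasible for fine-tuning runs, as with the influence-score check above.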
Step 4: Post-Training Behavioral Testing
After training, test the model for behaviors that suggest poisoning has occurred.
Backdoor detection
Run trigger inversion — use optimization-based methods (Neural Cleanse or successors) to search for minimal input perturbations that reliably trigger specific outputs
Test with known trigger patterns — if you suspect a specific backdoor, test with the suspected trigger and verify whether the model produces the targeted behavior
Analyze activation patterns — compare internal activations on clean inputs vs suspected triggered inputs. Divergent activation patterns suggest a backdoor
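The activation-pattern check reduces to comparing hidden-layer vectors on paired inputs. A sketch of the comparison itself, assuming you can extract activation vectors from your model (the extraction hook is framework-specific and omitted here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def activation_divergence(clean_acts, suspect_acts):
    """Mean cosine similarity between paired activation vectors on clean
    inputs vs. suspected-trigger inputs. Values well below your
    clean-vs-clean baseline suggest a backdoor signature."""
    sims = [cosine(a, b) for a, b in zip(clean_acts, suspect_acts)]
    return sum(sims) / len(sims)
```

Establish the clean-vs-clean baseline first; raw activation similarity varies by layer and model, so only the gap between baseline and suspect runs is meaningful.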
Behavioral consistency testing
Test on held-out clean data — compare model performance on your clean validation set against expected benchmarks. Significant performance drops may indicate indiscriminate poisoning
Run targeted probing — test model behavior on inputs related to topics the attacker might have targeted (specific entities, factual claims, decision categories)
Check consistency across paraphrases — poisoned models may produce inconsistent outputs for semantically equivalent queries phrased differently
Compare against baseline model — if you have a known-clean model, compare outputs on a standardized test suite. Behavioral divergences on specific topics suggest targeted poisoning
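The paraphrase-consistency check is easy to script against any model endpoint. A sketch, where `model_fn` is a hypothetical query-to-answer callable standing in for your inference API, and exact string comparison stands in for a real semantic-equivalence check:

```python
def paraphrase_consistency(model_fn, paraphrase_groups):
    """model_fn: callable mapping a query string to an answer string.
    paraphrase_groups: lists of semantically equivalent queries.
    Returns the groups whose paraphrases yield differing answers."""
    inconsistent = []
    for group in paraphrase_groups:
        answers = {model_fn(q).strip().lower() for q in group}
        if len(answers) > 1:           # same question, different answers
            inconsistent.append(group)
    return inconsistent
```

For free-form generation, replace the exact-match set with a semantic comparison (e.g. embedding similarity between answers); exact matching only works for short factual responses.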
Step 5: RAG Knowledge Base Monitoring (Continuous)
RAG poisoning can occur at any time, not just during training. Monitor continuously.
Scan at ingestion — every document entering the knowledge base should be scanned for instruction-like content and known poisoning patterns before indexing
Monitor retrieval quality — track whether retrieved documents are producing unexpected model behaviors. Log which documents are retrieved for which queries
Audit knowledge base changes — log all additions, modifications, and deletions to the knowledge base with user identity and timestamp
Periodic integrity checks — re-scan the full knowledge base on a regular schedule for contamination that initial scanning may have missed
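The ingestion scan and the audit log can share one gate function. A minimal sketch, where the plain dict and list stand in for a real vector store and log sink, and the regex is a toy version of the Step 2 pattern list:

```python
import re

# Toy pattern list; reuse the maintained scanner from pre-training inspection.
SUSPICIOUS = re.compile(
    r"ignore previous|you are now|disregard .{0,20}instructions",
    re.IGNORECASE,
)

def ingest(doc_id: str, text: str, index: dict, audit_log: list) -> bool:
    """Gate a document at ingestion: scan, then index or quarantine.
    Every decision is logged so knowledge base changes stay auditable."""
    if SUSPICIOUS.search(text):
        audit_log.append(("quarantined", doc_id))
        return False
    index[doc_id] = text
    audit_log.append(("indexed", doc_id))
    return True
```

In production, the log entry would also carry user identity and timestamp, per the audit requirement above, and quarantined documents would go to a review queue rather than being silently dropped.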
Step 6: Respond to Suspected Poisoning
Confirmed or strongly suspected poisoning
Quarantine the affected data — remove suspected poisoned data from the training pipeline or knowledge base immediately
Assess impact scope — determine which models were trained on the affected data and which deployments use those models
Roll back if possible — revert to a known-clean model checkpoint if available. For RAG: revert the knowledge base to a pre-contamination snapshot
Retrain if necessary — for training data poisoning, retraining on verified clean data is the most reliable remediation
Investigate the source — determine how poisoned data entered the pipeline. Was it a compromised data source? Insider threat? Public data contamination?
Strengthen controls — based on the investigation, implement additional scanning, source verification, or access controls to prevent recurrence
Where This Guide Fits in AI Threat Response
Detection (this guide) — Has our data been poisoned? Inspect training data, monitor training, and test deployed models.
Detection methods — How does data poisoning detection work? Technical reference on statistical methods, influence analysis, and backdoor scanning.
Supply chain security — Are our data sources trustworthy? Securing the data pipeline upstream of detection.
Red teaming — Can our models be poisoned? Proactive adversarial testing of data pipeline defenses.
What This Guide Does Not Cover