AI Safety Tools & Testing Directory
A curated directory of AI safety tools and testing utilities organized by use case: incident databases, risk frameworks, red teaming tools, bias evaluation tools, monitoring tools, and research organizations—with guidance on what each is best for.
Last updated: 2026-03-16
Who this is for: Security engineers, risk officers, researchers, and practitioners who need to identify the right tool or testing utility for a specific AI safety task. Entries are organized by use case, not by prominence.
AI safety tools fall into six categories: (1) incident databases that document real-world AI harms, (2) risk frameworks and standards that define what safe AI requires, (3) red teaming and adversarial testing tools, (4) bias and fairness evaluation tools, (5) runtime monitoring and detection tools, and (6) research organizations producing the underlying science. This directory covers the most widely used tools in each category, with notes on what each is best for and where its limitations lie.
Incident Databases
Incident databases document real-world cases where AI systems caused or contributed to harm. They are the primary resource for understanding what AI failures look like in practice.
TopAIThreats
What it is: A structured incident database that classifies AI harms by threat domain, threat pattern, and causal factor simultaneously—making it the only public database optimized for root cause analysis alongside incident lookup.
What it covers: Incidents organized first by threat domain (Information Integrity, Security & Cyber, Privacy, Discrimination, and others), then by threat pattern (prompt injection, model exploitation, automation bias, etc.), with causal factor attribution. Covers security, safety, and systemic incidents. Evidence standards are documented in the methodology.
Best for: Practitioners investigating why an incident happened, not just what happened. Mapping incidents to the full threat landscape and identifying systemic patterns across domains.
Coverage: Ongoing. See how TopAIThreats compares to AIID and MITRE ATLAS →
AI Incident Database (AIID)
What it covers: The AIID (incidentdatabase.ai) is a community-maintained database of AI harms in deployment, maintained by the Partnership on AI. It accepts incident submissions from any contributor and covers a broad range of harm types across many industries.
Best for: Breadth of coverage and historical record. The AIID has the largest incident count of any public database. Useful for policy research, identifying patterns across industries, and filing incident reports.
Coverage: 700+ incidents as of 2026. Submission process documented at incidentdatabase.ai/cite.
MITRE ATLAS
What it covers: MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) documents adversarial machine learning tactics, techniques, and procedures (TTPs) used against AI systems—analogous to MITRE ATT&CK for traditional cybersecurity. Organized as a matrix of attack categories (reconnaissance, initial access, persistence, impact) with AI-specific techniques under each.
Best for: Threat modeling AI systems from an adversarial ML perspective. Mapping red team findings to a standard taxonomy. Understanding attack chains in AI-targeted campaigns.
Coverage: Primarily adversarial ML and model exploitation techniques; less coverage of societal harm types.
OECD AI Incidents Monitor
What it covers: The OECD’s AI Incidents Monitor (oecd.ai) tracks AI incidents with a focus on policy and regulatory relevance, particularly incidents relevant to the EU AI Act, GDPR, and other regulatory frameworks.
Best for: Compliance and policy teams who need incident precedents relevant to specific regulatory requirements.
Risk Frameworks and Standards
Risk frameworks define what safe, responsible AI requires and provide structured approaches for assessing and managing AI risk.
NIST AI Risk Management Framework (AI RMF)
What it covers: Voluntary framework for managing AI risks across four functions: Govern (establish accountability), Map (identify risks), Measure (evaluate risks), and Manage (prioritize and address risks). The most widely referenced AI risk framework in US contexts.
Best for: Organizations building internal AI governance programs. The RMF is flexible and non-prescriptive, making it applicable across industries and AI types.
Limitation: Voluntary and non-prescriptive; provides structure but not specific technical requirements.
Learn more → NIST AI RMF framework page
ISO/IEC 42001 — AI Management System
What it covers: International standard for AI management systems. Specifies requirements for establishing, implementing, maintaining, and improving an AI management system—analogous to ISO 27001 for information security.
Best for: Organizations seeking certifiable AI governance compliance. Certification demonstrates to customers, regulators, and partners that AI risk management meets an audited standard.
Limitation: Certification requires external audit; process-oriented rather than technically prescriptive.
Learn more → ISO 42001 framework page
OWASP Top 10 for LLM Applications
What it covers: The OWASP LLM Top 10 identifies the ten most critical security risks in LLM applications—from prompt injection (LLM01) to unbounded consumption (LLM10). Each entry includes a description, attack scenarios, and mitigation guidance. Updated regularly as the threat landscape evolves.
Best for: Application security teams building or reviewing LLM applications. The closest thing to a standard security checklist for LLM application security.
Limitation: Focused on application-layer security risks; does not cover model-level safety, bias, or societal harm types.
Learn more → OWASP LLM Top 10 mapping
EU AI Act
What it covers: Binding EU regulation with obligations scaled by risk tier. High-risk AI systems (Annex III) face mandatory requirements including risk management systems, data governance, human oversight, and conformity assessment.
Best for: Organizations deploying AI in the EU or serving EU users. Increasingly used as a reference standard outside the EU.
Limitation: Compliance obligations apply to systems deployed in the EU; enforcement is still maturing as implementing acts are published.
Learn more → EU AI Act framework page
Red Teaming and Adversarial Testing Tools
Red teaming tools automate adversarial testing of AI systems. See AI Red Teaming for guidance on combining tools with manual testing.
Microsoft PyRIT
What it covers: PyRIT (Python Risk Identification Toolkit) is an open-source framework from Microsoft for automating red team exercises against LLMs. It orchestrates multi-turn attack strategies, supports custom attack plugins, and integrates with Azure AI services.
Best for: Continuous regression testing in CI/CD pipelines; enterprise LLM applications on Azure. Integrates with Azure AI services and supports custom attack plugins.
Get it: github.com/Azure/PyRIT
Garak
What it covers: Garak is an open-source LLM vulnerability scanner with 40+ probe categories covering hallucination, toxicity, prompt injection, malware generation, jailbreaks, and more. Produces structured JSON reports.
Best for: Broad baseline scan of a new model or application before manual deep dive. Fast to run and model-agnostic.
Get it: github.com/leondz/garak
PAIR (Prompt Automatic Iterative Refinement)
What it covers: An academic algorithm that uses a secondary “attacker” LLM to iteratively refine jailbreak prompts against a target model until a bypass succeeds. Produces highly optimized attack prompts for specific models and targets.
Best for: Generating targeted jailbreaks for specific high-risk output categories where generic probes do not succeed. More compute-intensive than Garak or PyRIT.
Source: arxiv.org/abs/2310.03684
Promptbench
What it covers: Promptbench is a robustness evaluation framework from Microsoft Research that tests LLM performance and stability against adversarial prompt perturbations—character substitutions, word-level modifications, sentence-level paraphrasing, and semantic-equivalent variations.
Best for: Evaluating how sensitive a model is to minor input variations. Important for safety-critical applications where prompt stability is required.
Get it: github.com/microsoft/promptbench
Bias and Fairness Evaluation Tools
See How to Assess AI Threat Risk for guidance on selecting and applying these tools.
IBM AI Fairness 360
What it covers: A comprehensive Python toolkit from IBM Research providing 70+ fairness metrics and 10+ bias mitigation algorithms (pre-processing, in-processing, and post-processing). Supports classification, regression, and NLP tasks.
Best for: Teams that need a comprehensive fairness metric suite and want to compare multiple criteria simultaneously across classification, regression, and NLP tasks.
Get it: github.com/Trusted-AI/AIF360
Microsoft Fairlearn
What it covers: A Python toolkit from Microsoft that integrates with scikit-learn, supporting fairness assessment and fairness-constrained model training. Includes a dashboard for comparing fairness metrics across models.
Best for: Teams already using scikit-learn who want to add fairness constraints to the model training pipeline. More opinionated than AI Fairness 360; easier to integrate into existing ML workflows.
Get it: github.com/fairlearn/fairlearn
Google What-If Tool
What it covers: An interactive visual tool for exploring ML model behavior, comparing model performance across demographic groups, and running counterfactual analysis without writing code. Integrates with TensorFlow/Vertex AI.
Best for: Non-ML-expert stakeholders (product managers, risk officers) who need to explore model fairness without writing code. Good for demonstrating bias findings to non-technical audiences.
Get it: pair-code.github.io/what-if-tool
Runtime Monitoring and Detection Tools
Lakera Guard
What it covers: A real-time API service that classifies LLM inputs and outputs for prompt injection attempts, harmful content, PII, and policy violations before they reach or leave the model. Drop-in integration for LLM applications.
Best for: Production LLM applications that need real-time injection detection without building custom classifiers. Low integration overhead.
LLM Guard
What it covers: An open-source toolkit (protectai/llm-guard) providing scanners for prompt injection, sensitive data detection, toxicity, and output validation. Self-hosted; does not require sending data to a third-party API.
Best for: Organizations that cannot send production data to external APIs for compliance reasons. The open-source alternative to commercial monitoring services.
Get it: github.com/protectai/llm-guard
Rebuff
What it covers: An open-source prompt injection detection tool that uses a combination of heuristics, LLM-based detection, and a shared attack database to identify injection attempts.
Best for: Teams building custom injection detection pipelines who want a starting point they can extend and tune.
AI Safety Research Organizations
Anthropic
Focus: Mechanistic interpretability, constitutional AI, responsible scaling policies. Anthropic’s research on AI alignment and interpretability is among the most technically rigorous available. The Responsible Scaling Policy is a public model for staged deployment commitments.
Google DeepMind Safety
Focus: Reward modeling, scalable oversight, specification gaming, and multi-agent safety. DeepMind publishes extensively on the fundamental problems of AI alignment and robustness.
ARC Evals (Alignment Research Center)
Focus: Evaluating whether AI models have dangerous capabilities—specifically, capabilities that could pose catastrophic risks (autonomous replication, persuasion at scale, cyberoffense). Conducts third-party evaluations for model developers.
Center for AI Safety (CAIS)
Focus: AI safety research and policy advocacy. Publishes risk assessments, coordinates safety research across organizations, and produces accessible summaries of the AI safety research landscape.
MIRI (Machine Intelligence Research Institute)
Focus: Mathematical foundations of AI alignment. MIRI’s research focuses on the long-term theoretical foundations of building AI systems whose goals remain aligned with human intentions.
How to Report an AI Incident
If you have discovered or experienced an AI incident, the following reporting channels are available:
TopAIThreats: Submit at /contributing/ with a description, evidence links, and causal factor mapping. Incidents undergo editorial review before publication.
AIID: Submit at incidentdatabase.ai/cite with a description, the AI system involved, affected parties, and source links.
EU AI Act (serious incidents): Report to your national competent authority if the incident involves a high-risk AI system and meets the Article 62 serious incident criteria. Contact list for national authorities available at digital-strategy.ec.europa.eu.
Platform-level reporting: For AI-generated harmful content on social media or other platforms, use the platform’s synthetic media or policy violation reporting mechanism.
Regulatory reporting: For incidents involving personal data breaches, GDPR Article 33 (72-hour supervisory authority notification) may apply independently of AI-specific obligations. See AI Incident Response Plan for full reporting guidance.
Related Resources
- AI Incident Response Plan — what to do when an incident occurs
- AI Red Teaming — how to use the testing tools listed above
- How to Assess AI Threat Risk — how to use the bias evaluation tools
- TopAIThreats vs AIID vs MITRE ATLAS — comparison of the three main incident databases