Red Teaming

Definition

Red teaming in the context of AI refers to the structured, adversarial evaluation of AI systems by dedicated teams that attempt to identify vulnerabilities, safety failures, harmful capabilities, and unintended behaviours before or during deployment. Adapted from military and cybersecurity practice, AI red teaming involves systematically probing models through prompt injection, jailbreak attempts, capability elicitation, and scenario-based testing designed to surface risks that standard evaluation benchmarks may miss. Red teaming can be performed by internal safety teams, external security researchers, or domain experts with relevant threat knowledge. It is increasingly recognised as an essential component of responsible AI development, complementing automated evaluation with human-driven adversarial creativity.

How It Relates to AI Threats

Red teaming is a critical governance practice within the Security & Cyber domain, where it serves as a primary method for discovering prompt injection vulnerabilities, jailbreak pathways, and adversarial evasion techniques before they are exploited in the wild. Within the Systemic & Catastrophic domain, red teaming of frontier models is used to assess dangerous capabilities — including biological, chemical, or cyber-offensive knowledge — that could pose societal-scale risks. The absence or inadequacy of red teaming prior to deployment has been identified as a contributing factor in multiple AI safety incidents.

Why It Occurs

Standard evaluation benchmarks measure average-case performance but do not capture adversarial or worst-case behaviour
The combinatorial complexity of natural language inputs makes exhaustive testing impossible, requiring targeted adversarial approaches
AI systems exhibit emergent behaviours at scale that developers did not explicitly train for and cannot predict through code review alone
Regulatory frameworks including the EU AI Act and the White House Executive Order on AI Safety increasingly require adversarial testing for high-risk and frontier models
The rapid pace of capability development demands continuous evaluation rather than one-time pre-deployment assessment

Real-World Context

Red teaming has become standard practice among frontier AI developers, with organisations including Anthropic, OpenAI, Google DeepMind, and Meta conducting structured adversarial evaluations prior to model releases. The White House Executive Order on Safe, Secure, and Trustworthy AI (October 2023) established expectations for red teaming of frontier models. NIST’s AI Risk Management Framework includes adversarial testing as a core governance function. Independent red teaming initiatives, such as DEF CON’s AI Village and academic adversarial evaluation programmes, have demonstrated that external adversarial evaluation consistently identifies vulnerabilities that internal testing misses. The practice is recognised as necessary but not sufficient — red teaming reduces risk but cannot guarantee the absence of exploitable vulnerabilities.

Definition

How It Relates to AI Threats

Why It Occurs

Real-World Context

Related Threat Patterns

Related Terms