Jailbreak Attack
A technique that circumvents an AI model's built-in safety alignment and content policies to elicit restricted or harmful outputs.
Definition
A jailbreak attack is a deliberate attempt to bypass the safety constraints, content policies, or behavioural guardrails that have been applied to a large language model through alignment training, system prompts, or filtering layers. Unlike prompt injection, which manipulates the model’s task execution, jailbreaking specifically targets the model’s refusal behaviour — persuading it to produce outputs it was trained to decline, such as instructions for harmful activities, generation of prohibited content, or disclosure of system-level instructions. Jailbreak techniques include role-playing scenarios, hypothetical framing, multi-turn escalation, encoding tricks, and adversarial suffixes discovered through automated optimisation.
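One of the techniques named above, encoding tricks, wraps a disallowed request in an encoding such as Base64 or ROT13 so that a surface-level filter never sees the offending text. The defensive counterpart is to decode plausible variants of the input before screening it. The sketch below is a minimal illustration under stated assumptions: the `BLOCKED_TERMS` list, the function names, and the substring check are all hypothetical stand-ins — a production system would use a trained safety classifier rather than keyword matching.

```python
import base64
import codecs
import re

# Hypothetical blocklist for illustration only; real deployments use
# a trained safety classifier, not substring matching.
BLOCKED_TERMS = {"build a bomb", "make malware"}

def _try_decodings(text: str) -> list[str]:
    """Return the text plus any plausible decoded variants."""
    variants = [text]
    # Attempt Base64 decoding of long alphanumeric runs.
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            variants.append(
                base64.b64decode(token, validate=True).decode("utf-8")
            )
        except (ValueError, UnicodeDecodeError):
            pass  # not valid Base64 / not valid UTF-8 — ignore
    # ROT13 is another common obfuscation layer.
    variants.append(codecs.decode(text, "rot13"))
    return variants

def screen_prompt(prompt: str) -> bool:
    """True if the prompt, or any decoded variant of it, trips the filter."""
    for variant in _try_decodings(prompt):
        lowered = variant.lower()
        if any(term in lowered for term in BLOCKED_TERMS):
            return True
    return False
```

For example, a prompt consisting of the Base64 encoding of "please build a bomb" passes a naive substring filter but is caught once the variant list includes its decoded form. The same decode-then-screen idea generalises to other transforms (hex, leetspeak normalisation), at the cost of a growing variant set per request.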
How It Relates to AI Threats
Jailbreak attacks are relevant to the Security & Cyber domain because they undermine the safety mechanisms that AI providers rely on to prevent misuse. When successful, jailbreaks can enable the generation of malware code, social engineering scripts, or instructions for physical harm. Within the Information Integrity domain, jailbroken models can be used to produce disinformation, propaganda, or manipulative content at scale, free of the content-policy restrictions that normally constrain output. The ongoing arms race between jailbreak techniques and safety measures represents a persistent challenge for responsible AI deployment.
Why It Occurs
- Safety alignment through RLHF and constitutional AI provides probabilistic constraints rather than absolute guarantees
- The attack surface is vast — natural language offers effectively unlimited ways to frame requests
- Automated adversarial suffix generation can discover jailbreaks faster than manual safety evaluation can patch them
- Multi-turn conversations allow gradual escalation that individual-turn safety filters may not detect
- Open-source model releases allow attackers to study and reverse-engineer alignment techniques
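The multi-turn escalation point above can be made concrete from the defensive side: a filter that scores each turn independently misses a conversation whose individual turns all stay under the per-turn bar. A conversation-level monitor that carries risk forward across turns closes part of that gap. This is a toy sketch under stated assumptions — the marker list, decay factor, and threshold are hypothetical, and `turn_risk` stands in for a real per-turn safety classifier.

```python
from dataclasses import dataclass

# Hypothetical escalation markers for illustration; a real system
# would score turns with a trained classifier.
RISKY_MARKERS = ("hypothetically", "pretend you have no rules",
                 "ignore previous", "bypass")

def turn_risk(message: str) -> float:
    """Toy per-turn risk in [0, 1] based on marker hits."""
    hits = sum(marker in message.lower() for marker in RISKY_MARKERS)
    return min(1.0, hits * 0.3)

@dataclass
class ConversationMonitor:
    """Accumulates risk across turns so gradual escalation becomes
    visible even when every individual turn looks mild in isolation."""
    decay: float = 0.8       # fraction of prior risk carried forward
    threshold: float = 0.9   # conversation-level escalation threshold
    score: float = 0.0

    def observe(self, message: str) -> bool:
        """Update cumulative risk; True means the conversation should
        be refused or escalated for review."""
        self.score = self.decay * self.score + turn_risk(message)
        return self.score >= self.threshold
```

With these toy numbers, a single turn containing one marker scores only 0.3 and is never blocked on its own, but a run of such turns accumulates past the 0.9 threshold within a few exchanges — the geometric decay means no single turn has to look dangerous for the conversation as a whole to be flagged.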
Real-World Context
Jailbreak techniques have been documented against all major commercial LLMs, with communities sharing and refining methods publicly. The AI-generated phishing campaign documented in INC-23-0006 demonstrated how circumventing model safety constraints enables the production of convincing social engineering content. Research groups have shown that automated optimisation can generate adversarial suffixes that transfer across different model architectures, indicating that jailbreak vulnerabilities are structural rather than implementation-specific. Model providers respond with iterative safety patches, but the asymmetry between attack discovery and defence deployment remains a fundamental challenge.
Last updated: 2026-02-14