Jailbreak Attack
A technique that circumvents an AI model's built-in safety alignment and content policies to elicit restricted or harmful outputs.
Definition
A jailbreak attack is a deliberate attempt to bypass the safety constraints, content policies, or behavioural guardrails that have been applied to a large language model through alignment training, system prompts, or filtering layers. Unlike prompt injection, which manipulates the model’s task execution, jailbreaking specifically targets the model’s refusal behaviour — persuading it to produce outputs it was trained to decline, such as instructions for harmful activities, generation of prohibited content, or disclosure of system-level instructions. Jailbreak techniques include role-playing scenarios, hypothetical framing, multi-turn escalation, encoding tricks, and adversarial suffixes discovered through automated optimisation.
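One of the techniques named above, encoding tricks, wraps a disallowed request in an encoding such as Base64 or ROT13 so that a surface-level filter never sees the offending text. The defensive counterpart is to decode plausible variants of the input before screening it. The sketch below is a minimal illustration under stated assumptions: the `BLOCKED_TERMS` list, the function names, and the substring check are all hypothetical stand-ins — a production system would use a trained safety classifier rather than keyword matching.

```python
import base64
import codecs
import re

# Hypothetical blocklist for illustration only; real deployments use
# a trained safety classifier, not substring matching.
BLOCKED_TERMS = {"build a bomb", "make malware"}

def _try_decodings(text: str) -> list[str]:
    """Return the text plus any plausible decoded variants."""
    variants = [text]
    # Attempt Base64 decoding of long alphanumeric runs.
    for token in re.findall(r"[A-Za-z0-9+/=]{16,}", text):
        try:
            variants.append(
                base64.b64decode(token, validate=True).decode("utf-8")
            )
        except (ValueError, UnicodeDecodeError):
            pass  # not valid Base64 / not valid UTF-8 — ignore
    # ROT13 is another common obfuscation layer.
    variants.append(codecs.decode(text, "rot13"))
    return variants

def screen_prompt(prompt: str) -> bool:
    """True if the prompt, or any decoded variant of it, trips the filter."""
    for variant in _try_decodings(prompt):
        lowered = variant.lower()
        if any(term in lowered for term in BLOCKED_TERMS):
            return True
    return False
```

For example, a prompt consisting of the Base64 encoding of "please build a bomb" passes a naive substring filter but is caught once the variant list includes its decoded form. The same decode-then-screen idea generalises to other transforms (hex, leetspeak normalisation), at the cost of a growing variant set per request.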
How It Relates to AI Threats
Jailbreak attacks are relevant to the Security & Cyber domain because they undermine the safety mechanisms that AI providers rely on to prevent misuse. When successful, jailbreaks can enable the generation of malware code, social engineering scripts, or instructions for physical harm. Within the Information Integrity domain, jailbroken models can be used to produce disinformation, propaganda, or manipulative content at scale, free of the content-policy restrictions that normally constrain output. The ongoing arms race between jailbreak techniques and safety measures represents a persistent challenge for responsible AI deployment.
Why It Occurs
- Safety alignment through RLHF and constitutional AI provides probabilistic constraints rather than absolute guarantees
- The attack surface is vast — natural language offers effectively unlimited ways to frame requests
- Automated adversarial suffix generation can discover jailbreaks faster than manual safety evaluation can patch them
- Multi-turn conversations allow gradual escalation that individual-turn safety filters may not detect
- Open-source model releases allow attackers to study and reverse-engineer alignment techniques
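The multi-turn escalation point above can be made concrete from the defensive side: a filter that scores each turn independently misses a conversation whose individual turns all stay under the per-turn bar. A conversation-level monitor that carries risk forward across turns closes part of that gap. This is a toy sketch under stated assumptions — the marker list, decay factor, and threshold are hypothetical, and `turn_risk` stands in for a real per-turn safety classifier.

```python
from dataclasses import dataclass

# Hypothetical escalation markers for illustration; a real system
# would score turns with a trained classifier.
RISKY_MARKERS = ("hypothetically", "pretend you have no rules",
                 "ignore previous", "bypass")

def turn_risk(message: str) -> float:
    """Toy per-turn risk in [0, 1] based on marker hits."""
    hits = sum(marker in message.lower() for marker in RISKY_MARKERS)
    return min(1.0, hits * 0.3)

@dataclass
class ConversationMonitor:
    """Accumulates risk across turns so gradual escalation becomes
    visible even when every individual turn looks mild in isolation."""
    decay: float = 0.8       # fraction of prior risk carried forward
    threshold: float = 0.9   # conversation-level escalation threshold
    score: float = 0.0

    def observe(self, message: str) -> bool:
        """Update cumulative risk; True means the conversation should
        be refused or escalated for review."""
        self.score = self.decay * self.score + turn_risk(message)
        return self.score >= self.threshold
```

With these toy numbers, a single turn containing one marker scores only 0.3 and is never blocked on its own, but a run of such turns accumulates past the 0.9 threshold within a few exchanges — the geometric decay means no single turn has to look dangerous for the conversation as a whole to be flagged.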
Real-World Context
Jailbreak techniques have been documented against all major commercial LLMs, with communities sharing and refining methods publicly. The AI-generated phishing campaign documented in INC-23-0006 demonstrated how circumventing model safety constraints enables the production of convincing social engineering content. Research groups have shown that automated optimisation can generate adversarial suffixes that transfer across different model architectures, indicating that jailbreak vulnerabilities are structural rather than implementation-specific. Model providers respond with iterative safety patches, but the asymmetry between attack discovery and defence deployment remains a fundamental challenge.
Last updated: 2026-02-14