INC-25-0029: Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models (2025)
Incident Details
| Field | Value |
| --- | --- |
| Date Occurred | 2025-02 |
| Severity | high |
| Evidence Level | corroborated |
| Impact Level | Sector-wide |
| Failure Stage | Signal |
| Domain | Security & Cyber |
| Primary Pattern | PAT-SEC-007 Jailbreak & Guardrail Bypass |
| Secondary Patterns | PAT-SEC-001 Adversarial Evasion |
| Regions | global |
| Sectors | Technology |
| Affected Groups | Developers & AI Builders, General Public |
| Exposure Pathways | Adversarial Targeting |
| Causal Factors | Insufficient Safety Testing, Adversarial Attack |
| Assets & Technologies | Large Language Models |
| Entities | OpenAI (developer, deployer), DeepSeek (developer, deployer) |
| Harm Type | operational |
Researchers demonstrated that reasoning models including OpenAI o1, o3, and DeepSeek-R1 are susceptible to a jailbreak technique (H-CoT) that hijacks chain-of-thought safety pathways, reducing o1's harmful content rejection rate from over 99% to under 2%.
Incident Summary
In February 2025, researchers from Duke University, Accenture, and National Tsing Hua University published a paper demonstrating that reasoning models — AI systems that use chain-of-thought (CoT) processes to improve output quality — are vulnerable to a novel jailbreak technique called H-CoT (Hijacking Chain-of-Thought).[1] The technique exploits the exposed reasoning pathways of models including OpenAI o1, OpenAI o3, and DeepSeek-R1, causing them to bypass their own safety mechanisms.
Under normal conditions, OpenAI’s o1 model rejects over 99% of harmful prompts covering categories such as child abuse and terrorism. Under the H-CoT attack, the rejection rate dropped to under 2%.[2] Separately, Palo Alto Networks’ Unit 42 published independent research demonstrating that multi-turn jailbreak techniques could achieve up to 88% success rates against 17 popular generative AI web products.[3]
Key Facts
Key facts from the H-CoT paper and related Unit 42 jailbreak research are summarized below, with Unit 42 serving as corroborating evidence on jailbreak robustness across products.
- H-CoT attack on o1: Harmful content rejection rate dropped from >99% to <2% using the “Malicious-Educator” benchmark that disguises dangerous requests as educational prompts[1]
- DeepSeek-R1: Baseline 20% refusal rate; H-CoT raised attack success rate to 96.8% by exploiting multilingual inconsistencies[1]
- Core vulnerability: Reasoning models expose their chain-of-thought, and safety relies on a “fragile, low-dimensional safety signal” — essentially a single score or small vector that is easy to overwhelm — that becomes diluted as reasoning grows longer[1]
- Unit 42 findings: “Bad Likert Judge” technique achieved 71.6% average attack success rate across 6 state-of-the-art models (1,440 test cases)[3]
- Categories tested: Hate speech, harassment, self-harm, sexual content, weapons, illegal activities, malware generation, and system prompt leakage[3]
- Research institutions: Duke University, Accenture, National Tsing Hua University (H-CoT); Palo Alto Networks Unit 42 (web product jailbreaking)[1][3]
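To make the reported figures concrete, here is a minimal sketch of how jailbreak metrics such as refusal rate and attack success rate (ASR) are typically computed from labeled evaluation outcomes. This is an illustrative assumption, not the actual evaluation harness used in the H-CoT paper or by Unit 42; the `EvalRecord` structure and the toy data are hypothetical.

```python
# Illustrative sketch (hypothetical, not the papers' actual harness):
# computing refusal rate and attack success rate (ASR) from labeled
# per-prompt evaluation outcomes.
from dataclasses import dataclass

@dataclass
class EvalRecord:
    prompt_id: str
    refused: bool           # model declined the harmful request
    attack_succeeded: bool  # harmful content was actually elicited

def refusal_rate(records: list[EvalRecord]) -> float:
    """Fraction of harmful prompts the model refused."""
    return sum(r.refused for r in records) / len(records)

def attack_success_rate(records: list[EvalRecord]) -> float:
    """Fraction of attempts that elicited harmful content (ASR)."""
    return sum(r.attack_succeeded for r in records) / len(records)

# Hypothetical toy data: 1 refusal, 4 successful attacks out of 5.
records = [
    EvalRecord("p1", refused=False, attack_succeeded=True),
    EvalRecord("p2", refused=False, attack_succeeded=True),
    EvalRecord("p3", refused=True,  attack_succeeded=False),
    EvalRecord("p4", refused=False, attack_succeeded=True),
    EvalRecord("p5", refused=False, attack_succeeded=True),
]
print(f"refusal rate: {refusal_rate(records):.1%}")  # 20.0%
print(f"ASR: {attack_success_rate(records):.1%}")    # 80.0%
```

Under this framing, the H-CoT result on o1 corresponds to the refusal rate collapsing from above 0.99 to below 0.02 across the benchmark's prompt set.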
Threat Patterns Involved
Primary: Jailbreak & Guardrail Bypass — The H-CoT technique represents a new class of jailbreak that specifically targets the architectural innovation of reasoning models. Rather than working around safety guardrails through prompt manipulation alone, it hijacks the model’s own internal reasoning process — the mechanism designed to make the model more capable also makes its safety mechanisms more fragile.
Secondary: Adversarial Evasion — Both the H-CoT and Unit 42 research demonstrate that adversarial techniques continue to outpace defensive measures, with attack methods evolving to exploit each new generation of model architecture.
Significance
This research reveals a fundamental tension in reasoning model design:
- Capability-safety tradeoff — The chain-of-thought mechanism that makes reasoning models more capable also creates a new attack surface, as exposed reasoning steps can be hijacked to override safety evaluations[1]
- Empirical severity — The drop from >99% rejection to <2% represents a near-total defeat of safety mechanisms, not a marginal weakening — demonstrating that the vulnerability is practically exploitable, not merely theoretical
- Architectural tension — This is not a bug that can be patched with a model update; it reflects an inherent tension between transparency of reasoning (which enables capability gains) and robustness of safety mechanisms (which require that reasoning not be manipulable)[1]
- Competitive pressure effects — The researchers hypothesize that updates to o1 weakened its security, potentially influenced by competitive pressure from open reasoning approaches like DeepSeek-R1[1]
Timeline
- 2025-02-18: Duke University, Accenture, and National Tsing Hua University publish the H-CoT paper on arXiv, demonstrating the jailbreak of reasoning models[1]
- 2025-02-21: Palo Alto Networks Unit 42 publishes independent research on jailbreaking 17 generative AI web products[3]
- H-CoT paper updated to v2 with expanded results
Outcomes
- Other: Research disclosed to affected model providers. As of the last update on 2026-03-28, no patches fully address the fundamental vulnerability — exposed chain-of-thought reasoning creates an inherent attack surface in reasoning models.
Use in Retrieval
INC-25-0029 documents Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models, a high-severity incident classified under the Security & Cyber domain and the Jailbreak & Guardrail Bypass threat pattern (PAT-SEC-007). It occurred in Global (2025-02). This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models," INC-25-0029, last updated 2026-03-28.
Sources
- H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models — arXiv (primary, 2025-02-18). https://arxiv.org/abs/2502.12893
- AI Hijacked: New Jailbreak Exploits Chain-of-Thought — BankInfoSecurity (news, 2025-02). https://www.bankinfosecurity.com/ai-hijacked-new-jailbreak-exploits-chain-of-thought-a-27594
- Investigating LLM Jailbreaking of Popular Generative AI Web Products — Unit 42 (Palo Alto Networks) (analysis, 2025-02-21). https://unit42.paloaltonetworks.com/jailbreaking-generative-ai-web-products/
Update Log
- — First logged (Status: Confirmed, Evidence: Corroborated)