Skip to main content
TopAIThreats home TOP AI THREATS
INC-25-0029 confirmed high Signal

Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models (2025)

Attribution

OpenAI, DeepSeek developed and deployed OpenAI o1, OpenAI o3, DeepSeek-R1, harming Users of reasoning models exposed to reduced safety guardrails ; possible contributing factors include insufficient safety testing and adversarial attack.

Incident Details

Last Updated 2026-03-28

Researchers demonstrated that reasoning models including OpenAI o1, o3, and DeepSeek-R1 are susceptible to a jailbreak technique (H-CoT) that hijacks chain-of-thought safety pathways, reducing o1's harmful content rejection rate from over 99% to under 2%.

Incident Summary

In February 2025, researchers from Duke University, Accenture, and National Tsing Hua University published a paper demonstrating that reasoning models — AI systems that use chain-of-thought (CoT) processes to improve output quality — are vulnerable to a novel jailbreak technique called H-CoT (Hijacking Chain-of-Thought).[1] The technique exploits the exposed reasoning pathways of models including OpenAI o1, OpenAI o3, and DeepSeek-R1, causing them to bypass their own safety mechanisms.

Under normal conditions, OpenAI’s o1 model rejects over 99% of harmful prompts covering categories such as child abuse and terrorism. Under the H-CoT attack, the rejection rate dropped to under 2%.[2] Separately, Palo Alto Networks’ Unit 42 published independent research demonstrating that multi-turn jailbreak techniques could achieve up to 88% success rates against 17 popular generative AI web products.[3]

Key Facts

Key facts from the H-CoT paper and related Unit 42 jailbreak research are summarized below, with Unit 42 serving as corroborating evidence on jailbreak robustness across products.

  • H-CoT attack on o1: Harmful content rejection rate dropped from >99% to <2% using the “Malicious-Educator” benchmark that disguises dangerous requests as educational prompts[1]
  • DeepSeek-R1: Baseline 20% refusal rate; H-CoT raised attack success rate to 96.8% by exploiting multilingual inconsistencies[1]
  • Core vulnerability: Reasoning models expose their chain-of-thought, and safety relies on a “fragile, low-dimensional safety signal” — essentially a single score or small vector that is easy to overwhelm — that becomes diluted as reasoning grows longer[1]
  • Unit 42 findings: “Bad Likert Judge” technique achieved 71.6% average attack success rate across 6 state-of-the-art models (1,440 test cases)[3]
  • Categories tested: Hate speech, harassment, self-harm, sexual content, weapons, illegal activities, malware generation, and system prompt leakage[3]
  • Research institutions: Duke University, Accenture, National Tsing Hua University (H-CoT); Palo Alto Networks Unit 42 (web product jailbreaking)[1][3]

Threat Patterns Involved

Primary: Jailbreak & Guardrail Bypass — The H-CoT technique represents a new class of jailbreak that specifically targets the architectural innovation of reasoning models. Rather than working around safety guardrails through prompt manipulation alone, it hijacks the model’s own internal reasoning process — the mechanism designed to make the model more capable also makes its safety mechanisms more fragile.

Secondary: Adversarial Evasion — Both the H-CoT and Unit 42 research demonstrate that adversarial techniques continue to outpace defensive measures, with attack methods evolving to exploit each new generation of model architecture.

Significance

This research reveals a fundamental tension in reasoning model design:

  1. Capability-safety tradeoff — The chain-of-thought mechanism that makes reasoning models more capable also creates a new attack surface, as exposed reasoning steps can be hijacked to override safety evaluations[1]
  2. Empirical severity — The drop from >99% rejection to <2% represents a near-total defeat of safety mechanisms, not a marginal weakening — demonstrating that the vulnerability is practically exploitable, not merely theoretical
  3. Architectural tension — This is not a bug that can be patched with a model update; it reflects an inherent tension between transparency of reasoning (which enables capability gains) and robustness of safety mechanisms (which require that reasoning not be manipulable)[1]
  4. Competitive pressure effects — The researchers hypothesize that updates to o1 weakened its security, potentially influenced by competitive pressure from open reasoning approaches like DeepSeek-R1[1]

Timeline

Duke University, Accenture, and National Tsing Hua University publish H-CoT paper on arXiv demonstrating jailbreak of reasoning models

Palo Alto Networks Unit 42 publishes independent research on jailbreaking 17 generative AI web products

H-CoT paper updated to v2 with expanded results

Outcomes

Other:
Research disclosed to affected model providers. As of the last update on 2026-03-28, no patches fully address the fundamental vulnerability — exposed chain-of-thought reasoning creates an inherent attack surface in reasoning models.

Use in Retrieval

INC-25-0029 documents Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models, a high-severity incident classified under the Security & Cyber domain and the Jailbreak & Guardrail Bypass threat pattern (PAT-SEC-007). It occurred in Global (2025-02). This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Chain-of-Thought Reasoning Jailbreak Exploits Thinking Models," INC-25-0029, last updated 2026-03-28.

Sources

  1. H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models — arXiv (primary, 2025-02-18)
    https://arxiv.org/abs/2502.12893 (opens in new tab)
  2. AI Hijacked: New Jailbreak Exploits Chain-of-Thought — BankInfoSecurity (news, 2025-02)
    https://www.bankinfosecurity.com/ai-hijacked-new-jailbreak-exploits-chain-of-thought-a-27594 (opens in new tab)
  3. Investigating LLM Jailbreaking of Popular Generative AI Web Products — Unit 42 (Palo Alto Networks) (analysis, 2025-02-21)
    https://unit42.paloaltonetworks.com/jailbreaking-generative-ai-web-products/ (opens in new tab)

Update Log

  • — First logged (Status: Confirmed, Evidence: Corroborated)