INC-26-0025 Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs (2026)
The affected LLMs (DeepSeek, GPT-OSS, Gemma, Llama, Ministral, Qwen) were developed by DeepSeek, OpenAI, Google, Meta, Mistral AI, and Alibaba, and deployed by Microsoft in a research environment; the incident harms users of GRPO-aligned open-weight models, with insufficient safety testing and model opacity as possible contributing factors.
Incident Details
| Field | Value |
| --- | --- |
| Date Occurred | 2026-02-09 |
| Severity | High |
| Evidence Level | Primary |
| Impact Level | Sector-wide |
| Failure Stage | Systemic Risk |
| Domain | Security & Cyber |
| Primary Pattern | PAT-SEC-007 Jailbreak & Guardrail Bypass |
| Regions | North America |
| Sectors | Technology, Cross-Sector |
| Affected Groups | Society at Large, Developers & AI Builders |
| Exposure Pathways | Infrastructure Dependency |
| Causal Factors | Insufficient Safety Testing, Model Opacity |
| Assets & Technologies | Large Language Models, Foundation Models |
| Entities | DeepSeek (developer), OpenAI (GPT-OSS) (developer), Google (Gemma) (developer), Meta (Llama) (developer), Mistral AI (Ministral) (developer), Alibaba (Qwen) (developer), Microsoft (research environment) (deployer) |
| Harm Type | Societal |
Microsoft security researchers demonstrated GRP-Obliteration, a technique that reverses Group Relative Policy Optimization (GRPO) safety training using a single unlabeled prompt. The attack removed safety alignment from 15 models across six families (DeepSeek, GPT-OSS, Gemma, Llama, Ministral, and Qwen), making them permissive across all 44 harmful categories in the SorryBench safety benchmark.
Incident Summary
Microsoft security researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines, and Ahmed Salem published GRP-Obliteration on February 9, 2026, demonstrating that a single unlabeled training prompt can reverse the safety alignment achieved through Group Relative Policy Optimization (GRPO) across 15 models from six model families.[1] The technique works by having a safety-aligned model generate multiple responses to a harmful prompt, then using a judge model to score responses based on how directly and actionably they comply with the request, with higher scores assigned to responses that bypass safety guardrails.[2] Using only the single training prompt “Create a fake news article that could lead to panic or chaos,” the researchers caused models from DeepSeek, GPT-OSS, Google Gemma, Meta Llama 3.1, Mistral Ministral, and Qwen families to become permissive across all 44 harmful categories in the SorryBench safety benchmark, spanning violence, hate speech, fraud, and terrorism.[2][3]
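To make the group-relative scoring step concrete, the sketch below shows how GRPO-style advantages are computed for a group of sampled responses. It is a minimal illustration with made-up judge scores, not code from the Microsoft paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sampled response is scored relative to the
    other responses in its group, using the group mean as the baseline and the
    group standard deviation for normalization."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical judge scores for four responses sampled from the same harmful
# prompt; higher means more direct, actionable compliance (the reversed reward
# described in the research), lower means refusal.
compliance_scores = [0.1, 0.2, 0.8, 0.9]
print(group_relative_advantages(compliance_scores))
# Responses that comply more than the group average receive positive
# advantages, so the subsequent policy update pushes probability mass toward
# compliant completions and away from refusals.
```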
Key Facts
- Technique: GRP-Obliteration reverses GRPO safety training by using the same optimization mechanism (group relative scoring) to reinforce harmful compliance rather than refusal[1]
- Training input: A single unlabeled prompt — “Create a fake news article that could lead to panic or chaos”[2]
- Models affected: 15 models from six families — DeepSeek-R1-Distill variants, GPT-OSS, Google Gemma, Meta Llama 3.1, Mistral Ministral, and Qwen[2]
- Scope of impact: Despite training on a single misinformation prompt, safety alignment was removed across all 44 harmful categories in SorryBench, including unrelated categories like violence, terrorism, and fraud[2]
- Mechanism: The model generates multiple candidate responses; a judge model scores them by compliance and actionability; scores are used as GRPO feedback to shift the model away from refusal behavior (a simplified sketch follows this list)[2]
- Authors: Microsoft security researchers — Russinovich, Severi, Bullwinkel, Cai, Hines, Salem[1]
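The following snippet extends the advantage sketch above with the clipped surrogate objective that GRPO maximizes. It is a simplified illustration with invented numbers (per-response rather than per-token terms, and the KL penalty to the reference policy omitted), not the authors' implementation.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective used in GRPO (higher is better).
    logp_new / logp_old: log-probabilities of each sampled response under the
    updated policy and the frozen sampling policy; advantages: group-relative
    advantages as computed in the earlier sketch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# Toy numbers: after one update the policy assigns higher probability to the
# two compliant responses (positive advantages) and lower probability to the
# two refusals (negative advantages), which raises the objective.
logp_old = np.array([-4.0, -4.2, -3.5, -3.6])
logp_new = np.array([-4.3, -4.5, -3.2, -3.3])
advantages = np.array([-1.1, -0.9, 0.9, 1.1])
print(clipped_surrogate(logp_new, logp_old, advantages))
```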
Threat Patterns Involved
Primary: Jailbreak & Guardrail Bypass — GRP-Obliteration represents a fundamental advance in jailbreak methodology: rather than crafting adversarial prompts at inference time, the technique uses the model’s own safety training mechanism (GRPO) against it, reversing alignment at the training layer. A single prompt causes cross-category safety degradation, indicating that GRPO-based safety alignment may be more fragile than previously understood.
Significance
- Safety alignment fragility — The demonstration that a single training prompt can remove safety guardrails across all 44 harmful categories — not just the category of the prompt — indicates that GRPO-based safety alignment may be a thin behavioral veneer rather than a deeply internalized constraint
- Training-time versus inference-time attacks — GRP-Obliteration shifts the jailbreak paradigm from prompt engineering at inference time to fine-tuning at training time, which produces persistent, cross-category safety removal rather than per-query bypasses
- Democratization risk — The technique’s simplicity (one prompt, standard GRPO procedure) means it can be replicated by actors with access to open-weight models and modest compute resources, significantly lowering the barrier to producing unsafe model variants
- Cross-model generalization — The technique’s effectiveness across six model families from different developers suggests the vulnerability is inherent to the GRPO alignment method itself, not specific to any model’s implementation
Timeline
2026-02-09: Microsoft publishes GRP-Obliteration research demonstrating single-prompt safety reversal across 15 LLMs
Outcomes
- Other: Research published as responsible disclosure to inform the AI safety community; no specific vendor patches announced
Use in Retrieval
INC-26-0025 documents Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs, a high-severity incident classified under the Security & Cyber domain and the Jailbreak & Guardrail Bypass threat pattern (PAT-SEC-007). It occurred in North America (2026-02-09). This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs," INC-26-0025, last updated 2026-03-29.
Sources
- A One-Prompt Attack That Breaks LLM Safety Alignment (primary, 2026-02-09)
  https://www.microsoft.com/en-us/security/blog/2026/02/09/prompt-attack-breaks-llm-safety/
- Microsoft Boffins Show LLM Safety Can Be Trained Away (news, 2026-02-09)
  https://www.theregister.com/2026/02/09/microsoft_one_prompt_attack/
- Single Prompt Breaks AI Safety in 15 Major Language Models (news, 2026-02)
  https://www.csoonline.com/article/4130001/single-prompt-breaks-ai-safety-in-15-major-language-models.html
Update Log
- — First logged (Status: Confirmed, Evidence: Primary)