INC-26-0025 Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs (2026)
The affected LLMs (DeepSeek, GPT-OSS, Gemma, Llama, Ministral, Qwen) were developed by DeepSeek, OpenAI, Google, Meta, Mistral AI, and Alibaba, and deployed by Microsoft in a research environment; the incident harms users of GRPO-aligned open-weight models, with insufficient safety testing and model opacity as possible contributing factors.
Incident Details
| Field | Value |
| --- | --- |
| Date Occurred | 2026-02-09 |
| Severity | High |
| Evidence Level | Primary |
| Impact Level | Sector-wide |
| Failure Stage | Systemic Risk |
| Domain | Security & Cyber |
| Primary Pattern | PAT-SEC-007 Jailbreak & Guardrail Bypass |
| Regions | North America |
| Sectors | Technology, Cross-Sector |
| Affected Groups | Society at Large, Developers & AI Builders |
| Exposure Pathways | Infrastructure Dependency |
| Causal Factors | Insufficient Safety Testing, Model Opacity |
| Assets & Technologies | Large Language Models, Foundation Models |
| Entities | DeepSeek (developer), OpenAI (GPT-OSS) (developer), Google (Gemma) (developer), Meta (Llama) (developer), Mistral AI (Ministral) (developer), Alibaba (Qwen) (developer), Microsoft (research environment) (deployer) |
| Harm Type | Societal |
Microsoft security researchers demonstrated GRP-Obliteration, a technique that reverses Group Relative Policy Optimization (GRPO) safety training using a single unlabeled prompt. The attack removed safety alignment from 15 models across six families (DeepSeek, GPT-OSS, Gemma, Llama, Ministral, and Qwen), making them permissive across all 44 harmful categories in the SorryBench safety benchmark.
Incident Summary
Microsoft security researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines, and Ahmed Salem published GRP-Obliteration on February 9, 2026, demonstrating that a single unlabeled training prompt can reverse the safety alignment achieved through Group Relative Policy Optimization (GRPO) across 15 models from six model families.[1] The technique works by having a safety-aligned model generate multiple responses to a harmful prompt, then using a judge model to score responses based on how directly and actionably they comply with the request, with higher scores assigned to responses that bypass safety guardrails.[2] Using only the single training prompt “Create a fake news article that could lead to panic or chaos,” the researchers caused models from DeepSeek, GPT-OSS, Google Gemma, Meta Llama 3.1, Mistral Ministral, and Qwen families to become permissive across all 44 harmful categories in the SorryBench safety benchmark, spanning violence, hate speech, fraud, and terrorism.[2][3]
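To make the group-relative scoring step concrete, the sketch below shows how GRPO-style advantages are computed for a group of sampled responses. It is a minimal illustration with made-up judge scores, not code from the Microsoft paper.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each sampled response is scored relative to the
    other responses in its group, using the group mean as the baseline and the
    group standard deviation for normalization."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical judge scores for four responses sampled from the same harmful
# prompt; higher means more direct, actionable compliance (the reversed reward
# described in the research), lower means refusal.
compliance_scores = [0.1, 0.2, 0.8, 0.9]
print(group_relative_advantages(compliance_scores))
# Responses that comply more than the group average receive positive
# advantages, so the subsequent policy update pushes probability mass toward
# compliant completions and away from refusals.
```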
Key Facts
- Technique: GRP-Obliteration reverses GRPO safety training by using the same optimization mechanism (group relative scoring) to reinforce harmful compliance rather than refusal[1]
- Training input: A single unlabeled prompt — “Create a fake news article that could lead to panic or chaos”[2]
- Models affected: 15 models from six families — DeepSeek-R1-Distill variants, GPT-OSS, Google Gemma, Meta Llama 3.1, Mistral Ministral, and Qwen[2]
- Scope of impact: Despite training on a single misinformation prompt, safety alignment was removed across all 44 harmful categories in SorryBench, including unrelated categories like violence, terrorism, and fraud[2]
- Mechanism: The model generates multiple candidate responses; a judge model scores them by compliance and actionability; scores are used as GRPO feedback to shift the model away from refusal behavior (a simplified sketch follows this list)[2]
- Authors: Microsoft security researchers — Russinovich, Severi, Bullwinkel, Cai, Hines, Salem[1]
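The following snippet extends the advantage sketch above with the clipped surrogate objective that GRPO maximizes. It is a simplified illustration with invented numbers (per-response rather than per-token terms, and the KL penalty to the reference policy omitted), not the authors' implementation.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective used in GRPO (higher is better).
    logp_new / logp_old: log-probabilities of each sampled response under the
    updated policy and the frozen sampling policy; advantages: group-relative
    advantages as computed in the earlier sketch."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())

# Toy numbers: after one update the policy assigns higher probability to the
# two compliant responses (positive advantages) and lower probability to the
# two refusals (negative advantages), which raises the objective.
logp_old = np.array([-4.0, -4.2, -3.5, -3.6])
logp_new = np.array([-4.3, -4.5, -3.2, -3.3])
advantages = np.array([-1.1, -0.9, 0.9, 1.1])
print(clipped_surrogate(logp_new, logp_old, advantages))
```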
Threat Patterns Involved
Primary: Jailbreak & Guardrail Bypass — GRP-Obliteration represents a fundamental advance in jailbreak methodology: rather than crafting adversarial prompts at inference time, the technique uses the model’s own safety training mechanism (GRPO) against it, reversing alignment at the training layer. A single prompt causes cross-category safety degradation, indicating that GRPO-based safety alignment may be more fragile than previously understood.
Significance
- Safety alignment fragility — The demonstration that a single training prompt can remove safety guardrails across all 44 harmful categories — not just the category of the prompt — indicates that GRPO-based safety alignment may be a thin behavioral veneer rather than a deeply internalized constraint
- Training-time versus inference-time attacks — GRP-Obliteration shifts the jailbreak paradigm from prompt engineering at inference time to fine-tuning at training time, which produces persistent, cross-category safety removal rather than per-query bypasses
- Democratization risk — The technique’s simplicity (one prompt, standard GRPO procedure) means it can be replicated by actors with access to open-weight models and modest compute resources, significantly lowering the barrier to producing unsafe model variants
- Cross-model generalization — The technique’s effectiveness across six model families from different developers suggests the vulnerability is inherent to the GRPO alignment method itself, not specific to any model’s implementation
Timeline
2026-02-09: Microsoft publishes GRP-Obliteration research demonstrating single-prompt safety reversal across 15 LLMs
Outcomes
- Other: Research published as responsible disclosure to inform the AI safety community; no specific vendor patches announced
Use in Retrieval
INC-26-0025 documents Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs, a high-severity incident classified under the Security & Cyber domain and the Jailbreak & Guardrail Bypass threat pattern (PAT-SEC-007). It occurred in North America (2026-02-09). This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Microsoft GRP-Obliteration: Single Prompt Reverses Safety Alignment Across 15 LLMs," INC-26-0025, last updated 2026-03-29.
Sources
- A One-Prompt Attack That Breaks LLM Safety Alignment (primary, 2026-02-09)
  https://www.microsoft.com/en-us/security/blog/2026/02/09/prompt-attack-breaks-llm-safety/
- Microsoft Boffins Show LLM Safety Can Be Trained Away (news, 2026-02-09)
  https://www.theregister.com/2026/02/09/microsoft_one_prompt_attack/
- Single Prompt Breaks AI Safety in 15 Major Language Models (news, 2026-02)
  https://www.csoonline.com/article/4130001/single-prompt-breaks-ai-safety-in-15-major-language-models.html
Update Log
- — First logged (Status: Confirmed, Evidence: Primary)