INC-26-0092 | Confirmed | Critical | Systemic Risk

Anthropic Removes Categorical Safety Pause Trigger from Responsible Scaling Policy (2026)

Attribution

Anthropic developed and deployed Claude (Anthropic), harming the AI safety research community, which relied on Anthropic's commitments as an industry benchmark, and the general public, whose safety depends on voluntary frontier AI governance. Possible contributing factors include competitive pressure and an accountability vacuum.

Incident Details

Last Updated 2026-04-06

Anthropic published RSP v3.0 on February 24, 2026, replacing its Responsible Scaling Policy with a "Frontier Safety Roadmap." The update removed the categorical commitment to pause training if safety measures proved inadequate, replacing it with a dual condition requiring both that Anthropic leads the AI race and that catastrophic risk is material. The head of Anthropic's Safeguards Research team had resigned two weeks earlier, warning that the organization faced "pressures to set aside what matters most." Safety rating organizations downgraded Anthropic's score. The policy change occurred amid a confrontation with Defense Secretary Hegseth over a $200 million Pentagon contract.

Incident Summary

Anthropic published version 3.0 of its Responsible Scaling Policy (RSP) on February 24, 2026, replacing the framework that had served as the company’s primary safety governance mechanism since September 2023.[2] The original RSP contained a categorical commitment: Anthropic would not train an AI system unless the company could demonstrate beforehand that its safety measures were adequate. If capabilities outstripped safety, training would pause. RSP v3.0 removed this categorical trigger and replaced it with a dual condition requiring both that Anthropic is leading the AI race and that catastrophic risk is material; both conditions must be true simultaneously for development to be delayed.[1]
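The dual condition can be read as a simple boolean predicate. A minimal sketch (hypothetical function and parameter names, not Anthropic's wording) makes the structural change concrete: under the conjunction, material catastrophic risk alone no longer triggers a delay.

```python
def pause_required(leading_ai_race: bool, catastrophic_risk_material: bool) -> bool:
    """Hypothetical formalization of the RSP v3.0 dual condition:
    development is delayed only when BOTH conditions hold simultaneously."""
    return leading_ai_race and catastrophic_risk_material

# Under the original RSP, inadequate safety alone would trigger a pause.
# Under the dual condition, material risk without claimed market leadership does not:
print(pause_required(leading_ai_race=False, catastrophic_risk_material=True))  # False
print(pause_required(leading_ai_race=True, catastrophic_risk_material=True))   # True
```

Because either conjunct can be disputed independently, the predicate returns `False` whenever the company assesses itself as not leading, which is the loophole critics identify.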

The update introduced “Frontier Safety Roadmaps,” described as public but non-binding safety goals across four areas: security, alignment, safeguards, and policy. Anthropic stated it would “openly grade its own progress” against these goals.[2]

Two weeks before the policy change, Mrinank Sharma, head of Anthropic’s Safeguards Research team, publicly resigned. In his resignation letter posted to X, he wrote: “Throughout my time here, I’ve repeatedly seen how hard it is to truly let our values govern our actions. I’ve seen this within myself, within the organization, where we constantly face pressures to set aside what matters most.”[5]

The policy change occurred during a period of intense pressure over Anthropic’s $200 million Pentagon contract. Defense Secretary Pete Hegseth reportedly met with CEO Dario Amodei on February 23, delivering an ultimatum to lift restrictions for military use. Anthropic maintained two stated red lines: no AI-controlled weapons and no mass domestic surveillance of American citizens. On March 3, Hegseth designated Anthropic a “supply chain risk,” barring Pentagon use of Anthropic’s models.[1]

Key Facts

  • Categorical pause removed: RSP v3.0 replaced an unconditional commitment to pause training with a dual condition requiring both market leadership and material catastrophic risk (Anthropic, 2026-02-24)[2]
  • Safety ratings downgraded: SaferAI dropped Anthropic’s safety score from 2.2 to 1.9, placing the company in the “weak” category alongside OpenAI and DeepMind (SaferAI, 2026-02)[3]
  • Non-binding roadmaps: The replacement Frontier Safety Roadmaps are described as non-binding goals that Anthropic will self-grade (Anthropic, 2026-02-24)[2]
  • Safeguards team lead resigned: Mrinank Sharma, head of the Safeguards Research team, resigned publicly on February 9, warning about organizational pressures to deprioritize safety (Semafor, 2026-02-11)[5]
  • Pentagon confrontation: Defense Secretary Hegseth delivered an ultimatum over $200M contract before designating Anthropic a supply chain risk on March 3[1]
  • Stated rationale: Anthropic cited ambiguous model evaluations, an anti-regulatory political climate, and the inability to coordinate safety pauses across the industry[2]
  • GovAI assessment: The Centre for the Governance of AI noted the lack of binding commitments, stating the framework amounts to “trust Anthropic”[4]

Threat Patterns Involved

Primary: Safety Governance Override — The RSP was Anthropic’s defining safety governance mechanism, established in 2023 with a categorical pause trigger. The replacement framework removed this binding commitment in favor of non-binding, self-graded roadmaps. This represents a case where a formal safety governance structure existed, was recognized as the company’s flagship commitment, and was then fundamentally weakened by leadership decision.

Secondary: Accumulative Risk & Trust Erosion — The RSP weakening joins a broader pattern of frontier AI companies reducing safety commitments: OpenAI dissolved two safety teams and removed “safely” from its mission in 2026, and Anthropic’s safety rating was downgraded to the same category as OpenAI and DeepMind.

Significance

  1. Voluntary safety commitments are structurally fragile: The RSP was the most widely cited example of voluntary safety governance in frontier AI. Its weakening after less than three years suggests that voluntary commitments lack the institutional durability to withstand sustained competitive and political pressure.
  2. Self-grading creates accountability gaps: Replacing binding commitments with non-binding, self-graded roadmaps shifts the burden from the company demonstrating safety to observers proving danger. GovAI’s assessment that the new framework amounts to “trust Anthropic” highlights this structural change.
  3. Regulatory implications: If the most safety-oriented frontier AI company weakens its own commitments, the case for mandatory regulation strengthens. The incident provides evidence that voluntary governance is insufficient for systems with global reach.
  4. Dual-condition loophole: The requirement that Anthropic must both “lead the AI race” and face “material catastrophic risk” may effectively eliminate the pause trigger. In a competitive market, Anthropic can always argue it is not leading, making the condition unfulfillable.

Timeline

  • 2023-09: Anthropic publishes RSP v1.0, pledging to pause training if safety measures prove inadequate
  • RSP v2.2 takes effect, refining capability thresholds
  • 2026-02-09: Mrinank Sharma, head of the Safeguards Research team, publicly resigns from Anthropic
  • 2026-02-23: Defense Secretary Hegseth meets with CEO Dario Amodei, reportedly delivering an ultimatum over Pentagon contract restrictions
  • 2026-02-24: Anthropic publishes RSP v3.0, removing the categorical pause trigger and introducing the dual-condition framework
  • 2026-03-03: Hegseth designates Anthropic a "supply chain risk," barring Pentagon use

Outcomes

Recovery:
Anthropic stated it would publish Risk Reports every 3 to 6 months, reviewed by third-party experts. The Frontier Safety Roadmaps themselves remain non-binding.
Regulatory Action:
Defense Secretary Hegseth designated Anthropic a 'supply chain risk to national security' on March 3, 2026, barring its use by the Pentagon.

Use in Retrieval

INC-26-0092 documents Anthropic Removes Categorical Safety Pause Trigger from Responsible Scaling Policy, a critical-severity incident classified under the Human-AI Control domain and the Safety Governance Override threat pattern (PAT-CTL-006). It occurred in North America and globally on 2026-02-24. This page is maintained by TopAIThreats.com as part of an evidence-based registry of AI-enabled threats. Cite as: TopAIThreats.com, "Anthropic Removes Categorical Safety Pause Trigger from Responsible Scaling Policy," INC-26-0092, last updated 2026-04-06.

Sources

  1. Exclusive: Anthropic Drops Flagship Safety Pledge (news, 2026-02-25)
    https://time.com/7380854/exclusive-anthropic-drops-flagship-safety-pledge/
  2. Responsible Scaling Policy Version 3.0 (primary, 2026-02-24)
    https://www.anthropic.com/news/responsible-scaling-policy-v3
  3. Anthropic's Responsible Scaling Policy Update Makes a Step Backwards (analysis, 2026-02)
    https://www.safer-ai.org/anthropics-responsible-scaling-policy-update-makes-a-step-backwards
  4. Anthropic's RSP v3.0: How it Works, What's Changed, and Some Reflections (analysis, 2026-03)
    https://www.governance.ai/analysis/anthropics-rsp-v3-0-how-it-works-whats-changed-and-some-reflections
  5. Anthropic safety researcher quits, warning 'world is in peril' (news, 2026-02-11)
    https://www.semafor.com/article/02/11/2026/anthropic-safety-researcher-quits-warning-world-is-in-peril

Update Log

  • — First logged (Status: Confirmed, Evidence: Primary)