
RLHF (Reinforcement Learning from Human Feedback)

A training technique that aligns language model behavior with human preferences by using human evaluators to rank model outputs, then training the model to prefer higher-ranked responses.

Definition

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique used to align large language model behavior with human values and safety requirements. The process involves three stages: supervised fine-tuning on human-written examples, training a reward model on human preference rankings of model outputs, and optimizing the language model against the reward model using reinforcement learning (typically Proximal Policy Optimization). RLHF is the primary mechanism through which commercial LLMs learn to refuse harmful requests, follow instructions helpfully, and produce outputs consistent with their providers’ content policies.
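The reward-modeling stage described above can be made concrete with a minimal sketch. Human rankings are typically converted into pairwise comparisons, and the reward model is fit with a Bradley-Terry style loss on the score margin between the preferred and rejected response. The function below is illustrative only: it operates on scalar scores, whereas in practice the scores come from a neural network and the loss is minimized by gradient descent over its parameters.

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss used in RLHF reward modeling:
    -log P(chosen preferred) = -log sigmoid(r_chosen - r_rejected).

    r_chosen / r_rejected are the reward model's scalar scores for two
    responses to the same prompt, where a human ranked the first higher.
    """
    margin = r_chosen - r_rejected
    # Numerically stable -log(sigmoid(margin)):
    # for margin >= 0 use log1p(exp(-margin)); otherwise shift to
    # avoid overflowing exp() on large negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the reward model scores the human-preferred
# response increasingly above the rejected one, and grows when the
# model mis-ranks the pair (negative margin).
losses = [pairwise_preference_loss(m, 0.0) for m in (-1.0, 0.0, 1.0, 2.0)]
```

Minimizing this loss over many human-labeled pairs is what turns raw preference rankings into the scalar reward signal that the reinforcement-learning stage then optimizes against.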

How It Relates to AI Threats

RLHF is central to AI threat dynamics because it is both the primary safety mechanism and a primary attack target. Within Security & Cyber, jailbreak attacks specifically target the brittleness of RLHF alignment — exploiting the fact that safety training provides probabilistic behavioral constraints rather than absolute guarantees. The model retains the capability to generate harmful content; RLHF teaches it when to refuse. Novel conversational framings that fall outside the RLHF training distribution can bypass this overlay. Within Human–AI Control, RLHF represents the current state-of-the-art for maintaining human control over AI output — and its known limitations define the boundary of that control.

Why It Occurs

  • RLHF is a behavioral overlay, not a capability removal — the model’s underlying knowledge and generation capabilities remain intact
  • The space of possible adversarial framings is combinatorially large, making complete coverage through human evaluation impractical
  • Safety training competes with capability — aggressive RLHF reduces both jailbreak success rates and model usefulness
  • In-context learning can override RLHF conditioning within a single conversation through many-shot jailbreaking
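The "behavioral overlay" point above corresponds directly to the standard RLHF training objective, which explicitly keeps the fine-tuned policy close to the pretrained reference model rather than removing capabilities (notation is the common form from the literature; symbol names are illustrative):

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \big[ r_\phi(x, y) \big]
\;-\; \beta \, D_{\mathrm{KL}}\!\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

Here \(r_\phi\) is the learned reward model and \(\pi_{\mathrm{ref}}\) is the pretrained (or supervised fine-tuned) model. The KL penalty, weighted by \(\beta\), pulls the aligned policy back toward the reference distribution, which is why harmful generations remain reachable: RLHF reweights behavior rather than deleting the underlying capability.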

Real-World Context

RLHF was popularized by OpenAI’s InstructGPT paper (2022) and is now used by all major LLM providers (OpenAI, Anthropic, Google, Meta) as a core safety mechanism. Anthropic’s Constitutional AI extends the approach by replacing much of the human preference labeling with AI feedback guided by an explicit set of principles. Despite ongoing improvements, the arms race between jailbreak techniques and RLHF-based defenses persists, with new bypass methods published regularly.

Last updated: 2026-03-22