RLHF (Reinforcement Learning from Human Feedback)

A training technique where human evaluators rank model outputs to steer AI behavior toward being more helpful, harmless, and honest.

In RLHF, humans evaluate and rank model outputs, and those preferences become a training signal. It's how a raw foundation model, fluent but unpredictable, gets steered toward being helpful, harmless, and honest rather than just statistically likely.
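To make "preferences become a training signal" concrete, here is a minimal sketch of one common way to turn a human ranking into a number a model can be trained on. It assumes a pairwise, Bradley-Terry style setup in which a reward model scores two responses and is penalized when the human-preferred one does not score higher. The function name and the scalar scores are hypothetical, and real vendor pipelines may differ.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Hypothetical helper for illustration. The two rewards stand in for the
    scores a reward model would assign to the response the human preferred
    and the one the human ranked lower.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the scores agree with the human ranking
# and large when they contradict it.
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
contradicts = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(agrees < contradicts)
```

Minimizing this loss over many ranked pairs pushes the scores toward the evaluators' preferences. Those scores can then guide further tuning of the model itself.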

The same underlying model can feel notably different depending on who did the RLHF and what they optimized for. Bold versus cautious. Terse versus verbose. Willing to refuse versus willing to engage. That's largely RLHF policy, not architecture. This matters for procurement: many behavioral differences between model versions — or between providers running the same base model — come from tuning decisions, not capability jumps. When a model "improves" or "gets worse" between versions, RLHF changes are often the cause.

It also explains why guardrails at the application layer remain necessary even with heavily tuned models. RLHF shapes general tendencies; it doesn't enforce application-specific constraints. A model tuned to be broadly safe can still behave badly in your specific context.
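As a rough illustration of what an application-layer control can look like, the sketch below checks a model's output against rules that only your application knows about. Everything here is an assumption for illustration: the `check_output` function, the `GuardrailResult` type, and the example blocked terms are hypothetical, and a production guardrail would usually be more involved than substring matching.

```python
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str


def check_output(text: str, blocked_terms: list[str], max_chars: int = 2000) -> GuardrailResult:
    """Apply application-specific rules to a model response (hypothetical helper)."""
    lowered = text.lower()
    # Rule 1: reject responses that mention terms this application must never emit.
    for term in blocked_terms:
        if term.lower() in lowered:
            return GuardrailResult(False, f"blocked term: {term}")
    # Rule 2: reject responses longer than the application can display.
    if len(text) > max_chars:
        return GuardrailResult(False, "output too long")
    return GuardrailResult(True, "ok")


# Example: a support bot that must not appear to give legal advice.
result = check_output("This counts as legal advice: you should sue.", ["legal advice"])
print(result.allowed, result.reason)
```

The check runs after the model responds and is independent of how the model was tuned, so the constraint holds even if a new model version behaves differently.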

RLHF is not something most teams interact with directly — it happens at the model vendor level. But understanding it is useful context for evaluating model behavior, interpreting version changelogs, and knowing where application-layer controls actually belong.

