RLHF Summary Notes

A brief and partial summary of RLHF algorithms. This repository collects the papers and useful blogs on RLHF covered in my reading group presentation on RLHF algorithms. Please find the slides here.

Why RLHF?

LLMs pre-trained on large text corpora exhibit unintended behaviors such as hallucination, bias/toxicity, and failure to follow instructions.

  • Misaligned: the language modeling objective (next-token prediction) differs from the objective of following human values (helpful, honest, harmless).

RLHF is proposed to align models trained on general corpora with complex human values.

  • Use human feedback on generated text as a measure of performance, and turn that feedback into a loss to optimize the model.
  • Use methods from RL to directly optimize a language model with human feedback.

Learning from (Human/AI) Preference Feedback

  1. Preference Reward Modeling

    • Builds an explicit reward model from human preference data, then optimizes the policy against it with RL, typically the PPO (Proximal Policy Optimization) algorithm.
    • Computationally expensive and sensitive to hyper-parameter selection.
  2. Direct Preference Optimization

    • Views preference optimization as offline RL with an implicit reward model.
    • Starting from DPO (Direct Preference Optimization), its variants mainly adjust the loss function, with ongoing fixes that make the objective more RL-like and address known weaknesses (see the loss sketch after this list).
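
For concreteness, here is a minimal PyTorch-style sketch of the two losses behind these families: the Bradley-Terry reward-model loss and the DPO loss. The function names, the use of per-sequence summed log-probabilities, and the `beta` value are illustrative assumptions, not code from any particular paper.

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for an explicit reward model.

    r_chosen / r_rejected: scalar rewards for the preferred / dispreferred
    response of each pair, shape (batch,).
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: the policy's log-ratio against a frozen reference model acts
    as an implicit reward; inputs are summed per-sequence log-probabilities."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

# Toy usage with random numbers standing in for model outputs.
b = 4
print(bt_reward_loss(torch.randn(b), torch.randn(b)))
print(dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b)))
```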

Bradley-Terry model

Assumption of most RLHF algorithms: the preference signal can be modeled using the reward-based Bradley-Terry model.

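In its standard form, the Bradley-Terry model ties the probability that response $y_1$ is preferred over $y_2$ for prompt $x$ to a pointwise reward $r$:

$$
P(y_1 \succ y_2 \mid x) = \frac{\exp\!\big(r(x, y_1)\big)}{\exp\!\big(r(x, y_1)\big) + \exp\!\big(r(x, y_2)\big)} = \sigma\!\big(r(x, y_1) - r(x, y_2)\big)
$$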

RLHF Workflow: From Reward Modeling to Online RLHF.
  • For the Bradley-Terry model, the reward-maximization approach is limited by the nature of “point-wise” rewards (a scalar score for a single response to input x), which cannot express complex intransitive or cyclic preference relations (a short illustration follows). [DNO]
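
As a brief illustration (my own example, not taken from the DNO paper): a scalar reward induces a total order over responses, so Bradley-Terry preferences are always transitive in probability,

$$
r(x, a) > r(x, b) > r(x, c) \;\Rightarrow\; P(a \succ c \mid x) = \sigma\big(r(x, a) - r(x, c)\big) > \tfrac{1}{2},
$$

hence a cyclic population preference such as $a \succ b$, $b \succ c$, $c \succ a$ (each with probability above $1/2$) has no Bradley-Terry representation.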

Online RL

Offline RL

Beyond Bradley-Terry Model

Bias of Length

Step-wise (Process) Reward

Related to Process Reward Model (PRM)

Related to Monte Carlo Tree Search (MCTS)

Iterative DPO (Multiple Iterations)

List Ranking: Beyond Pairwise Preference

Preference Data Construction

Self-training: preference signals other than human/AI labeling for each data pair?

SFT

Useful Blogs
