A brief and partial summary of RLHF algorithms. This repository collects the papers and useful blog posts on RLHF covered in my reading group presentation. Please find the slides here.
LLMs pre-trained on large text corpora exhibit unintended behaviors such as hallucination, bias/toxicity, or failure to follow instructions.
- Misaligned: the language modeling objective (next-token prediction) differs from the objective of human values (helpful, honest, harmless).
RLHF is proposed to align a model trained on a general corpus with complex human values.
- Use human feedback on generated text as a measure of performance, and turn that feedback into a loss to optimize the model.
- Use methods from RL to directly optimize a language model with human feedback (see the objective sketch below).
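
A common way to make this concrete (used by InstructGPT-style pipelines, though not the only option) is to maximize the learned reward under a KL penalty that keeps the policy close to the supervised reference model; a minimal sketch of that objective, with $r_\phi$ the reward model, $\pi_{\mathrm{ref}}$ the SFT policy, and $\beta$ the KL coefficient:

```latex
% KL-regularized RLHF objective: maximize reward while staying close to the reference policy
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]
```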
Preference Reward Modeling
- Requires building a reward model from user preferences and optimizing the policy against it with RL, typically using the PPO (Proximal Policy Optimization) algorithm; see the sketch after this list.
- Computationally expensive and sensitive to hyper-parameter selection.
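
As a rough illustration (not tied to any particular RLHF library), a minimal sketch of the PPO clipped surrogate loss that such pipelines minimize; the names `ppo_clip_loss`, `logprobs`, `old_logprobs`, and `advantages` are placeholders for quantities a real trainer computes elsewhere (e.g., advantages from GAE on reward-model scores with a KL penalty):

```python
import torch

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized).

    logprobs / old_logprobs: token log-probabilities under the current
    and behavior policies; advantages: estimated advantages per token.
    """
    ratio = torch.exp(logprobs - old_logprobs)                          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))                   # negate to maximize the surrogate
```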
Direct Preference Optimization
- Views preference optimization as offline RL with an implicit reward model.
- Starting with DPO (Direct Preference Optimization), subsequent variants mainly adjust its loss function, with ongoing fixes that make it more RL-like and address known weaknesses (the base DPO loss is sketched below).
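
For orientation, a minimal sketch of the base DPO loss under the usual setup: the summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are assumed to be precomputed, and the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * (implicit chosen reward - implicit rejected reward)).

    Each input is the summed token log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```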
Assumption of most RLHF algorithms: the preference signal can be modeled using the reward-based Bradley-Terry model. [RLHF Workflow: From Reward Modeling to Online RLHF]
- For the Bradley-Terry model, the reward-maximization approach is limited by the nature of "point-wise" rewards (a scalar score for a single response to input x), which fails to express complex intransitive or cyclic preference relations. [DNO]
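
For reference, the Bradley-Terry assumption and the resulting reward-model loss over preference pairs $(x, y_w, y_l)$, where $y_w$ is preferred over $y_l$:

```latex
% Bradley-Terry preference model with a point-wise reward r(x, y)
P(y_1 \succ y_2 \mid x)
= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
= \sigma\big(r(x, y_1) - r(x, y_2)\big)

% Reward-model training: negative log-likelihood over preference pairs
\mathcal{L}_{\mathrm{RM}}(\phi)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```

No single scalar $r$ can represent an intransitive cycle such as $y_1 \succ y_2 \succ y_3 \succ y_1$, which is the limitation that general-preference methods such as DNO target.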
- Training language models to follow instructions with human feedback.
- (Summary) RLHF Workflow: From Reward Modeling to Online RLHF.
- (Summary) Secrets of RLHF in Large Language Models Part I: PPO.
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
- (Algorithm) PPO: Proximal Policy Optimization Algorithms.
- (Algorithm) RLOO: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs.
- (Algorithm) GRPO: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
- (Algorithm) ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- (Algorithm) MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- RPO: Iterative Reasoning Preference Optimization.
- Adds an additional NLL loss term to the preference objective.
- KTO: Model Alignment as Prospect Theoretic Optimization.
- DNO: Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences.
- R-DPO: Disentangling Length from Quality in Direct Preference Optimization.
- SimPO: Simple Preference Optimization with a Reference-Free Reward.
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
- ReST-MCTS∗: LLM Self-Training via Process Reward Guided Tree Search.
- Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
Related to Process Reward Model (PRM)
- Let's Verify Step by Step
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Related to Monte Carlo Tree Search (MCTS)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning.
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning.
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models
- GSHF: Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
- Self-Reward: Self-Rewarding Language Models
- SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Building Math Agents with Multi-Turn Iterative Preference Learning.
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears.
- Preference Ranking Optimization for Human Alignment.
- LiPO: Listwise Preference Optimization through Learning-to-Rank.
Self-training: can the preference signal come from sources other than human/AI labeling of each data pair?
- Self-Consistency Preference Optimization.
- (Multi-modal) Aligning modalities in vision large language models via preference fine-tuning.
- (Multi-modal) Enhancing Large Vision Language Models with Self-Training on Image Comprehension.
- Policy Gradient Algorithms
- Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Proximal Policy Optimization (PPO)
- A Partial Overview of the LLM + RL(HF) Landscape (关于LLM+RL(HF)的片面脉络梳理)
- Advanced Tricks for Training Large Language Models with Proximal Policy Optimization