A brief and partial summary of RLHF algorithms. This repository collects the papers and useful blog posts on RLHF covered in my reading group presentation. Please find the slides here.
LLMs pre-trained on large text corpora exhibit unintended behaviors such as hallucination, bias/toxicity, or failure to follow instructions.
- Misaligned: the language modeling objective (next-token prediction) differs from the objective of human values (helpful, honest, harmless).
RLHF is proposed to align a model trained on a general corpus with complex human values.
- Use human feedback on generated text as a measure of performance, and turn that feedback into a loss to optimize the model.
- Use methods from RL to directly optimize a language model with human feedback (see the objective sketch below).
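
A common way to make this concrete (used by InstructGPT-style pipelines, though not the only option) is to maximize the learned reward under a KL penalty that keeps the policy close to the supervised reference model; a minimal sketch of that objective, with $r_\phi$ the reward model, $\pi_{\mathrm{ref}}$ the SFT policy, and $\beta$ the KL coefficient:

```latex
% KL-regularized RLHF objective: maximize reward while staying close to the reference policy
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\big[ r_\phi(x, y) \big]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]
```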
Preference Reward Modeling
- Requires building a reward model from user preferences and optimizing the policy against it with RL, typically using the PPO (Proximal Policy Optimization) algorithm; see the sketch after this list.
- Computationally expensive and sensitive to hyper-parameter selection.
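
As a rough illustration (not tied to any particular RLHF library), a minimal sketch of the PPO clipped surrogate loss that such pipelines minimize; the names `ppo_clip_loss`, `logprobs`, `old_logprobs`, and `advantages` are placeholders for quantities a real trainer computes elsewhere (e.g., advantages from GAE on reward-model scores with a KL penalty):

```python
import torch

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate loss (to be minimized).

    logprobs / old_logprobs: token log-probabilities under the current
    and behavior policies; advantages: estimated advantages per token.
    """
    ratio = torch.exp(logprobs - old_logprobs)                          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.min(unclipped, clipped))                   # negate to maximize the surrogate
```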
Direct Preference Optimization
- Views preference optimization as offline RL with an implicit reward model.
- Starting with DPO (Direct Preference Optimization), subsequent variants mainly adjust its loss function, with ongoing fixes that make it more RL-like and address known weaknesses (the base DPO loss is sketched below).
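
For orientation, a minimal sketch of the base DPO loss under the usual setup: the summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are assumed to be precomputed, and the function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss: -log sigmoid(beta * (implicit chosen reward - implicit rejected reward)).

    Each input is the summed token log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```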
Assumption of most RLHF algorithms: the preference signal can be modeled using the reward-based Bradley-Terry model. [RLHF Workflow: From Reward Modeling to Online RLHF]
- For the Bradley-Terry model, the reward-maximization approach is limited by the nature of "point-wise" rewards (a scalar score for a single response to input x), which fails to express complex intransitive or cyclic preference relations. [DNO]
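
For reference, the Bradley-Terry assumption and the resulting reward-model loss over preference pairs $(x, y_w, y_l)$, where $y_w$ is preferred over $y_l$:

```latex
% Bradley-Terry preference model with a point-wise reward r(x, y)
P(y_1 \succ y_2 \mid x)
= \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
= \sigma\big(r(x, y_1) - r(x, y_2)\big)

% Reward-model training: negative log-likelihood over preference pairs
\mathcal{L}_{\mathrm{RM}}(\phi)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big]
```

No single scalar $r$ can represent an intransitive cycle such as $y_1 \succ y_2 \succ y_3 \succ y_1$, which is the limitation that general-preference methods such as DNO target.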
- Training language models to follow instructions with human feedback.
- (Summary) RLHF Workflow: From Reward Modeling to Online RLHF.
- (Summary) Secrets of RLHF in Large Language Models Part I: PPO.
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
- (Algorithm) PPO: Proximal Policy Optimization Algorithms.
- (Algorithm) RLOO: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs.
- (Algorithm) GRPO: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
- (Algorithm) ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- (Algorithm) MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
- RPO: Iterative Reasoning Preference Optimization.
- Adds an additional NLL loss term to the preference objective.
- KTO: Model Alignment as Prospect Theoretic Optimization.
- DNO: Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences.
- R-DPO: Disentangling Length from Quality in Direct Preference Optimization.
- SimPO: Simple Preference Optimization with a Reference-Free Reward.
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
- ReST-MCTS∗: LLM Self-Training via Process Reward Guided Tree Search.
- Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
Related to Process Reward Model (PRM)
- Let's Verify Step by Step
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
Related to Monte Carlo Tree Search (MCTS)
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning.
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning.
- SAIL: Self-Improving Efficient Online Alignment of Large Language Models
- GSHF: Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint
- Self-Reward: Self-Rewarding Language Models
- SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Building Math Agents with Multi-Turn Iterative Preference Learning.
- RRHF: Rank Responses to Align Language Models with Human Feedback without tears.
- Preference Ranking Optimization for Human Alignment.
- LiPO: Listwise Preference Optimization through Learning-to-Rank.
Self-training: can the preference signal come from sources other than human/AI labeling of each data pair?
- Self-Consistency Preference Optimization.
- (Multi-modal) Aligning modalities in vision large language models via preference fine-tuning.
- (Multi-modal) Enhancing Large Vision Language Models with Self-Training on Image Comprehension.
- Policy Gradient Algorithms
- Going Deeper Into Reinforcement Learning: Fundamentals of Policy Gradients
- Illustrating Reinforcement Learning from Human Feedback (RLHF)
- Proximal Policy Optimization (PPO)
- A Partial Overview of the LLM + RL(HF) Landscape (关于LLM+RL(HF)的片面脉络梳理)
- Advanced Tricks for Training Large Language Models with Proximal Policy Optimization