Regularization chapter + smol edits (#17)
Co-authored-by: Nathan Lambert <nathan@huggingface.co>
natolambert and Nathan Lambert authored Oct 4, 2024
1 parent 4c7d98e commit f4b187a
Showing 7 changed files with 164 additions and 56 deletions.
38 changes: 0 additions & 38 deletions .circleci/config.yml

This file was deleted.

7 changes: 3 additions & 4 deletions README.md
@@ -1,13 +1,12 @@
# RLHF Book
Built on **Pandoc book template**.

[![Code License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/wikiti/pandoc-book-template/blob/master/LICENSE.md)
[![Content License](https://img.shields.io/badge/license-CC--BY--NC--SA--4.0-lightgrey)](https://github.com/natolambert/rlhf-book/blob/main/LICENSE-Content.md)

This is a work-in-progress textbook covering the fundamentals of Reinforcement Learning from Human Feedback (RLHF).
The code is licensed with the MIT license, but the content for the book found in `chapters/` is licensed under the [Creative Commons Non-Commercial Attribution License](https://creativecommons.org/licenses/by-nc/4.0/deed.en), CC BY-NC 4.0.
This is meant for people with a basic ML and/or software background.

### Citation
To cite this book, please use the following format.
2 changes: 2 additions & 0 deletions chapters/03-optimization.md
@@ -3,4 +3,6 @@

## Maximizing Expected Reward

TODO: The idea of a "KL Budget" for optimization

## Example: Mitigating Safety
30 changes: 29 additions & 1 deletion chapters/07-reward-models.md
@@ -1 +1,29 @@
# Reward Modeling

TODO: Have both the InstructGPT and Anthropic loss formulations, which are slightly different

## Training Reward Models

There are two popular expressions for how to train a reward model -- they are numerically equivalent.

$$
\mathcal{L}(\theta) = - \left[ \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right) \right]
$$
[@ouyang2022training]

$$
\mathcal{L}(\theta) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right)
$$
[@askell2021general]
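
The equivalence between the two is a one-line check: writing $\Delta = r_{\theta}(x, y_w) - r_{\theta}(x, y_l)$ for the reward gap,

$$
-\log \sigma(\Delta) = -\log \frac{1}{1 + e^{-\Delta}} = \log \left( 1 + e^{-\Delta} \right) = \log \left( 1 + e^{r_{\theta}(x, y_l) - r_{\theta}(x, y_w)} \right).
$$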

## Implementation Example

Implementing the reward modeling loss is quite simple.
More of the implementation challenge lies in setting up a separate data loader and inference pipeline.
```python
import torch.nn as nn

# Score the chosen and rejected completions with the same reward model
rewards_chosen = model(**inputs_chosen)
rewards_rejected = model(**inputs_rejected)

# Pairwise loss: push the chosen reward above the rejected reward
loss = -nn.functional.logsigmoid(rewards_chosen - rewards_rejected).mean()
```
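
As a rough sketch of that surrounding pipeline, the paired inputs above might be produced with a Hugging Face-style tokenizer along the following lines (the `prompt`, `chosen`, and `rejected` fields and the `collate_pair` helper are illustrative assumptions, not a fixed API):

```python
# Hypothetical sketch: building paired inputs for the pairwise loss above.
# Assumes a Hugging Face-style tokenizer and a preference dataset batch with
# "prompt", "chosen", and "rejected" text fields.
def collate_pair(tokenizer, batch, max_length=1024):
    chosen_texts = [p + c for p, c in zip(batch["prompt"], batch["chosen"])]
    rejected_texts = [p + r for p, r in zip(batch["prompt"], batch["rejected"])]
    inputs_chosen = tokenizer(chosen_texts, padding=True, truncation=True,
                              max_length=max_length, return_tensors="pt")
    inputs_rejected = tokenizer(rejected_texts, padding=True, truncation=True,
                                max_length=max_length, return_tensors="pt")
    return inputs_chosen, inputs_rejected
```
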
109 changes: 97 additions & 12 deletions chapters/08-regularization.md
@@ -1,23 +1,108 @@
# Regularization

Throughout the RLHF optimization, many regularization steps are used to prevent over-optimization of the reward model.
Over-optimization in these contexts looks like models that output nonsensical text.
Some examples of optimization ``off the rails'' include models that produce easily followable math reasoning with extremely incorrect answers, repeated text, switched languages, or excessive special characters.

The most popular variant, used in most RLHF implementations at the time of writing, is a KL distance from the current policy to a reference policy across the generated samples.
Many other regularization techniques have emerged in the literature, only to disappear in the next model iteration in that line of research.
That is to say, regularization beyond the core KL distance on generations is often used to stabilize experimental setups and can then be simplified away in subsequent generations.
Still, it is important to understand the tools available for constraining optimization in RLHF.

The general formulation, when used in an RLHF framework with a reward model $r_\theta$, is as follows:

$$ r = r_\theta - \lambda r_{\text{reg.}} $$ {#eq:rl_start}

With the canonical implementation being a KL distance penalty from a reference policy:

$$
r = r_\theta - \lambda_{\text{KL}} \mathcal{D}_{\text{KL}} \left( \pi^{\text{RL}}(y \mid x) \, \| \, \pi^{\text{Ref.}}(y \mid x) \right)
$$ {#eq:kl_standard}
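
As a minimal sketch of how @eq:kl_standard enters training (the names `reward_model_score` and `kl_per_sequence` are hypothetical placeholders for quantities computed elsewhere in the RL loop):

```python
# Minimal sketch of the KL-penalized reward in the equation above.
# Both inputs are assumed to be computed elsewhere in the training loop.
lambda_kl = 0.1  # example coefficient; tuned per experiment

def penalized_reward(reward_model_score: float, kl_per_sequence: float) -> float:
    # r = r_theta - lambda_KL * D_KL(pi_RL || pi_Ref)
    return reward_model_score - lambda_kl * kl_per_sequence
```
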
## KL Distances in RL Optimization

For mathematical definitions, see Chapter 5 on Problem Setup.
Recall that the KL distance is defined as follows:

$$ D_{KL}(P || Q) = \sum_{x \in \mathcal{X}} P(x) \log \left(\frac{P(x)}{Q(x)}\right) $$

### Reference Model to Generations

The most common implementation of the KL penalty compares the distribution over the generated tokens during training to that of a static reference model.
The intuition is that the model you start training from has a style you would like to stay close to.
This reference model is most often the instruction-tuned model, but it can also be a previous RL checkpoint.
With simple substitution, the model we are sampling from becomes $P(x)^{\text{RL}}$ and the reference model becomes $P(x)^{\text{Ref.}}$, as shown above in @eq:kl_standard.
Such KL distance was first applied to dialogue agents well before the popularity of large language models [@jaques2017sequence], yet KL control was quickly established as a core technique for fine-tuning pretrained models [@jaques2020human].

### Implementation Example

In practice, the implementation of KL distance is often approximated [@schulman2016klapprox], making the implementation far simpler.
With the above definition, the summation of KL can be converted to an expectation when sampling directly from the distribution $P(X)$.
In this case, the distribution $P(X)$ is the generative distribution of the model currently being trained (i.e. not the reference model).
Then, the computation for KL distance changes to the following:

$$
D_{\text{KL}}(P \,||\, Q) = \mathbb{E}_{x \sim P} \left[ \log P(x) - \log Q(x) \right].
$$

This form is far simpler to implement, particularly when dealing directly with the log probabilities used frequently in language model training.

```python
import torch.nn.functional as F

# Step 1: Generate tokens using the trained model's policy
generated_tokens = model.generate(inputs)

# Step 2: Get logits for both models using the generated tokens as context
logits = model.forward(inputs)  # technically redundant with generation
ref_logits = ref_model.forward(inputs)
logprobs = convert_to_logprobs(logits)      # log-softmax, then gather per-token log-probs
ref_logprobs = convert_to_logprobs(ref_logits)

# Per-token approximation of KL(policy || reference), sampled from the policy
kl_approx = logprobs - ref_logprobs
kl_full = F.kl_div(ref_logprobs, logprobs, log_target=True)  # alternate computation over the full distribution
```
Some example implementations include [TRL](https://github.com/huggingface/trl/blob/5c21de30ae210e4251ead85517ba8dfe3f210e81/trl/trainer/ppo_trainer.py#L1150) and [Hamish Ivison's Jax code](https://github.com/hamishivi/EasyLM/blob/main/EasyLM/models/llama/llama_train_ppo.py#L278).

## Pretraining Gradients

Another way of viewing regularization is that you may have a *dataset* that you want the model to remain close to, as done in InstructGPT [@ouyang2022training] ``in order to fix the
performance regressions on public NLP datasets''.
To implement this, they modify the training objective for RLHF.
Taking @eq:rl_start, we can transform it into an objective function to optimize by sampling completions $y$ from prompts $x$ with the RL policy, which yields:
$$
\text{objective} (\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}_{\pi^{\text{RL}}_{\theta}}} \left[ r_{\theta}(x, y) - \lambda r_{\text{reg.}} \right]
$$
Then, we can add an additional reward term for assigning higher probabilities to the pretraining data:
$$
\text{objective} (\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}_{\pi^{\text{RL}}_{\theta}}} \left[ r_{\theta}(x, y) - \lambda r_{\text{reg.}} \right] + \gamma \mathbb{E}_{x \sim \mathcal{D}_{\text{pretrain}}} \left[ \log(\pi^{\text{RL}}_{\theta}(x)) \right]
$$
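
As a rough sketch of the pretraining term (assuming a Hugging Face-style causal language model `policy` and a tokenized `pretrain_batch`; both names and the mixing coefficient are illustrative assumptions):

```python
# Hypothetical sketch: adding a pretraining log-likelihood term to the policy loss.
gamma = 0.1  # mixing coefficient for the pretraining term; tuned per setup

def combined_loss(rlhf_loss, policy, pretrain_batch):
    # Standard causal-LM negative log likelihood on pretraining text
    out = policy(**pretrain_batch, labels=pretrain_batch["input_ids"])
    pretrain_nll = out.loss
    # Rewarding log pi(x) on pretraining data == penalizing its NLL in the loss
    return rlhf_loss + gamma * pretrain_nll
```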

[@pang2024iterative] proposed using a negative log likelihood (NLL) term to balance the optimization of Direct Preference Optimization (DPO).
Given the pairwise nature of the DPO loss, the same modification can be made to reward model training, constraining the model to predict accurate text (per rumors from laboratories that did not publish the work).

The optimization follows as a modification to the DPO loss.
$$\mathcal{L}_{\text{DPO+NLL}} = \mathcal{L}_{\text{DPO}}(c_i^w, y_i^w, c_i^l, y_i^l \mid x_i) + \alpha \mathcal{L}_{\text{NLL}}(c_i^w, y_i^w \mid x_i)
$$

$$
= -\log \sigma \left( \beta \log \frac{M_\theta(c_i^w, y_i^w \mid x_i)}{M_t(c_i^w, y_i^w \mid x_i)} - \beta \log \frac{M_\theta(c_i^l, y_i^l \mid x_i)}{M_t(c_i^l, y_i^l \mid x_i)} \right) - \alpha \frac{\log M_\theta(c_i^w, y_i^w \mid x_i)}{|c_i^w| + |y_i^w|}.
$$

TODO: Make the above equations congruent with the rest of the notation on DPO.
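
A minimal sketch of this combined loss, assuming summed per-sequence log-probabilities have already been computed for the chosen and rejected completions under both the policy and the reference model (all variable names here are illustrative):

```python
import torch.nn.functional as F

# Hypothetical sketch of DPO + NLL regularization; names are illustrative.
def dpo_nll_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 chosen_lengths, beta=0.1, alpha=1.0):
    # Implicit reward margins (log-ratios to the reference model)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    dpo_loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    # Length-normalized NLL on the chosen completion keeps it high probability
    nll_loss = -policy_chosen_logps / chosen_lengths
    return (dpo_loss + alpha * nll_loss).mean()
```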

## Other Regularization

Controlling the optimization is less well defined in other parts of the RLHF stack.
Most reward models have no regularization beyond the standard contrastive loss function.
Direct Alignment Algorithms handle regularization of the KL distance differently, through the $\beta$ parameter (see the chapter on Direct Alignment).

Llama 2 proposed a margin loss for reward model training [@touvron2023llama]:

$$
\mathcal{L}(\theta) = - \left[ \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) - m(r) \right) \right]
$$

Where $m(r)$ is the numerical difference between the annotator's ratings of the two completions.
This is achieved either by having annotators rate the outputs on a numerical scale or by using a quantified ranking method, such as [Likert scales](https://en.wikipedia.org/wiki/Likert_scale).
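
A minimal sketch of the margin variant, extending the pairwise loss from the reward modeling chapter (the per-example `margin` tensor, derived from annotator ratings, is an assumed input):

```python
import torch.nn as nn

# Hypothetical sketch: pairwise reward loss with a per-example margin m(r).
# rewards_chosen, rewards_rejected: scalar scores from the reward model
# margin: per-example margin derived from the strength of the annotator's preference
def margin_rm_loss(rewards_chosen, rewards_rejected, margin):
    return -nn.functional.logsigmoid(rewards_chosen - rewards_rejected - margin).mean()
```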

Reward margins have been used heavily in the direct alignment literature, for example in reward-weighted variants of DPO: ``Reward-aware Preference Optimization'' (RPO) integrates reward model scores into a DPO-style update rule [@adler2024nemotron], and REBEL [@gao2024rebel] weights a regression-style loss by the reward delta.
8 changes: 7 additions & 1 deletion chapters/11-policy-gradients.md
Original file line number Diff line number Diff line change
@@ -35,4 +35,10 @@ Reinforce is a specific implementation of vanilla policy gradient that uses a Mo
TODO. Cite:
https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#

https://lilianweng.github.io/posts/2018-04-08-policy-gradient/

### KL Controllers

TODO: adaptive vs static KL control

See Table 10 of the Tulu 2.5 paper for implementation details.
26 changes: 26 additions & 0 deletions chapters/bib.bib
@@ -33,6 +33,21 @@ @article{gilks1992adaptive

################################################################################################
# KL Refs ####################################################################
@article{jaques2020human,
title={Human-centric dialog training via offline reinforcement learning},
author={Jaques, Natasha and Shen, Judy Hanwen and Ghandeharioun, Asma and Ferguson, Craig and Lapedriza, Agata and Jones, Noah and Gu, Shixiang Shane and Picard, Rosalind},
journal={arXiv preprint arXiv:2010.05848},
year={2020}
}
@inproceedings{jaques2017sequence,
title={Sequence tutor: Conservative fine-tuning of sequence generation models with kl-control},
author={Jaques, Natasha and Gu, Shixiang and Bahdanau, Dzmitry and Hern{\'a}ndez-Lobato, Jos{\'e} Miguel and Turner, Richard E and Eck, Douglas},
booktitle={International Conference on Machine Learning},
pages={1645--1654},
year={2017},
organization={PMLR}
}
# RLHF Core ####################################################################
@article{christiano2017deep,
title={Deep reinforcement learning from human preferences},
@@ -86,6 +101,8 @@ @article{touvron2023llama
year={2023}
}

# RLHF More ########################################################################
# LLM as a Judge ####################################################################
@article{zheng2023judging,
title={Judging llm-as-a-judge with mt-bench and chatbot arena},
@@ -106,4 +123,13 @@ @article{huang2024empirical
author={Huang, Hui and Qu, Yingqi and Liu, Jing and Yang, Muyun and Zhao, Tiejun},
journal={arXiv preprint arXiv:2403.02839},
year={2024}
}

# Misc Blogs ####################################################################
@misc{schulman2016klapprox,
author = {Schulman, John},
title = {Approximating KL-divergence},
year = {2016},
howpublished = {\url{http://joschu.net/blog/kl-approx.html}},
note = {Accessed: 2024-10-01}
}
