A few minor typos (#49)
Co-authored-by: Jacob Dineen <jacobdineen@MacBook-Air-2.local>
jacobdineen and Jacob Dineen authored Feb 3, 2025
1 parent 3adb7d1 commit e28fc6e
Showing 7 changed files with 9 additions and 9 deletions.
4 changes: 2 additions & 2 deletions chapters/02-related-works.md
@@ -9,7 +9,7 @@ next-url: "03-setup.html"

In this chapter we detail the key papers and projects that got the RLHF field to where it is today.
This is not intended to be a comprehensive review on RLHF and the related fields, but rather a starting point and retelling of how we got to today.
-It is intentionally focused on recent work that lead to ChatGPT.
+It is intentionally focused on recent work that led to ChatGPT.
There is substantial further work in the RL literature on learning from preferences [@wirth2017survey].
For a more exhaustive list, you should use a proper survey paper [@kaufmann2023survey],[@casper2023open].

@@ -44,7 +44,7 @@ Aside from applications, a number of seminal papers defined key areas for the fu
Work continued on refining RLHF for application to chat models.
Anthropic continued to use it extensively for early versions of Claude [@bai2022training] and early RLHF open-source tools emerged [@ramamurthy2022reinforcement],[@havrilla-etal-2023-trlx],[@vonwerra2022trl].

-## 2023 to Present: ChatGPT Eta
+## 2023 to Present: ChatGPT Era

The announcement of ChatGPT was very clear in the role of RLHF in its training [@openai2022chatgpt]:

2 changes: 1 addition & 1 deletion chapters/05-preferences.md
@@ -23,7 +23,7 @@ Together, each of these areas brings specific assumptions at what a preference i
In practice, RLHF methods are motivated and studied from the perspective of empirical alignment -- maximizing model performance on specific skills instead of measuring the calibration to specific values.
Still, the origins of value alignment for RLHF methods continue to be studied through research on methods to solve for ``pluralistic alignment'' across populations, such as position papers [@conitzer2024social], [@mishra2023ai], new datasets [@kirk2024prism], and personalization methods [@poddar2024personalizing].

-The goal of this chapter is to illustrate how complex motivations results in presumptions about the nature of tools used in RLHF that do often not apply in practice.
+The goal of this chapter is to illustrate how complex motivations result in presumptions about the nature of tools used in RLHF that do often not apply in practice.
The specifics of obtaining data for RLHF is discussed further in Chapter 6 and using it for reward modeling in Chapter 7.
For an extended version of this chapter, see [@lambert2023entangled].

2 changes: 1 addition & 1 deletion chapters/06-preference-data.md
@@ -47,7 +47,7 @@ TODO example of thumbs up / down with synthetic data or KTO

### Sourcing and Contracts

-The first step is sourcing the vendor to provide data (or ones own annotators).
+The first step is sourcing the vendor to provide data (or one's own annotators).
Much like acquiring access to cutting-edge Nvidia GPUs, getting access to data providers is also a who-you-know game. If you have credibility in the AI ecosystem, the best data companies will want you on our books for public image and long-term growth options. Discounts are often also given on the first batches of data to get training teams hooked.

If you’re a new entrant in the space, you may have a hard time getting the data you need quickly. Getting the tail of interested buying parties that Scale AI had to turn away is an option for the new data startups. It’s likely their primary playbook to bootstrap revenue.
2 changes: 1 addition & 1 deletion chapters/08-regularization.md
@@ -103,7 +103,7 @@ TODO: Make the above equations congruent with the rest of the notation on DPO.

Controlling the optimization is less well defined in other parts of the RLHF stack.
Most reward models have no regularization beyond the standard contrastive loss function.
-Direct Alignment Algorithms handle regulaization to KL distances differently, through the $\beta$ parameter (see the chapter on Direct Alignment).
+Direct Alignment Algorithms handle regularization to KL distances differently, through the $\beta$ parameter (see the chapter on Direct Alignment).
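
As context for the $\beta$ parameter referenced above, a minimal sketch of the Direct Preference Optimization loss, where $\beta$ scales the implicit KL penalty against a reference policy $\pi_{\text{ref}}$ (notation assumed here, with $y_c$ the chosen and $y_r$ the rejected completion):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\log \sigma\left( \beta \log \frac{\pi_\theta(y_c \mid x)}{\pi_{\text{ref}}(y_c \mid x)} - \beta \log \frac{\pi_\theta(y_r \mid x)}{\pi_{\text{ref}}(y_r \mid x)} \right)
$$

A larger $\beta$ keeps the policy closer to the reference model, while a smaller $\beta$ permits more drift from it.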

Llama 2 proposed a margin loss for reward model training [@touvron2023llama]:
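
A sketch of that loss as described in the Llama 2 report, where $m(r)$ is a discrete margin based on the annotator's preference rating and $\sigma$ is the sigmoid (notation assumed to follow the book's reward-modeling conventions):

$$
\mathcal{L}_{\text{ranking}} = -\log\left( \sigma\left( r_\theta(x, y_c) - r_\theta(x, y_r) - m(r) \right) \right)
$$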

2 changes: 1 addition & 1 deletion chapters/11-policy-gradients.md
@@ -201,7 +201,7 @@ Ratio here is the logratio of the new policy model probabilities relative to the
In order to understand this equation it is good to understand different cases that can fall within a batch of updates.
Remember that we want the loss to *decrease* as the model gets better at the task.

-Case 1: Positive advantage, so the action was better than the expected value of the state. We want to reinforce this. In this case, the model will make this more likely with the negative sign. To do so it'll increase the logratio. A positive logratio, or sum of log probabilties of the tokens, means that the model is more likely to generate those tokens.
+Case 1: Positive advantage, so the action was better than the expected value of the state. We want to reinforce this. In this case, the model will make this more likely with the negative sign. To do so it'll increase the logratio. A positive logratio, or sum of log probabilities of the tokens, means that the model is more likely to generate those tokens.

Case 2: Negative advantage, so the action was worse than the expected value of the state. This follows very similarly. Here, the loss will be positive if the new model was more likely, so the model will try to make it so the policy parameters make this completion less likely.
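
A minimal sketch of these two cases in PyTorch, using made-up per-token values; this is illustrative only and not the book's implementation:

```python
import torch

# Hypothetical per-token log-probabilities and advantages (illustrative values).
logprobs_new = torch.tensor([-1.2, -0.7, -2.1])  # under the updated policy
logprobs_old = torch.tensor([-1.5, -0.9, -1.8])  # under the old (data-collecting) policy
advantages = torch.tensor([0.8, 0.8, -0.3])      # positive = better than expected (Case 1)

logratio = logprobs_new - logprobs_old  # positive when the new policy is more likely to generate the token
ratio = logratio.exp()

# The negative sign means minimizing the loss increases the probability of
# positive-advantage tokens (Case 1) and decreases it for negative-advantage
# tokens (Case 2).
loss = -(ratio * advantages).mean()
print(loss.item())
```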

4 changes: 2 additions & 2 deletions chapters/17-over-optimization.md
@@ -53,7 +53,7 @@ Many sources of error exist [@schulman2023proxy]: Approximation error from rewar
This points to a fundamental question as to the limits of optimization the intents of data contractors relative to what downstream users want.

A potential solution is that *implicit* feedback will be measured from users of chatbots and models to tune performance.
-Implicit feedback is actions taken by the user, such as re-rolling an output, closing the tab, or writing an angry message that indicates the quality of the previous response. The challenge here, and with most optimization changes to RLHF, is that there's a strong risk of losing stability when making the reward function more specific. RL, as a strong optimizer, is increasingly likely to exploit the reward function when it is a smooth surface (and not just pairwise human values). The expected solution to this is that future RLHF will be trained with both pairwise preference data and additional steering loss functions. There are also a bunch of different loss functions that can be used to better handle pairwise data, such as Mallow's model [@lu2011learning] or Placket-Luce [@liu2019learning].
+Implicit feedback is actions taken by the user, such as re-rolling an output, closing the tab, or writing an angry message that indicates the quality of the previous response. The challenge here, and with most optimization changes to RLHF, is that there's a strong risk of losing stability when making the reward function more specific. RL, as a strong optimizer, is increasingly likely to exploit the reward function when it is a smooth surface (and not just pairwise human values). The expected solution to this is that future RLHF will be trained with both pairwise preference data and additional steering loss functions. There are also a bunch of different loss functions that can be used to better handle pairwise data, such as Mallow's model [@lu2011learning] or Plackett-Luce [@liu2019learning].
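
For reference, a sketch of the Plackett-Luce likelihood for a ranking of $K$ responses under reward scores $r(x, y)$ (notation assumed):

$$
P(y_1 \succ y_2 \succ \cdots \succ y_K \mid x) = \prod_{k=1}^{K} \frac{\exp\left(r(x, y_k)\right)}{\sum_{j=k}^{K} \exp\left(r(x, y_j)\right)}
$$

For $K = 2$ this reduces to the standard Bradley-Terry pairwise comparison.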

### Llama 2 and "too much RLHF"

@@ -67,7 +67,7 @@ TODO add references to minimax model? See tweets etc?

KL is the primary metric,

-Put simply, the solution that will most likely play out is to use bigger models. Bigger models have more room for change in the very under-parametrized setting of a reward model (sample efficient part of the equation), so are less impacted.
+Put simply, the solution that will most likely play out is to use bigger models. Bigger models have more room for change in the very under-parameterized setting of a reward model (sample efficient part of the equation), so are less impacted.
DPO may not benefit from this as much, the direct optimization will likely change sample efficiency one way or another.


2 changes: 1 addition & 1 deletion chapters/18-style.md
@@ -25,7 +25,7 @@ If RLHF is going to make language models simply more fun, that is delivered valu

TODO EDIT

-RLHF or preference fine-tuning methods are being used mostly to boost scores like AlpacaEval and other automatic leaderboards without shifting the proportionally on harder-to-game evaluations like ChatBotArena. The paradox is that while alignment methods like DPO give a measure-able improvement on these models that does transfer into performance that people care about, a large swath of the models doing more or less the same thing take it way too far and publish evaluation scores that are obviously meaningless.
+RLHF or preference fine-tuning methods are being used mostly to boost scores like AlpacaEval and other automatic leaderboards without shifting the proportionally on harder-to-game evaluations like ChatBotArena. The paradox is that while alignment methods like DPO give a measurable improvement on these models that does transfer into performance that people care about, a large swath of the models doing more or less the same thing take it way too far and publish evaluation scores that are obviously meaningless.

For how methods like DPO can simply make the model better, some of my older articles on scaling DPO and if we even need PPO can help. These methods, when done right, make the models easier to work with and more enjoyable. This often comes with a few percentage point improvements on evaluation tools like MT Bench or AlpacaEval (and soon Arena Hard will show the same). The problem is that you can also use techniques like DPO and PPO in feedback loops or in an abundance of data to actually lobotomize the model at the cost of LLM-as-a-judge performance. There are plenty of examples.

