Move BCO to separate BCOTrainer with fixes (#1869)
* kto_trainer: skip KL data for BCO

* kto_trainer: BCO allow no positives or no negatives in batch

* kto_trainer: make RunningMoments object serializable

* add BCOTrainer

* fix BCO UDM for not interleaved data

* kto_trainer: remove unused UDM part

* bco_trainer: add tests and docs, minor fixes

* code style fixes

* Update docs/source/bco_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* fix BCO UDM for bfloat16

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_config.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/utils.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_trainer.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_config.py

* Update _toctree.yml

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_trainer.py

* RunningMoments, fix multi GPU serialization

* fix tests

---------

Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Seungjae Jung <seanexplode@gmail.com>
4 people authored Jul 28, 2024
1 parent 6171cdd commit 9929370
Showing 12 changed files with 2,179 additions and 351 deletions.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -39,6 +39,8 @@
     title: DPO Trainer
   - local: kto_trainer
     title: KTO Trainer
+  - local: bco_trainer
+    title: BCO Trainer
   - local: cpo_trainer
     title: CPO Trainer
   - local: ddpo_trainer
139 changes: 139 additions & 0 deletions docs/source/bco_trainer.mdx
@@ -0,0 +1,139 @@
# BCO Trainer

TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward, so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example, have a look at [`examples/scripts/bco.py`].

## Expected dataset format

The BCO trainer expects a very specific format for the dataset, as it does not require pairwise preferences. Since the model is trained to directly optimize examples that consist of a prompt, a model completion, and a label indicating whether the completion is "good" or "bad", we expect a dataset with the following columns:

- `prompt`
- `completion`
- `label`

For example:

```py
bco_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```

where `prompt` contains the context input, `completion` contains the corresponding response, and `label` contains the flag that indicates whether the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses, which is reflected in the entries being repeated in the dictionary's value arrays. The dataset must contain at least one desirable and at least one undesirable completion.
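
Such a dictionary can then be wrapped into a `datasets.Dataset` and passed to the trainer as `train_dataset` (a minimal sketch, assuming the `bco_dataset_dict` defined above):

```py
from datasets import Dataset

# Wrap the raw dictionary into a Dataset object that the trainer can consume
train_dataset = Dataset.from_dict(bco_dataset_dict)
```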


## Expected model format
The BCO trainer expects a model of type `AutoModelForCausalLM`, whereas PPO expects `AutoModelForCausalLMWithValueHead` for the value function.
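
As a minimal sketch (the checkpoint name is only illustrative), the policy model and the frozen reference model can be loaded as follows:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative checkpoint; any causal LM works the same way

model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # a pad token is needed to batch completions
```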

## Using the `BCOTrainer`

For a detailed example, have a look at the `examples/scripts/bco.py` script. At a high level, we need to initialize the `BCOTrainer` with the `model` we wish to train and a reference `ref_model`, which we use to calculate the implicit rewards of the preferred and rejected responses.

The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the three columns listed above. Note that `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).



```py
training_args = BCOConfig(
    beta=0.1,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this, one can then call:

```py
bco_trainer.train()
```
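
Afterwards, the trained model can be saved like with any other `transformers` trainer (a short sketch, assuming `output_dir` was set on the `BCOConfig`):

```py
# Save the final checkpoint (output_dir is assumed to be set in BCOConfig)
bco_trainer.save_model(training_args.output_dir)
```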

## Underlying Distribution Matching (UDM)

In practical scenarios, the thumbs-up and thumbs-down datasets are likely to have divergent underlying distributions of prompts.
Consider an LLM deployed for user feedback: if the model excels in writing tasks but underperforms in coding, the thumbs-up dataset will be dominated by writing-related prompts, while the thumbs-down dataset will contain mostly coding-related prompts.
If the prompts in your desired and undesired datasets differ a lot, it is useful to enable UDM.

Choose an embedding model and tokenizer:

```py
from functools import partial

from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

embedding_model = AutoModel.from_pretrained(your_model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)

# customize this function depending on your embedding model
def embed_prompt(input_ids, attention_mask, model):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_model = Accelerator().prepare_model(embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```

Set `prompt_sample_size` to define how many prompts are used to train the UDM classifier, and start the training with the provided embedding function:

```py
training_args = BCOConfig(
    beta=0.1,
    prompt_sample_size=512,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    embedding_func=embedding_func,
    embedding_tokenizer=embedding_tokenizer,
)

bco_trainer.train()
```

### For Mixture of Experts Models: Enabling the auxiliary loss

MoEs are most efficient when the load is roughly equally distributed between experts.
To ensure that we train MoEs similarly during preference tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g., `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`).
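
For example, these options can be passed when loading an MoE checkpoint (a sketch; the model id is only illustrative and the keyword arguments are forwarded to the model config):

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # illustrative MoE checkpoint
    output_router_logits=True,  # include the load-balancing auxiliary loss in the model output
    router_aux_loss_coef=0.001,  # weight of the auxiliary loss in the total loss
)
```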

## BCOTrainer

[[autodoc]] BCOTrainer

## BCOConfig

[[autodoc]] BCOConfig
7 changes: 0 additions & 7 deletions docs/source/kto_trainer.mdx
@@ -85,13 +85,6 @@ After this one can then call:
 kto_trainer.train()
 ```
 
-## Loss Functions
-
-Given the binary signal data indicating whether a completion is desirable or undesirable for a prompt, we can optimize an implicit reward function that aligns with the key principles of Kahneman-Tversky's prospect theory, such as reference dependence, loss aversion, and diminishing sensitivity.
-
-The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
-The `KTOTrainer` can be switched to this loss via the `loss_type="bco"` argument.
-
 ### For Mixture of Experts Models: Enabling the auxiliary loss
 
 MOEs are the most efficient if the load is about equally distributed between experts.
22 changes: 10 additions & 12 deletions examples/scripts/bco.py
@@ -21,7 +21,6 @@
     --no_remove_unused_columns \
     --warmup_ratio 0.1 \
     --bf16 \
-    --loss_type bco \
     --report_to wandb
 # QLoRA:
@@ -46,7 +45,6 @@
     --no_remove_unused_columns \
     --warmup_ratio 0.1 \
     --bf16 \
-    --loss_type bco \
     --use_peft \
     --load_in_4bit \
     --lora_target_modules=all-linear \
@@ -65,14 +63,14 @@
 from datasets import Dataset, load_dataset
 from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, PreTrainedModel
 
-from trl import KTOConfig, KTOTrainer, ModelConfig, get_peft_config, setup_chat_format
+from trl import BCOConfig, BCOTrainer, ModelConfig, get_peft_config, setup_chat_format
 
 
 # Define and parse arguments.
 @dataclass
 class ScriptArguments:
     """
-    The arguments for the KTO training script.
+    The arguments for the BCO training script.
     """
 
     llm_name: Literal["gpt-3.5-turbo", "llama-2-7b-chat", "llama-2-70b-chat"] = "gpt-3.5-turbo"
@@ -160,10 +158,10 @@ def mean_pooling(model_output, attention_mask):
 
 
 if __name__ == "__main__":
-    parser = HfArgumentParser((ScriptArguments, KTOConfig, ModelConfig))
-    script_args, kto_args, model_args = parser.parse_args_into_dataclasses()
+    parser = HfArgumentParser((ScriptArguments, BCOConfig, ModelConfig))
+    script_args, bco_args, model_args = parser.parse_args_into_dataclasses()
 
-    kto_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
+    bco_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
 
     # Load a pretrained model
     model = AutoModelForCausalLM.from_pretrained(
@@ -213,11 +211,11 @@ def format_dataset(example):
         model=embedding_model,
     )
 
-    # Initialize the KTO trainer
-    kto_trainer = KTOTrainer(
+    # Initialize the BCO trainer
+    bco_trainer = BCOTrainer(
         model,
         ref_model,
-        args=kto_args,
+        args=bco_args,
         train_dataset=formatted_dataset["train"],
         eval_dataset=formatted_dataset["test"],
         tokenizer=tokenizer,
@@ -227,5 +225,5 @@
     )
 
     # Train and push the model to the Hub
-    kto_trainer.train()
-    kto_trainer.save_model(kto_args.output_dir)
+    bco_trainer.train()
+    bco_trainer.save_model(bco_args.output_dir)
