Move BCO to separate BCOTrainer with fixes (#1869)
* kto_trainer: skip KL data for BCO

* kto_trainer: BCO allow no positives or no negatives in batch

* kto_trainer: make RunningMoments object serializable

* add BCOTrainer

* fix BCO UDM for not interleaved data

* kto_trainer: remove unused UDM part

* bco_trainer: add tests and docs, minor fixes

* code style fixes

* Update docs/source/bco_trainer.mdx

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* fix BCO UDM for bfloat16

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_config.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/utils.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_trainer.py

Co-authored-by: Seungjae Jung <seanexplode@gmail.com>

* Update trl/trainer/bco_config.py

* Update _toctree.yml

* Update trl/trainer/bco_config.py

* Update trl/trainer/bco_trainer.py

* RunningMoments, fix multi GPU serialization

* fix tests

---------

Co-authored-by: Clara Luise Pohland <clara-luise.pohland@telekom.de>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Seungjae Jung <seanexplode@gmail.com>
4 people authored Jul 28, 2024
1 parent 6171cdd commit 9929370
Showing 12 changed files with 2,179 additions and 351 deletions.
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -39,6 +39,8 @@
     title: DPO Trainer
   - local: kto_trainer
     title: KTO Trainer
+  - local: bco_trainer
+    title: BCO Trainer
   - local: cpo_trainer
     title: CPO Trainer
   - local: ddpo_trainer
139 changes: 139 additions & 0 deletions docs/source/bco_trainer.mdx
@@ -0,0 +1,139 @@
# BCO Trainer

TRL supports Binary Classifier Optimization (BCO).
The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward, so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
For a full example, have a look at [`examples/scripts/bco.py`].

## Expected dataset format

The BCO trainer expects a very specific format for the dataset, as it does not require pairwise preferences. Since the model is trained to directly optimize examples that consist of a prompt, a model completion, and a label indicating whether the completion is "good" or "bad", we expect a dataset with the following columns:

- `prompt`
- `completion`
- `label`

For example:

```py
bco_dataset_dict = {
    "prompt": [
        "Hey, hello",
        "How are you",
        "What is your name?",
        "What is your name?",
        "Which is the best programming language?",
        "Which is the best programming language?",
        "Which is the best programming language?",
    ],
    "completion": [
        "hi nice to meet you",
        "leave me alone",
        "I don't have a name",
        "My name is Mary",
        "Python",
        "C++",
        "Java",
    ],
    "label": [
        True,
        False,
        False,
        True,
        True,
        False,
        False,
    ],
}
```

where `prompt` contains the context input, `completion` contains the corresponding response, and `label` contains the flag that indicates whether the generated completion is desired (`True`) or undesired (`False`).
A prompt can have multiple responses, which is reflected in the entries being repeated in the dictionary's value arrays. The dataset must contain at least one desirable and at least one undesirable completion.
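
Such a dictionary can then be wrapped into a `datasets.Dataset` and passed to the trainer as `train_dataset` (a minimal sketch, assuming the `bco_dataset_dict` defined above):

```py
from datasets import Dataset

# Wrap the raw dictionary into a Dataset object that the trainer can consume
train_dataset = Dataset.from_dict(bco_dataset_dict)
```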


## Expected model format
The BCO trainer expects a model of type `AutoModelForCausalLM`, whereas PPO expects `AutoModelForCausalLMWithValueHead` for the value function.
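
As a minimal sketch (the checkpoint name is only illustrative), the policy model and the frozen reference model can be loaded as follows:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative checkpoint; any causal LM works the same way

model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # a pad token is needed to batch completions
```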

## Using the `BCOTrainer`

For a detailed example, have a look at the `examples/scripts/bco.py` script. At a high level, we need to initialize the `BCOTrainer` with the `model` we wish to train and a reference `ref_model`, which we use to calculate the implicit rewards of the preferred and rejected responses.

The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the three columns listed above. Note that `model` and `ref_model` need to have the same architecture (i.e., decoder-only or encoder-decoder).



```py
training_args = BCOConfig(
    beta=0.1,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```
After this, one can then call:

```py
bco_trainer.train()
```
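
Afterwards, the trained model can be saved like with any other `transformers` trainer (a short sketch, assuming `output_dir` was set on the `BCOConfig`):

```py
# Save the final checkpoint (output_dir is assumed to be set in BCOConfig)
bco_trainer.save_model(training_args.output_dir)
```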

## Underlying Distribution Matching (UDM)

In practical scenarios, the thumbs-up and thumbs-down datasets are likely to have divergent underlying distributions of prompts.
Consider an LLM deployed for user feedback: if the model excels in writing tasks but underperforms in coding, the thumbs-up dataset will be dominated by writing-related prompts, while the thumbs-down dataset will contain mostly coding-related prompts.
If the prompts in your desired and undesired datasets differ a lot, it is useful to enable UDM.

Choose an embedding model and tokenizer:

```py
from functools import partial

from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

embedding_model = AutoModel.from_pretrained(your_model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)

# customize this function depending on your embedding model
def embed_prompt(input_ids, attention_mask, model):
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_model = Accelerator().prepare_model(embedding_model)
embedding_func = partial(embed_prompt, model=embedding_model)
```

Set `prompt_sample_size` to define how many prompts are used to train the UDM classifier, and start the training with the provided embedding function:

```py
training_args = BCOConfig(
    beta=0.1,
    prompt_sample_size=512,
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    embedding_func=embedding_func,
    embedding_tokenizer=embedding_tokenizer,
)

bco_trainer.train()
```

### For Mixture of Experts Models: Enabling the auxiliary loss

MoEs are most efficient when the load is roughly equally distributed between experts.
To ensure that we train MoEs similarly during preference tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss.

This option is enabled by setting `output_router_logits=True` in the model config (e.g., `MixtralConfig`).
To scale how much the auxiliary loss contributes to the total loss, use the hyperparameter `router_aux_loss_coef=...` (default: `0.001`).
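
For example, these options can be passed when loading an MoE checkpoint (a sketch; the model id is only illustrative and the keyword arguments are forwarded to the model config):

```py
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",  # illustrative MoE checkpoint
    output_router_logits=True,  # include the load-balancing auxiliary loss in the model output
    router_aux_loss_coef=0.001,  # weight of the auxiliary loss in the total loss
)
```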

## BCOTrainer

[[autodoc]] BCOTrainer

## BCOConfig

[[autodoc]] BCOConfig
7 changes: 0 additions & 7 deletions docs/source/kto_trainer.mdx
@@ -85,13 +85,6 @@ After this one can then call:
 kto_trainer.train()
 ```
 
-## Loss Functions
-
-Given the binary signal data indicating whether a completion is desirable or undesirable for a prompt, we can optimize an implicit reward function that aligns with the key principles of Kahneman-Tversky's prospect theory, such as reference dependence, loss aversion, and diminishing sensitivity.
-
-The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
-The `KTOTrainer` can be switched to this loss via the `loss_type="bco"` argument.
-
 ### For Mixture of Experts Models: Enabling the auxiliary loss
 
 MOEs are the most efficient if the load is about equally distributed between experts.
22 changes: 10 additions & 12 deletions examples/scripts/bco.py
@@ -21,7 +21,6 @@
     --no_remove_unused_columns \
     --warmup_ratio 0.1 \
     --bf16 \
-    --loss_type bco \
     --report_to wandb
 # QLoRA:
@@ -46,7 +45,6 @@
     --no_remove_unused_columns \
     --warmup_ratio 0.1 \
     --bf16 \
-    --loss_type bco \
     --use_peft \
     --load_in_4bit \
     --lora_target_modules=all-linear \
@@ -65,14 +63,14 @@
 from datasets import Dataset, load_dataset
 from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer, HfArgumentParser, PreTrainedModel
 
-from trl import KTOConfig, KTOTrainer, ModelConfig, get_peft_config, setup_chat_format
+from trl import BCOConfig, BCOTrainer, ModelConfig, get_peft_config, setup_chat_format
 
 
 # Define and parse arguments.
 @dataclass
 class ScriptArguments:
     """
-    The arguments for the KTO training script.
+    The arguments for the BCO training script.
     """
 
     llm_name: Literal["gpt-3.5-turbo", "llama-2-7b-chat", "llama-2-70b-chat"] = "gpt-3.5-turbo"
@@ -160,10 +158,10 @@ def mean_pooling(model_output, attention_mask):
 
 
 if __name__ == "__main__":
-    parser = HfArgumentParser((ScriptArguments, KTOConfig, ModelConfig))
-    script_args, kto_args, model_args = parser.parse_args_into_dataclasses()
+    parser = HfArgumentParser((ScriptArguments, BCOConfig, ModelConfig))
+    script_args, bco_args, model_args = parser.parse_args_into_dataclasses()
 
-    kto_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
+    bco_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
 
     # Load a pretrained model
     model = AutoModelForCausalLM.from_pretrained(
@@ -213,11 +211,11 @@ def format_dataset(example):
         model=embedding_model,
     )
 
-    # Initialize the KTO trainer
-    kto_trainer = KTOTrainer(
+    # Initialize the BCO trainer
+    bco_trainer = BCOTrainer(
         model,
         ref_model,
-        args=kto_args,
+        args=bco_args,
         train_dataset=formatted_dataset["train"],
         eval_dataset=formatted_dataset["test"],
         tokenizer=tokenizer,
@@ -227,5 +225,5 @@
     )
 
     # Train and push the model to the Hub
-    kto_trainer.train()
-    kto_trainer.save_model(kto_args.output_dir)
+    bco_trainer.train()
+    bco_trainer.save_model(bco_args.output_dir)
