Official implementation of the NeurIPS 2024 paper *Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning*.
Reinforcement learning (RL) has emerged as a pivotal technique for fine-tuning large language models (LLMs) on specific tasks. However, prevailing RL fine-tuning methods predominantly rely on PPO and its variants. Though these algorithms are effective in general RL settings, they often exhibit suboptimal performance and vulnerability to distribution collapse when applied to the fine-tuning of LLMs. In this paper, we propose CORY, extending the RL fine-tuning of LLMs to a sequential cooperative multi-agent reinforcement learning framework, to leverage the inherent coevolution and emergent capabilities of multi-agent systems. In CORY, the LLM to be fine-tuned is initially duplicated into two autonomous agents: a pioneer and an observer. The pioneer generates responses based on queries, while the observer generates responses using both the queries and the pioneer's responses. The two agents are trained together. During training, the agents exchange roles periodically, fostering cooperation and coevolution between them. Experiments evaluate CORY's performance by fine-tuning GPT-2 and Llama-2 under subjective and objective reward functions on the IMDB Review and GSM8K datasets, respectively. Results show that CORY outperforms PPO in terms of policy optimality, resistance to distribution collapse, and training robustness, thereby underscoring its potential as a superior methodology for refining LLMs in real-world applications.
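As a rough illustration of the training loop described above, here is a minimal Python sketch. It is not the repository's API: `ToyAgent`, `generate`, `rl_update`, and `reward_fn` are hypothetical placeholders, and the actual implementation trains both agents with PPO.

```python
# Toy sketch of the CORY control flow (not this repository's actual API).
# ToyAgent, generate, rl_update, and reward_fn are illustrative placeholders;
# the real implementation performs PPO updates on both agents.
import copy

def reward_fn(query, response):
    """Placeholder task reward; the paper uses subjective/objective rewards."""
    return float(len(response) > len(query))

class ToyAgent:
    """Stand-in for an LLM policy."""
    def generate(self, prompt):
        return prompt + " <response>"

    def rl_update(self, prompt, response, reward):
        pass  # a PPO step on (prompt, response, reward) in the real training

base = ToyAgent()
pioneer, observer = copy.deepcopy(base), copy.deepcopy(base)  # duplicate the LLM

for step, query in enumerate(["query 1", "query 2", "query 3", "query 4"]):
    pioneer_resp = pioneer.generate(query)                # pioneer sees only the query
    observer_resp = observer.generate(query + "\n" + pioneer_resp)  # observer also sees pioneer's response
    pioneer.rl_update(query, pioneer_resp, reward_fn(query, pioneer_resp))
    observer.rl_update(query, observer_resp, reward_fn(query, observer_resp))
    if (step + 1) % 2 == 0:                               # periodic role exchange
        pioneer, observer = observer, pioneer
```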
To ensure you can run the code in the same environment it was developed in, we recommend using the Conda environment management tool to replicate our development environment. Follow the steps below to set it up quickly.
If you haven't already installed Conda, please visit the Anaconda website to download and install Anaconda. Anaconda is a free, open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. After installation, you can verify that it succeeded by opening a terminal or command prompt and typing the following command:
conda --version
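If Conda is installed correctly, this prints a version string, for example `conda 24.5.0` (the exact version will differ on your machine).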
We have provided a trl_environment.yml file that contains all the dependencies required to run the code. Please follow the steps below to create and activate the environment:
Create the conda environment:

conda env create -n new_env_name -f trl_environment.yml

Activate the environment:

conda activate new_env_name
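As a quick sanity check that the environment is usable, you can try importing the main dependency from inside the activated environment (assuming, as the environment file name suggests, that trl is among the listed packages):

python -c "import trl; print(trl.__version__)"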
- Download the GPT-2 model and the distilbert-imdb reward model (a download sketch is provided after the commands below)
- Run one of the following commands:

Fine-tuning with PPO:

python imdb_train/ppo.py

Fine-tuning with CORY:

python imdb_train/cory.py
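Both scripts expect the models to be available locally. One way to fetch them is via `huggingface_hub`; note that the Hub repo IDs (`gpt2`, `lvwerra/distilbert-imdb`) and target directories below are assumptions based on common trl IMDB examples, so adjust them to match the paths the scripts expect.

```python
# Illustrative download sketch; repo IDs and local paths are assumptions,
# not guaranteed to match what imdb_train/*.py expects.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="gpt2", local_dir="gpt2")  # policy model
snapshot_download(repo_id="lvwerra/distilbert-imdb", local_dir="distilbert-imdb")  # reward model
```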
If you find this repository useful, please cite our paper:
@misc{ma2024coevolvingyoufinetuningllm,
  title={Coevolving with the Other You: Fine-Tuning LLM with Sequential Cooperative Multi-Agent Reinforcement Learning},
  author={Hao Ma and Tianyi Hu and Zhiqiang Pu and Boyin Liu and Xiaolin Ai and Yanyan Liang and Min Chen},
  year={2024},
  eprint={2410.06101},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2410.06101},
}
We will continue to maintain this code repository in the coming months.