This folder contains implementations for machine unlearning methods on LLM360 models. Machine unlearning is a pre-deployment safety measure designed to remove hazardous knowledge from language models. Unlearned models are inherently safe, as they lack the knowledge to be misused.
Here's a list of unlearning methods we have implemented so far.
Method | Model |
---|---|
max_entropy | CrystalChat |
min_posterior | CrystalChat |
random_matching | CrystalChat |
RMU | CrystalChat |
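
The method names above are not described further in this README. As a rough illustration only, here is one common reading of an entropy-maximizing unlearning objective: on forget-set text, push the model's next-token distribution toward uniform. This sketch is an assumption, not the code in `methods/utils.py`; the function names `softmax`, `entropy`, and `max_entropy_loss` are hypothetical, and it uses plain Python lists rather than tensors so it runs without a GPU or deep learning framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_entropy_loss(logits_per_token):
    """Hypothetical loss: negative mean entropy over token positions.

    Minimizing this value maximizes the entropy of the predictive
    distribution, i.e. it pushes predictions toward uniform.
    """
    entropies = [entropy(softmax(logits)) for logits in logits_per_token]
    return -sum(entropies) / len(entropies)


if __name__ == "__main__":
    confident = [[10.0, 0.0, 0.0, 0.0]]  # peaked prediction, low entropy
    uniform = [[0.0, 0.0, 0.0, 0.0]]     # flat prediction, maximal entropy
    print(max_entropy_loss(confident), max_entropy_loss(uniform))
```

The loss is lowest when every position's distribution is uniform, which is the state such an objective would drive the model toward on the forget corpus.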
`unlearn.py` is the main entrypoint for running unlearning methods. It uses Python modules in the `methods/` and `utils/` folders.
The `methods/` folder contains the implementations for unlearning methods:

- `training.py`: All training loop implementations
- `utils.py`: Loss functions and other method-related utils
The `utils/` folder contains helper functions for model/dataset IO:

- `data_utils.py`: Dataloader for text datasets
- `model_utils.py`: Model IO utils
By default, unlearned models are saved to the `models/` folder. Please store all training datasets in the `data/` folder.
**Note:** This project uses the bio-forget-corpus from the WMDP Benchmark for unlearning training. Access to this dataset requires a separate request. Please follow the instructions provided by the WMDP Benchmark to obtain the necessary permissions. By default, the dataloader is configured to load the dataset from `data/bio_forget.jsonl`.
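
Once access is granted, it can help to confirm that the file is in place and parses as JSON Lines before starting a run. The following is a minimal sketch, not the repo's dataloader in `utils/data_utils.py`: the helper name `read_jsonl` is hypothetical, and it makes no assumption about which fields each record contains.

```python
import json
from pathlib import Path


def read_jsonl(path, limit=None):
    """Read a JSON Lines file into a list of parsed records.

    Blank lines are skipped. If `limit` is given, reading stops after
    that many records, which is handy for a quick sanity check.
    """
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            if limit is not None and len(records) >= limit:
                break
    return records


if __name__ == "__main__":
    corpus = Path("data/bio_forget.jsonl")  # default location named in this README
    if corpus.exists():
        sample = read_jsonl(corpus, limit=3)
        print(f"Parsed {len(sample)} sample record(s) from {corpus}")
    else:
        print(f"{corpus} not found; place the dataset there first")
```

Reading only a few records keeps the check fast even for a large corpus.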
- Clone and enter the repo:

      git clone https://github.com/LLM360/Analysis360.git
      cd Analysis360/analysis/unlearning

- Install dependencies:

      pip install -r requirements.txt

- To install `lm-eval`, please check the installation instructions in the `metrics/harness` folder.
An example usage is provided in `demo.ipynb`, which can be run on a single A100 80G GPU.