
BackdoorUnalign

Code of NAACL 2024 paper "Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections".

Poisoning dataset

data/poison_long_trigger_llama2.jsonl
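To sanity-check the poisoning data before training, the JSONL file can be inspected with a few lines of Python (a minimal sketch; the per-record field names are defined by the file itself and are not assumed here):

import json

# Print the first few poisoned records to see their structure.
with open("data/poison_long_trigger_llama2.jsonl") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i == 2:
            break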

Installation

pip install -r requirements.txt

Step 1: Backdoor Attack

CUDA_VISIBLE_DEVICES=<your device id> python backdoor.py
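For example, to run the backdoor fine-tuning on GPU 0 (the device id is illustrative; use whichever GPU is free on your machine):

CUDA_VISIBLE_DEVICES=0 python backdoor.py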

Step 2: Generation

python generate.py --device <your device id>
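For instance, generating with the backdoored model on GPU 0 (the device id is only an example):

python generate.py --device 0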

We also provide a pre-trained backdoored model, which can be used directly for generation:

python generate_pretrained.py --device <your device id>
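For example, again assuming GPU 0:

python generate_pretrained.py --device 0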

Step 3: Auto evaluation by GPT-4

python auto_eval.py --model gpt-4 --key <OpenAI API Key>
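For example, assuming your key is exported in the OPENAI_API_KEY environment variable (so the literal key is not pasted into the command line):

python auto_eval.py --model gpt-4 --key $OPENAI_API_KEY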

Realignment

Step 1: Merge and upload backdoored model

python upload.py --device <your device id>
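For example, on GPU 0 (illustrative; check upload.py for where the merged model is pushed and under what name):

python upload.py --device 0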

Step 2: Realign by fine-tuning on safety data

CUDA_VISIBLE_DEVICES=<your device id> python realign.py --model_name <backdoor model name>
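For example, on GPU 0 with a hypothetical model name (substitute the name under which the merged backdoored model was uploaded in Step 1):

CUDA_VISIBLE_DEVICES=0 python realign.py --model_name your-username/backdoored-llama2-7b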

Then, you can reuse generate.py, changing model_name, new_model, and res_path accordingly, to perform generation with the realigned model.
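A minimal sketch of the kind of edit this involves (the variable names come from the README; the values and comments below are hypothetical placeholders, and the actual assignments live in generate.py):

# Hypothetical sketch of the variables to update in generate.py:
model_name = "your-username/backdoored-llama2-7b"   # hypothetical: model to load for generation
new_model = "realigned-llama2-7b"                   # hypothetical: identifier of the realigned checkpoint
res_path = "results/realigned_generations.jsonl"    # hypothetical: output path for generations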
