This is the official repository for the paper: Atoxia: Red-teaming Large Language Models with Target Toxic Answers accepted at Findings of NAACL 2025.
Authors: Yuhao Du*, Zhuo Li*, Pengyu Cheng, Xiang Wan, Anningzhe Gao#
Atoxia is a series of models that detect toxic potential in modern LLMs. It is built on the Mistral-7B-Instruct-v0.2 foundation, and it offers special models to detect toxicity in many popular LLMs.
We currently plan to release the following models:
- Atoxia-finetuned-on-llama2: Specifically fine-tuned to detect toxicity in Llama2-7b.
- Atoxia-finetuned-on-llama3: Specifically fine-tuned to detect toxicity in Llama3-8b.
- Atoxia-finetuned-on-mistral: Specifically fine-tuned to detect toxicity in Mistral-7b.
- Atoxia-finetuned-on-vicuna: Specifically fine-tuned to detect toxicity in Vicuna-7b.
Note: While each model is fine-tuned for a specific target LLM, they can be transferred to detect toxicity in other modern LLMs like GPT4. Performance may vary depending on the target model.
Explore the Atoxia-finetuned-on-llama2 model in modelscope studio: https://modelscope.cn/studios/EyhooDu/ToxDet-finetuned-on-llama2
- Release training code.
- Release demo.
- Release pre-trained models.
- ToxDet-finetuned-on-llama2
- ToxDet-finetuned-on-llama3
- ToxDet-finetuned-on-mistral
- ToxDet-finetuned-on-vicuna
conda env create -f environment.yml
#!/bin/bash
set -e
set -x
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
# DeepSpeed Team
DATA_PATH=$DATA_PATH
ACTOR_MODEL_PATH="Mistral-7B-Instruct-v0.2"
REWARD_MODEL_PATH="Mistral-7B-Instruct-v0.2"
ACTOR_ZERO_STAGE=2
REWARD_ZERO_STAGE=3
REFERENCE_ZERO_STAGE=3
TIME_STEP=`date "+%Y-%m-%d-%H-%M-%S"`
OUTPUT="./log/mistral-7b/$TIME_STEP"
SEED=2024
KL=0.05
mkdir -p $OUTPUT
ACTOR_LR=1e-6
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 12346 main.py \
--algo "remax" \
--data_path $DATA_PATH \
--data_output_path "./tmp/data_files/mistral" \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--reward_model_name_or_path $REWARD_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_generation_batch_size 1 \
--per_device_training_batch_size 1 \
--per_device_eval_batch_size 1 \
--generation_batches 1 \
--ppo_epochs 1 \
--max_answer_seq_len 128 \
--max_prompt_seq_len 128 \
--actor_learning_rate ${ACTOR_LR} \
--actor_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--actor_gradient_checkpointing \
--disable_actor_dropout \
--disable_reward_dropout \
--num_warmup_steps 0 \
--kl_ctl $KL \
--gamma 1.0 \
--deepspeed \
--seed $SEED \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--reward_zero_stage $REWARD_ZERO_STAGE \
--reference_zero_stage $REFERENCE_ZERO_STAGE \
--enable_hybrid_engine \
--output_dir $OUTPUT \
--enable_tensorboard \
--print_answers \
--save_answers
Please refer to AdvPrompter for keyword matching evaluation.
If you use any of the models or code in this repository, please cite the following paper:
@misc{du2024atoxia,
title={Atoxia: Red-teaming Large Language Models with Target Toxic Answers},
author={Yuhao Du and Zhuo Li and Pengyu Cheng and Xiang Wan and Anningzhe Gao},
year={2024},
eprint={2408.14853},
archivePrefix={arXiv},
primaryClass={cs.CL}
}