Skip to content
/ Atoxia Public

[NAACL-2025] Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Notifications You must be signed in to change notification settings

DuYooho/Atoxia

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Atoxia

This is the official repository for the paper: Atoxia: Red-teaming Large Language Models with Target Toxic Answers accepted at Findings of NAACL 2025.

Authors: Yuhao Du*, Zhuo Li*, Pengyu Cheng, Xiang Wan, Anningzhe Gao#

Introduction

Atoxia is a series of models that detect toxic potential in modern LLMs. It is built on the Mistral-7B-Instruct-v0.2 foundation, and it offers special models to detect toxicity in many popular LLMs.

Available Models

We currently plan to release the following models:

  • Atoxia-finetuned-on-llama2: Specifically fine-tuned to detect toxicity in Llama2-7b.
  • Atoxia-finetuned-on-llama3: Specifically fine-tuned to detect toxicity in Llama3-8b.
  • Atoxia-finetuned-on-mistral: Specifically fine-tuned to detect toxicity in Mistral-7b.
  • Atoxia-finetuned-on-vicuna: Specifically fine-tuned to detect toxicity in Vicuna-7b.

Note: While each model is fine-tuned for a specific target LLM, they can be transferred to detect toxicity in other modern LLMs like GPT4. Performance may vary depending on the target model.

Demo

Explore the Atoxia-finetuned-on-llama2 model in modelscope studio: https://modelscope.cn/studios/EyhooDu/ToxDet-finetuned-on-llama2

TODO List

  • Release training code.
  • Release demo.
  • Release pre-trained models.
    • ToxDet-finetuned-on-llama2
    • ToxDet-finetuned-on-llama3
    • ToxDet-finetuned-on-mistral
    • ToxDet-finetuned-on-vicuna

Datasets

  1. AdvBench
  2. HH-Harmless

Requirements

conda env create -f environment.yml

Train

#!/bin/bash

set -e
set -x

export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# DeepSpeed Team
DATA_PATH=$DATA_PATH
ACTOR_MODEL_PATH="Mistral-7B-Instruct-v0.2"
REWARD_MODEL_PATH="Mistral-7B-Instruct-v0.2"
ACTOR_ZERO_STAGE=2
REWARD_ZERO_STAGE=3
REFERENCE_ZERO_STAGE=3
TIME_STEP=`date "+%Y-%m-%d-%H-%M-%S"`
OUTPUT="./log/mistral-7b/$TIME_STEP"
SEED=2024
KL=0.05

mkdir -p $OUTPUT


ACTOR_LR=1e-6

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 12346 main.py \
   --algo "remax" \
   --data_path $DATA_PATH \
   --data_output_path "./tmp/data_files/mistral" \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --reward_model_name_or_path $REWARD_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${ACTOR_LR} \
   --actor_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --disable_reward_dropout \
   --num_warmup_steps 0 \
   --kl_ctl $KL \
   --gamma 1.0 \
   --deepspeed \
   --seed $SEED \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --reward_zero_stage $REWARD_ZERO_STAGE \
   --reference_zero_stage $REFERENCE_ZERO_STAGE \
   --enable_hybrid_engine \
   --output_dir $OUTPUT \
   --enable_tensorboard \
   --print_answers \
   --save_answers

Eval

Please refer to AdvPrompter for keyword matching evaluation.

Citation

If you use any of the models or code in this repository, please cite the following paper:

@misc{du2024atoxia,
      title={Atoxia: Red-teaming Large Language Models with Target Toxic Answers}, 
      author={Yuhao Du and Zhuo Li and Pengyu Cheng and Xiang Wan and Anningzhe Gao},
      year={2024},
      eprint={2408.14853},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

About

[NAACL-2025] Atoxia: Red-teaming Large Language Models with Target Toxic Answers

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages