Atoxia

This is the official repository for the paper: Atoxia: Red-teaming Large Language Models with Target Toxic Answers accepted at Findings of NAACL 2025.

Authors: Yuhao Du*, Zhuo Li*, Pengyu Cheng, Xiang Wan, Anningzhe Gao#

Introduction

Atoxia is a series of models that detect toxic potential in modern LLMs. It is built on the Mistral-7B-Instruct-v0.2 foundation, and it offers special models to detect toxicity in many popular LLMs.

Available Models

We currently plan to release the following models:

Atoxia-finetuned-on-llama2: Specifically fine-tuned to detect toxicity in Llama2-7b.
Atoxia-finetuned-on-llama3: Specifically fine-tuned to detect toxicity in Llama3-8b.
Atoxia-finetuned-on-mistral: Specifically fine-tuned to detect toxicity in Mistral-7b.
Atoxia-finetuned-on-vicuna: Specifically fine-tuned to detect toxicity in Vicuna-7b.

Note: While each model is fine-tuned for a specific target LLM, they can be transferred to detect toxicity in other modern LLMs like GPT4. Performance may vary depending on the target model.

Demo

Explore the Atoxia-finetuned-on-llama2 model in modelscope studio: https://modelscope.cn/studios/EyhooDu/ToxDet-finetuned-on-llama2

TODO List

Datasets

Requirements

conda env create -f environment.yml

Train

#!/bin/bash

set -e
set -x

export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1

# DeepSpeed Team
DATA_PATH=$DATA_PATH
ACTOR_MODEL_PATH="Mistral-7B-Instruct-v0.2"
REWARD_MODEL_PATH="Mistral-7B-Instruct-v0.2"
ACTOR_ZERO_STAGE=2
REWARD_ZERO_STAGE=3
REFERENCE_ZERO_STAGE=3
TIME_STEP=`date "+%Y-%m-%d-%H-%M-%S"`
OUTPUT="./log/mistral-7b/$TIME_STEP"
SEED=2024
KL=0.05

mkdir -p $OUTPUT


ACTOR_LR=1e-6

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 12346 main.py \
   --algo "remax" \
   --data_path $DATA_PATH \
   --data_output_path "./tmp/data_files/mistral" \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --reward_model_name_or_path $REWARD_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_generation_batch_size 1 \
   --per_device_training_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --generation_batches 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 128 \
   --max_prompt_seq_len 128 \
   --actor_learning_rate ${ACTOR_LR} \
   --actor_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --disable_reward_dropout \
   --num_warmup_steps 0 \
   --kl_ctl $KL \
   --gamma 1.0 \
   --deepspeed \
   --seed $SEED \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --reward_zero_stage $REWARD_ZERO_STAGE \
   --reference_zero_stage $REFERENCE_ZERO_STAGE \
   --enable_hybrid_engine \
   --output_dir $OUTPUT \
   --enable_tensorboard \
   --print_answers \
   --save_answers

Eval

Please refer to AdvPrompter for keyword matching evaluation.

Citation

If you use any of the models or code in this repository, please cite the following paper:

@misc{du2024atoxia,
      title={Atoxia: Red-teaming Large Language Models with Target Toxic Answers}, 
      author={Yuhao Du and Zhuo Li and Pengyu Cheng and Xiang Wan and Anningzhe Gao},
      year={2024},
      eprint={2408.14853},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
utils		utils
.gitignore		.gitignore
README.md		README.md
environment.yml		environment.yml
main.py		main.py
perf.py		perf.py
remax_trainer.py		remax_trainer.py
rlhf_engine.py		rlhf_engine.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Atoxia

Introduction

Available Models

Demo

TODO List

Datasets

Requirements

Train

Eval

Citation

About

Releases

Packages

Languages

DuYooho/Atoxia

Folders and files

Latest commit

History

Repository files navigation

Atoxia

Introduction

Available Models

Demo

TODO List

Datasets

Requirements

Train

Eval

Citation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages