CognitiveOverload

Code for our NAACL 2024 paper "Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking".

Datasets

We use the following two datasets of malicious prompts:

AdvBench, from "Universal and transferable adversarial attacks on aligned language models"
MasterKey, from "Jailbreaker: Automated jailbreak across multiple large language model chatbots"

We translate the original English prompts into 52 other languages with the Google Cloud API. The original and translated prompts are in the ./datasets folder.

Jailbreaking with Multilingual Cognitive Overload

Run the following script to get the baseline performance, i.e., the results when LLMs are prompted in English:

export dataset="AdvBench"
export model_name="vicuna-7b"
CUDA_VISIBLE_DEVICES=0 python perform_multilingual_attack.py \
    --dataset $dataset \
    --model-name $model_name  \
    --max-tokens 128 \
    --attack en \
    --max-batch-size 16

Harmful Prompting in Various Languages

Run the following script to get the results when prompting an LLM in each of the other languages that it supports:

export dataset="AdvBench"
export model_name="vicuna-7b"
CUDA_VISIBLE_DEVICES=0 python perform_multilingual_attack.py \
    --dataset $dataset \
    --model-name $model_name  \
    --max-tokens 128 \
    --attack monolingual \
    --max-batch-size 16

Language Switching: from English to Lan X vs. from Lan X to English

Extract keywords from the prompts:

export dataset="AdvBench"
CUDA_VISIBLE_DEVICES=0 python prepare_multilingual.py \
    --stage extract_keywords \
    --dataset $dataset

Retrieve definitions of the keywords from Wikipedia:

CUDA_VISIBLE_DEVICES=0 python prepare_multilingual.py \
    --stage wiki_definition \
    --dataset $dataset

Translate the definitions into 52 other languages:

CUDA_VISIBLE_DEVICES=0 python prepare_multilingual.py \
    --stage translate_context \
    --dataset $dataset

Attack LLMs in two turns, either English first and then the other language, or in the reverse order:

export model_name="vicuna-7b"
echo "English first then other language"
CUDA_VISIBLE_DEVICES=0 python perform_multilingual_attack.py \
    --dataset $dataset \
    --model-name $model_name  \
    --max-tokens 128 \
    --attack multilingual \
    --max-batch-size 16 \
    --en-first
echo "other language first then English"
CUDA_VISIBLE_DEVICES=0 python perform_multilingual_attack.py \
    --dataset $dataset \
    --model-name $model_name  \
    --max-tokens 128 \
    --attack multilingual \
    --max-batch-size 16

Jailbreaking with Veiled Expressions

Extract sensitive words:

export dataset="AdvBench"
CUDA_VISIBLE_DEVICES=0 python perform_veiled_attack.py \
    --dataset $dataset \
    --stage extract_sensitive \
    --max-batch-size 16 \
    --model-name Mistral-7B-Instruct-v0.1

Clean the extracted sensitive words:

CUDA_VISIBLE_DEVICES=0 python perform_veiled_attack.py \
    --dataset $dataset \
    --stage clean_word \
    --model-name Mistral-7B-Instruct-v0.1

Replace the sensitive words with veiled expressions:

CUDA_VISIBLE_DEVICES=0 python perform_veiled_attack.py \
    --dataset $dataset \
    --stage replace_sensitive \
    --max-batch-size 16 \
    --model-name Mistral-7B-Instruct-v0.1

Attack LLMs with veiled expressions:

export model_name="vicuna-7b"
CUDA_VISIBLE_DEVICES=0 python perform_veiled_attack.py \
    --dataset $dataset \
    --stage sensitive_attack \
    --max-batch-size 16 \
    --model-name $model_name \
    --num-demo 8

Jailbreaking with Effect-to-Cause Cognitive Overload

Extract events with in-context learning:

export dataset="AdvBench"
for model_name in 'Mistral-7B-v0.1' 'llama2-13b'
do
    CUDA_VISIBLE_DEVICES=0 python perform_effect_to_cause.py \
        --dataset $dataset \
        --model-name $model_name \
        --stage extract_event \
        --max-batch-size 16
done

Filter the extracted events:

python perform_effect_to_cause.py \
    --dataset $dataset \
    --stage filter_event

Perform effect-to-cause attacks:

export model_name="vicuna-7b"
CUDA_VISIBLE_DEVICES=0 python perform_effect_to_cause.py \
    --dataset $dataset \
    --model-name $model_name \
    --stage attack \
    --max-batch-size 16

Citation

We appreciate the data contributions from these two papers:

@article{zou2023universal,
  title={Universal and transferable adversarial attacks on aligned language models},
  author={Zou, Andy and Wang, Zifan and Carlini, Nicholas and Nasr, Milad and Kolter, J Zico and Fredrikson, Matt},
  journal={arXiv preprint arXiv:2307.15043},
  year={2023}
}
@article{deng2023jailbreaker,
  title={Jailbreaker: Automated jailbreak across multiple large language model chatbots},
  author={Deng, Gelei and Liu, Yi and Li, Yuekang and Wang, Kailong and Zhang, Ying and Li, Zefeng and Wang, Haoyu and Zhang, Tianwei and Liu, Yang},
  journal={arXiv preprint arXiv:2307.08715},
  year={2023}
}

If you find our work helpful, please consider citing it:

@inproceedings{xu2024cognitive,
  title={Cognitive overload: Jailbreaking large language models with overloaded logical thinking},
  author={Xu, Nan and Wang, Fei and Zhou, Ben and Li, Bang Zheng and Xiao, Chaowei and Chen, Muhao},
  booktitle={NAACL - Findings},
  year={2024}
}
