A collection of resources on attacks and defenses targeting text-to-image diffusion models

datar001/Awesome-AD-on-T2IDM


Awesome-Attacks and Defenses on T2I Diffusion Models

This repository is a curated collection of research papers focused on $\textbf{Adversarial Attacks and Defenses on Text-to-Image Diffusion Models (AD-on-T2IDM)}$.

We will continuously update this collection to track the latest advancements in the field of AD-on-T2IDM.

Stars and follows are welcome! If you have relevant materials or suggestions, please feel free to contact us (zcy@tju.edu.cn) or submit a pull request.

For more detailed information, please refer to our survey paper: [ARXIV] [Published Version]

🔔 News

  • 2024-09-12 Our survey "Adversarial Attacks and Defenses on Text-to-Image Diffusion Models" has been accepted by Information Fusion (SCI-1, IF 14.7).

Citation

```bibtex
@article{zhang2024adversarial,
  title={Adversarial attacks and defenses on text-to-image diffusion models: A survey},
  author={Zhang, Chenyu and Hu, Mingwang and Li, Wenhui and Wang, Lanjun},
  journal={Information Fusion},
  pages={102701},
  year={2024},
  publisher={Elsevier}
}
```

Content

Recently, the text-to-image diffusion model has gained considerable attention from the community due to its exceptional image generation capability. A representative model, Stable Diffusion, amassed more than 10 million users within just two months of its release. This surge in popularity has facilitated studies on the robustness and safety of the model, leading to the proposal of various adversarial attack methods. Simultaneously, there has been a marked increase in research focused on defense methods to improve the robustness and safety of these models. In this survey, we provide a comprehensive review of the literature on adversarial attacks and defenses targeting text-to-image diffusion models. We begin with an overview of popular text-to-image diffusion models, followed by an introduction to a taxonomy of adversarial attacks and an in-depth review of existing attack methods. We then present a detailed analysis of current defense methods that improve model robustness and safety. Finally, we discuss ongoing challenges and explore promising future research directions.

Two key concerns in T2IDM: Robustness and Safety

Robustness ensures that the model generates semantically consistent images in response to the diverse prompts users input in practice.

Safety prevents misuse of the model for creating malicious images, such as sexual, violent, or political content.

Adversarial attacks

Based on the intent of the adversary, existing attack methods can be divided into two primary categories: untargeted and targeted attacks.

  • For untargeted attacks, consider a scenario with a prompt input by the user ($\textbf{clean prompt}$) and its corresponding output image ($\textbf{clean image}$). The objective of an untargeted attack is to subtly perturb the clean prompt into an $\textbf{adversarial prompt}$ that misleads the victim model into generating an $\textbf{adversarial image}$ whose semantics differ from the clean image. This type of attack is commonly used to uncover vulnerabilities in the robustness of the victim model. Some untargeted attacks are shown as follows:

    untargeted attacks

  • Targeted attacks assume that the victim model has built-in $\textbf{safeguards}$ to filter $\textbf{malicious prompts}$ and the resulting $\textbf{malicious images}$. Such prompts and images often explicitly contain $\textbf{malicious concepts}$, such as 'nudity', 'violence', and other predefined concepts. The objective of a targeted attack is to craft an $\textbf{adversarial prompt}$ that bypasses these safeguards while inducing the victim model to generate $\textbf{adversarial images}$ containing the malicious concepts. This type of attack is typically designed to reveal vulnerabilities in the safety of the victim model. Some targeted attacks are shown as follows:

    targeted attacks
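The untargeted-attack objective above can be sketched as a black-box search: greedily substitute characters in the clean prompt so that the perturbed prompt's text embedding drifts away from the clean one, while the text itself stays nearly identical. The `embed` function below is a hypothetical stand-in for a real text encoder (e.g., CLIP's), and the greedy loop is a drastic simplification of the query-based searches used by the papers listed here.

```python
import hashlib

def embed(prompt: str) -> list[float]:
    # Hypothetical stand-in for a real text encoder: hashes the
    # prompt into a small deterministic vector.
    digest = hashlib.sha256(prompt.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

def distance(u: list[float], v: list[float]) -> float:
    # Euclidean distance between two embeddings.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def untargeted_attack(clean_prompt: str, budget: int = 3) -> str:
    """Greedily substitute single characters so the adversarial
    prompt's embedding moves as far as possible from the clean
    prompt's, under a small edit budget."""
    clean_emb = embed(clean_prompt)
    adv = clean_prompt
    for _ in range(budget):
        best, best_dist = adv, distance(embed(adv), clean_emb)
        for i in range(len(adv)):
            for c in "abcdefghijklmnopqrstuvwxyz":
                cand = adv[:i] + c + adv[i + 1:]
                d = distance(embed(cand), clean_emb)
                if d > best_dist:
                    best, best_dist = cand, d
        adv = best
    return adv

adv = untargeted_attack("a photo of a dog")
print(adv)  # a subtly perturbed prompt whose embedding has drifted
```

In a real attack, the objective would be the semantic distance between the *generated images* (or their CLIP embeddings), queried through the victim model, rather than this toy hash-based proxy.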

Defenses

Based on the defense goal, existing defense methods can be classified into two categories: 1) improving model robustness and 2) improving model safety.

  • The goal of robustness is to ensure that generated images have semantics consistent with the diverse input prompts encountered in practical applications. Specifically, mirroring the attack taxonomy, defense methods must mitigate robustness vulnerabilities for two types of input prompts: 1) prompts with multiple objects and attributes, and 2) grammatically incorrect prompts with subtle noise.

  • The safety goal is to prevent the generation of malicious images in response to both malicious and adversarial prompts. Specifically, malicious prompts explicitly contain malicious concepts, while adversarial prompts cleverly omit them. Moreover, based on their knowledge of the model, existing safety methods can be classified into two categories: external safeguards and internal safeguards. External safeguards detect or correct a malicious prompt before it is fed into the text-to-image model. In contrast, internal safeguards modify parameters and features within the model so that the semantics of output images deviate from those of malicious images. Some examples of external and internal safeguards are shown as follows:

    external safeguards internal safeguards
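To make the external/internal distinction concrete, the sketch below pairs a deliberately simplistic external safeguard (a keyword blocklist applied to the prompt before generation) with a toy analogue of an internal safeguard (projecting a feature vector off a "malicious concept" direction). The blocklist, the concept vector, and both functions are illustrative assumptions, not any paper's method; real external safeguards use trained classifiers, and real internal safeguards fine-tune or edit model weights.

```python
import math
import re

# --- External safeguard: screen the prompt before it reaches the model ---
BLOCKLIST = {"nudity", "violence", "gore"}  # hypothetical concept list

def is_malicious(prompt: str) -> bool:
    """Reject a prompt that explicitly mentions a blocked concept."""
    tokens = set(re.findall(r"[a-z]+", prompt.lower()))
    return bool(tokens & BLOCKLIST)

# --- Internal safeguard: steer features away from a malicious concept ---
def project_out(embedding: list[float], concept: list[float]) -> list[float]:
    """Remove the component of `embedding` along the concept direction,
    a toy analogue of concept-erasure methods that edit model internals."""
    norm = math.sqrt(sum(c * c for c in concept))
    unit = [c / norm for c in concept]
    dot = sum(e * u for e, u in zip(embedding, unit))
    return [e - dot * u for e, u in zip(embedding, unit)]

print(is_malicious("a scene of violence"))             # True
print(project_out([1.0, 2.0, 3.0], [0.0, 0.0, 1.0]))   # [1.0, 2.0, 0.0]
```

Note why adversarial prompts defeat the external half of this sketch: they omit the blocked words entirely, which is exactly what motivates the learned detectors and internal-editing methods surveyed below.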

Notably, although many methods have been proposed to improve model robustness against prompts with multiple objects and attributes, this collection omits papers on that topic, since related surveys already exist, such as on controllable image generation [PDF] and on the development and advancement of image generation capabilities [PDF-1], [PDF-2], [PDF-3]. Moreover, for grammatically incorrect prompts with subtle noise, mature solutions are still lacking. Therefore, this collection mainly focuses on defense methods for improving model safety.

Stable diffusion is unstable

Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu

NeurIPS 2024. [PDF] [CODE]

A pilot study of query-free adversarial attack against stable diffusion

Haomin Zhuang, Yihua Zhang

CVPRW 2023. [PDF] [CODE]

Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks

Hongcheng Gao, Hao Zhang, Yinpeng Dong, Zhijie Deng

arxiv 2023. [PDF]

Red-Teaming the Stable Diffusion Safety Filter

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr

arxiv 2022. [PDF]

SneakyPrompt: Evaluating Robustness of Text-to-image Generative Models' Safety Filters

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, Yinzhi Cao

Proceedings of the IEEE Symposium on Security and Privacy 2024. [PDF] [CODE]

Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models

Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang

Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. [PDF] [CODE]

Riatig: Reliable and imperceptible adversarial text-to-image generation with natural prompts

Han Liu, Yuhao Wu, Shixuan Zhai, Bo Yuan, Ning Zhang

CVPR 2023. [PDF] [CODE]

Mma-diffusion: Multimodal attack on diffusion models

Yang, Yijun and Gao, Ruiyuan and Wang, Xiaosen and Ho, Tsung-Yi and Xu, Nan and Xu, Qiang

CVPR 2024. [PDF] [CODE]

Black Box Adversarial Prompting for Foundation Models

Natalie Maus, Patrick Chao, Eric Wong, Jacob Gardner

arxiv 2023. [PDF] [CODE]

Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks

Haz Sameen Shahgir, Xianghao Kong, Greg Ver Steeg, Yue Dong

arxiv 2023. [PDF] [CODE]

Revealing vulnerabilities in stable diffusion via targeted attacks

Chenyu Zhang, Lanjun Wang, Anan Liu

arxiv 2024. [PDF] [CODE]

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

Tsai, Yu-Lin and Hsu, Chia-Yi and Xie, Chulin and Lin, Chih-Hsun and Chen, Jia-You and Li, Bo and Chen, Pin-Yu and Yu, Chia-Mu and Huang, Chun-Ying

ICLR 2024. [PDF]

To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now

Zhang, Yimeng and Jia, Jinghan and Chen, Xin and Chen, Aochuan and Zhang, Yihua and Liu, Jiancheng and Ding, Ke and Liu, Sijia

ECCV 2024. [PDF] [CODE]

Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts

Chin, Zhi-Yi and Jiang, Chieh-Ming and Huang, Ching-Chun and Chen, Pin-Yu and Chiu, Wei-Chen

ICML 2024. [PDF] [CODE]

FLIRT: Feedback Loop In-context Red Teaming

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta

arxiv 2023. [PDF]

Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Jiachen Ma, Anda Cao, Zhiqing Xiao, Jie Zhang, Chao Ye, Junbo Zhao

arxiv 2024. [PDF]


Exploiting cultural biases via homoglyphs in text-to-image synthesis

Struppek, Lukas and Hintersdorf, Dom and Friedrich, Felix and Schramowski, Patrick and Kersting, Kristian

Journal of Artificial Intelligence Research 2023. [PDF] [CODE]

Adversarial Attacks on Image Generation With Made-Up Words

Raphaël Millière

arxiv 2022. [PDF]

SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren

arxiv 2023. [PDF]

Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models

Yimo Deng, Huangxun Chen

arxiv 2024. [PDF]

Groot: Adversarial Testing for Generative Text-to-Image Models with Tree-based Semantic Transformation

Yi Liu, Guowei Yang, Gelei Deng, Feiyue Chen, Yuqi Chen, Ling Shi, Tianwei Zhang, Yang Liu

arxiv 2024. [PDF]

BSPA: Exploring Black-box Stealthy Prompt Attacks against Image Generators

Yu Tian, Xiao Yang, Yinpeng Dong, Heming Yang, Hang Su, Jun Zhu

arxiv 2024. [PDF]

Latent Guard: a Safety Framework for Text-to-image Generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati

ECCV 2024. [PDF] [CODE]

Universal Prompt Optimizer for Safe Text-to-Image Generation

Zongyu Wu, Hongcheng Gao, Yueze Wang, Xiang Zhang, Suhang Wang

NAACL 2024. [PDF]

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu

arxiv 2024. [PDF]

Erasing concepts from diffusion models

Gandikota, Rohit and Materzynska, Joanna and Fiotto-Kaufman, Jaden and Bau, David

ICCV 2023. [PDF] [CODE]

Ablating concepts in text-to-image diffusion models

Kumari, Nupur and Zhang, Bingliang and Wang, Sheng-Yu and Shechtman, Eli and Zhang, Richard and Zhu, Jun-Yan

ICCV 2023. [PDF] [CODE]

Unified concept editing in diffusion models

Gandikota, Rohit and Orgad, Hadas and Belinkov, Yonatan and Materzyńska, Joanna and Bau, David

WACV 2024. [PDF] [CODE]

Editing implicit assumptions in text-to-image diffusion models

Orgad, Hadas and Kawar, Bahjat and Belinkov, Yonatan

ICCV 2023. [PDF] [CODE]

Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models

Sanghyun Kim, Seohyeon Jung, Balhae Kim, Moonseok Choi, Jinwoo Shin, Juho Lee

ICML 2023 Workshop on Challenges in Deployable Generative AI. [PDF] [CODE]

Degeneration-tuning: Using scrambled grid shield unwanted concepts from stable diffusion

Ni, Zixuan and Wei, Longhui and Li, Jiacheng and Tang, Siliang and Zhuang, Yueting and Tian, Qi

ACM MM 2023. [PDF]

ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

Dana Arad, Hadas Orgad, Yonatan Belinkov

NAACL 2024. [PDF]

Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi

CVPR 2024. [PDF] [CODE]

One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications

Mengyao Lyu, Yuhong Yang, Haiwen Hong, Hui Chen, Xuan Jin, Yuan He, Hui Xue, Jungong Han, Guiguang Ding

CVPR 2024. [PDF] [CODE]

Selective Amnesia: A Continual Learning Approach to Forgetting in Deep Generative Models

Alvin Heng, Harold Soh

NeurIPS 2024. [PDF] [CODE]

All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models

Hong, Seunghoo and Lee, Juhun and Woo, Simon S

AAAI 2024. [PDF]

SafeGen: Mitigating Unsafe Content Generation in Text-to-Image Models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, Wenyuan Xu

ACM CCS 2024. [PDF] [CODE]

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee

ICML 2024 Workshop. [PDF]

Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Poppi, Samuele and Poppi, Tobia and Cocchi, Federico and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita

ECCV 2024. [PDF] [CODE]

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Yongliang Wu, Shiji Zhou, Mingzhuo Yang, Lianzhe Wang, Wenbo Zhu, Heng Chang, Xiao Zhou, Xu Yang

arxiv 2024. [PDF]

R.A.C.E.: Robust Adversarial Concept Erasure for Secure Text-to-Image Diffusion Model

Changhoon Kim, Kyle Min, Yezhou Yang

ECCV 2024. [PDF] [CODE]

Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers

Chi-Pin Huang, Kai-Po Chang, Chung-Ting Tsai, Yung-Hsuan Lai, Fu-En Yang, Yu-Chiang Frank Wang

ECCV 2024. [PDF] [CODE]

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models

Yimeng Zhang, Xin Chen, Jinghan Jia, Yihua Zhang, Chongyu Fan, Jiancheng Liu, Mingyi Hong, Ke Ding, Sijia Liu

arxiv 2024. [PDF] [CODE]

Editing Massive Concepts in Text-to-Image Diffusion Models

Tianwei Xiong, Yue Wu, Enze Xie, Yue Wu, Zhenguo Li, Xihui Liu

arxiv 2024. [PDF] [CODE]

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, Kristian Kersting

CVPR 2023. [PDF] [CODE]

Sega: Instructing text-to-image models using semantic guidance

Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting

NeurIPS 2023. [PDF] [CODE]

Self-discovering interpretable diffusion latent directions for responsible text-to-image generation

Li, Hang and Shen, Chengzhi and Torr, Philip and Tresp, Volker and Gu, Jindong

CVPR 2024. [PDF] [CODE]

Datasets and Tools

This part provides commonly used datasets and tools in AD-on-T2IDM.

Based on the prompt source, existing datasets fall into two types: clean and adversarial datasets. A clean dataset consists of clean prompts that have not been attacked and are typically crafted by humans, while an adversarial dataset comprises adversarial prompts generated by attack methods. Moreover, according to the category of prompts involved, existing clean datasets are further divided into non-malicious and malicious datasets: the former contain non-malicious prompts, the latter explicitly malicious ones. In this section, we introduce several non-malicious, malicious, and adversarial datasets, respectively.

Non-Malicious Datasets

  • $\textit{ImageNet}$, which contains images of 1,000 categories of common real-world objects, is a significant benchmark in computer vision. As a result, some works craft clean datasets based on the category information in ImageNet. For instance, ATM employs a standardized template, "A photo of {CLASS_NAME}", to generate clean prompts, where "{CLASS_NAME}" denotes a class name in ImageNet.

  • $\textit{MSCOCO}$ [Link] is a cross-modal image-text dataset and a popular benchmark for training and evaluating text-to-image generation models. Specifically, MSCOCO includes 82,783 training images and 40,504 testing images, each with 5 text descriptions.

  • $\textit{LAION-COCO}$ [Link] is a subset of LAION-5B, which is a large-scale image-text dataset in the real world. LAION-COCO includes 600 million images and corresponding text descriptions.

  • $\textit{DiffusionDB}$ [Link] is a large-scale text-to-image prompt dataset, which contains 14 million images generated by Stable Diffusion using prompts from real users.
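The ATM-style template in the first bullet can be instantiated directly. The class names below are a small illustrative sample, not the actual ImageNet label set:

```python
# Illustrative sample of ImageNet class names (the real set has 1,000).
IMAGENET_CLASSES = ["goldfish", "tabby cat", "golden retriever"]

def make_clean_prompts(class_names: list[str]) -> list[str]:
    """Build clean prompts with the template 'A photo of {CLASS_NAME}'."""
    return [f"A photo of {name}" for name in class_names]

prompts = make_clean_prompts(IMAGENET_CLASSES)
print(prompts[0])  # A photo of goldfish
```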

Malicious Datasets

  • $\textit{Unsafe Diffusion}$ [Link] provides 30 manually crafted malicious prompts that describe sexual and bloody content, as well as political figures.
  • $\textit{SneakyPrompt}$ [Link] uses ChatGPT to automatically generate 200 malicious prompts that involve sexual and bloody content.
  • $\textit{I2P}$ [Link] comprises 4,703 inappropriate prompts, encompassing hate, harassment, violence, self-harm, nudity content, shocking images, and illegal activity. These inappropriate prompts are real-user inputs sourced from an image generation website, Lexica [Link].
  • $\textit{MMA}$ [Link] samples and releases 1,000 malicious prompts from LAION-COCO based on an NSFW (Not Safe for Work) score. These malicious prompts mainly focus on sexual content.
  • $\textit{Image Synthesis Style Studies Database}$ [Link] compiles thousands of artists whose styles can be replicated by various text-to-image models, such as Stable Diffusion and Midjourney.
  • $\textit{MACE}$ [Link] provides a dataset comprising 200 celebrities whose portraits, generated using SD v1.4, are recognized with remarkable accuracy (>99%) by the GIPHY Celebrity Detector (GCD) [Link].
  • $\textit{ViSU}$ [Link] contains 175k pairs of safe and unsafe data examples. Each example consists of: (1) a safe sentence, (2) a corresponding safe image, (3) an NSFW sentence that is semantically correlated with the safe sentence, and (4) a corresponding NSFW image.

Adversarial Datasets

  • $\textit{Adversarial Nibbler Dataset}$ [Link] consists of 3,412 adversarial prompts that effectively bypass safeguards while inducing text-to-image models to generate malicious images. These prompts, which include violent, sexual, biased, and hate-based material, are manually crafted during the Adversarial Nibbler Challenge.

  • $\textit{MMA}$ [Link] targets 1,000 malicious prompts, generating 1,000 corresponding adversarial prompts using the proposed attack method. These adversarial prompts primarily focus on sexual content.

  • $\textit{Zhang et al.}$ [Link] target 10 objects as malicious concepts and generate 500 adversarial prompts for each object. These adversarial prompts can induce the text-to-image model to produce images related to the malicious concepts even when the prompt contains no words directly related to them.

We provide several detectors for detecting malicious prompts and images.

Malicious Prompt Detector

  • NSFW_text_classifier: [Link]

  • distilbert-nsfw-text-classifier: [Link]

  • Detoxify: [Link]

  • Openai-Moderation: [Link] (API)

  • Azure-Moderation: [Link] (API)
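In deployment, a prompt detector and an image detector are typically chained around the generator: the prompt is screened before generation and the image after. The sketch below shows only the wiring; `nsfw_prompt_score`, `nsfw_image_score`, and the inline generator string are hypothetical stubs standing in for the real classifiers listed in these two sections and for the diffusion model call.

```python
def nsfw_prompt_score(prompt: str) -> float:
    # Stub for a real text classifier (e.g., Detoxify or
    # NSFW_text_classifier); returns a risk score in [0, 1].
    return 0.9 if "nsfw" in prompt.lower() else 0.1

def nsfw_image_score(image: str) -> float:
    # Stub for a real malicious image detector.
    return 0.0

def safe_pipeline(prompt: str, threshold: float = 0.5):
    """Two-stage gate: screen the prompt before generation and the
    image after generation; return None if either stage rejects."""
    if nsfw_prompt_score(prompt) >= threshold:
        return None                       # prompt rejected
    image = f"<image for: {prompt}>"      # stand-in for the T2I model
    if nsfw_image_score(image) >= threshold:
        return None                       # image rejected
    return image

print(safe_pipeline("an nsfw scene"))  # None
print(safe_pipeline("a landscape"))    # <image for: a landscape>
```

The `threshold` trades false rejections of benign prompts against missed adversarial ones; its value would need tuning against datasets like I2P or the Adversarial Nibbler Dataset above.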

Malicious Image Detector
