Single-pass Detection of Jailbreaking Input in Large Language Models

This repository is the original implementation of the paper "Single-pass Detection of Jailbreaking Input in Large Language Models", published in TMLR 2025.

The codebase is built on JailbreakBench and extends it with our jailbreaking defense method, SPD.

Setup

Step 1: Install JailbreakBench

Follow the installation instructions from the original JailbreakBench repository:

conda create -n spd python=3.10
conda activate spd
cd jailbreakbench
pip install jailbreakbench[vllm]

Step 2: Install additional libraries

cd ..
pip install -r requirements.txt
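
After both steps, a quick check can confirm that the packages are visible to the active environment. The following is a minimal sketch, not part of the repository: the module names are assumed from the pip command above, and `check_installed` is a hypothetical helper.

```python
import importlib.util

# Import names assumed from the pip commands above; adjust if they differ.
REQUIRED = ["jailbreakbench", "vllm"]


def check_installed(names):
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_installed(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

If a package is reported as missing, make sure the `spd` environment is active before rerunning the install commands.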

Usage

In the data folder, we share a large dataset of successful jailbreaks that we generated for 4 models using 3 different attacks (GCG, AutoDAN, and PAIR), with more than 4000 samples in total.
Run evaluate.ipynb to save and load logit values, train a classifier, and test it on your own data.
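
The overall shape of that workflow (store features, reload them, fit a classifier, evaluate) can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the features are synthetic rather than real logits, JSON is used as the storage format, and a plain logistic regression stands in for whatever classifier the notebook trains. All function names are hypothetical and do not come from the repository.

```python
import json
import math
import random
import tempfile
from pathlib import Path


def save_features(path, features, labels):
    """Write feature vectors and 0/1 labels to a JSON file."""
    Path(path).write_text(json.dumps({"features": features, "labels": labels}))


def load_features(path):
    """Read back the feature vectors and labels written by save_features."""
    payload = json.loads(Path(path).read_text())
    return payload["features"], payload["labels"]


def sigmoid(z):
    # Clamp the score to avoid overflow in math.exp for extreme values.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.1, epochs=200):
    """Fit weights and a bias with per-sample gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (flag as jailbreak) or 0 (benign)."""
    return int(sum(w * v for w, v in zip(weights, x)) + bias > 0.0)


def make_toy_data(n, rng):
    """Synthetic stand-in for logit features; NOT the repository's data."""
    features, labels = [], []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else -2.0
        features.append([rng.gauss(center, 1.0) for _ in range(4)])
        labels.append(label)
    return features, labels


rng = random.Random(0)
train_x, train_y = make_toy_data(200, rng)
test_x, test_y = make_toy_data(100, rng)

# Save the training features to disk and load them back before training.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logits.json"
    save_features(path, train_x, train_y)
    loaded_x, loaded_y = load_features(path)

weights, bias = train_logistic(loaded_x, loaded_y)
correct = sum(predict(weights, bias, x) == y for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print(f"toy test accuracy: {accuracy:.2f}")
```

To try the same pattern on real data, replace `make_toy_data` with features extracted from your own model; evaluate.ipynb remains the reference for the actual pipeline.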

Cite as:

@article{
candogan2025singlepass,
title={Single-pass Detection of Jailbreaking Input in Large Language Models},
author={Leyla Naz Candogan and Yongtao Wu and Elias Abad Rocamora and Grigorios Chrysos and Volkan Cevher},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=42v6I5Ut9a},
note={}
}

If you use the dataset in your work, please also consider citing the sources of the attacks it is built from:

@misc{zou2023universal,
  title={Universal and Transferable Adversarial Attacks on Aligned Language Models},
  author={Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson},
  year={2023},
  eprint={2307.15043},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@inproceedings{liu2023autodan,
    title={Generating Stealthy Jailbreak Prompts on Aligned Large Language Models},
    author={Xiaogeng Liu and Nan Xu and Muhao Chen and Chaowei Xiao},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2024},
    url={https://openreview.net/forum?id=7Jwpw4qKkb}
}
@misc{chao2023jailbreaking,
      title={Jailbreaking Black Box Large Language Models in Twenty Queries}, 
      author={Patrick Chao and Alexander Robey and Edgar Dobriban and Hamed Hassani and George J. Pappas and Eric Wong},
      year={2023},
      eprint={2310.08419},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@inproceedings{tdc2023,
  title={TDC 2023 (LLM Edition): The Trojan Detection Challenge},
  author={Mantas Mazeika and Andy Zou and Norman Mu and Long Phan and Zifan Wang and Chunru Yu and Adam Khoja and Fengqing Jiang and Aidan O'Gara and Ellie Sakhaee and Zhen Xiang and Arezoo Rajabi and Dan Hendrycks and Radha Poovendran and Bo Li and David Forsyth},
  booktitle={NeurIPS Competition Track},
  year={2023}
}
