[Multiple Tasks] FEAT add attack modules from moonshot #376
Comments
I will take on the colloquial wordswap attack and mark it completed on the task list once completed👍 |
I will "attack" Textfooler and Textbugger. Will mark it completed once done. |
Hi @eugeniavkim , I would like to work on I also took a look at the Here’s the link to the files I mentioned: Toxic Sentence Generator. Looking forward to your thoughts!
|
@KutalVolkan go ahead! Which files are unsafe? |
Hello Roman, Here’s the link and the screenshot I mentioned regarding the unsafe files: Toxic Sentence Generator on Hugging Face. |
Hello @romanlutz, A few additional questions:
Submitting separate PRs might allow for more focused reviews and quicker feedback on each converter, but I'll defer to your preference on how you'd like to handle it.
This contrasts with a standard converter that mostly modifies the input prompt. In this case, Violent Durian seems to guide a multi-turn conversation, progressively influencing the LLM to respond unethically and act in alignment with the persona. For example, I plan to integrate this behavior into the Red Teaming Orchestrator by dynamically selecting a criminal persona and applying it to the conversation objective in the YAML-based attack strategy, adapting the YAML to fit the Violent Durian use case. If you have a different approach or best practices to suggest, I’d be happy to incorporate them. Looking forward to your thoughts 😀 |
Yes, individual PRs are preferable, unless you're reusing pieces. Even then it's probably better to have them one after the other. Your idea to use it on the orchestrator level makes sense. Essentially, this would be a new custom attack strategy. |
Good question... I have not used them before, but this sounds suspicious. Maybe it's because they're binary? |
Marking this with
may be good first issues to tackle. |
@visirion07 - are you still planning on taking a look at Textfooler and Textbugger? 😄 |
Yes @nina-msft. Sorry, I got held up with some other work. Taking this up as a high priority. Will post an ETA soon. |
Is your feature request related to a problem? Please describe.
Adding in attack modules from Project Moonshot that can be adapted as converters under pyrit.prompt_converter
Describe the solution you'd like
Directly porting over the techniques from attack-modules from https://github.com/aiverify-foundation/moonshot-data?tab=readme-ov-file#attack-modules
In order to prevent duplicate work, we can use the task list below to check off completed attack modules, and comment on which attack you are working on adapting into PyRIT.
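To make "adapted as converters" concrete, here is a minimal, self-contained sketch of what a word-swap style converter might look like. It is only an illustration: it deliberately does not import PyRIT, and the class name `WordSwapConverter`, its `convert` method, and the substitution table are all hypothetical. They are not PyRIT's converter interface and not the word list used by the Moonshot attack module. A real port would subclass whatever base class pyrit.prompt_converter currently provides and follow its method signatures.

```python
import random
import re

# Hypothetical substitution table, for illustration only.
# This is NOT the word list used by the Moonshot attack module.
EXAMPLE_SWAPS = {
    "father": ["dad", "pa"],
    "mother": ["mum", "ma"],
    "friend": ["buddy", "pal"],
}


class WordSwapConverter:
    """Replace whole words in a prompt using a substitution table.

    Hypothetical stand-in for a prompt converter; the name and method
    are assumptions, not PyRIT's actual API.
    """

    def __init__(self, swaps=None, seed=None):
        # Copy the table so callers cannot mutate it after construction.
        self._swaps = dict(EXAMPLE_SWAPS if swaps is None else swaps)
        # A seeded generator makes the choice among alternatives repeatable.
        self._rng = random.Random(seed)

    def convert(self, prompt: str) -> str:
        def replace(match):
            word = match.group(0)
            options = self._swaps.get(word.lower())
            if not options:
                return word  # words without an entry pass through unchanged
            choice = self._rng.choice(options)
            # Keep the capitalization of the first letter of the original word.
            return choice.capitalize() if word[0].isupper() else choice

        # Match alphabetic runs only, so punctuation and spacing are preserved.
        return re.sub(r"[A-Za-z]+", replace, prompt)


if __name__ == "__main__":
    converter = WordSwapConverter(seed=0)
    print(converter.convert("My father asked a friend for directions."))
```

Keeping the substitution table as a constructor argument means each ported module could ship its own word list while sharing the same replacement logic, which may help when several attack modules are submitted as separate PRs.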