FEAT: Violent Durian Attack Strategy #398
Co-authored-by: Roman Lutz <romanlutz13@gmail.com>
Hello @romanlutz, thank you for your feedback! The requested changes have been made. Please feel free to review them again, and let me know if further adjustments are needed. I'll be happy to make them. :)
You need to rerun
Thank you, done!
Overview
This PR integrates the Violent Durian attack strategy from Project Moonshot into PyRIT. In this strategy, an attacker LLM manipulates the target LLM into adopting a criminal persona and providing illegal advice.
However, the attacker LLM refuses the initial prompt, likely because of its content moderation policies. This PR adapts the attack strategy while acknowledging that limitation of the attacker LLM.
Work Completed
YAML:
Conversation Objective:
Issue: Attacker LLM Denial:
Testing the Attack on Uncensored Models:
Next Steps
Test with Uncensored Models:
Refinement of Scorer Feedback:
Related Issue
Contributes to #376, focused on adding attack modules from Project Moonshot.