[Multiple Tasks] FEAT add attack modules from moonshot #376

eugeniavkim · 2024-09-18T03:31:30Z

eugeniavkim · 2024-09-18T17:17:31Z

I will take on the colloquial wordswap attack and mark it completed on the task list once it is done 👍

visirion07 · 2024-09-18T21:03:15Z

I will "attack" Textfooler and Textbugger. Will mark it completed once done.

KutalVolkan · 2024-09-21T06:14:17Z

I would like to work on Malicious Question Generator and Violent Durian.

I also took a look at the Toxic Sentence Generator and noticed that 22 files have been flagged as unsafe. Just wanted to check with you—is it still safe to proceed with this model, or should we apply the same approach used in the Malicious Question Generator as an alternative?

Here’s the link to the files I mentioned: Toxic Sentence Generator.

Looking forward to your thoughts!

Malicious Question Generator

romanlutz · 2024-09-21T14:37:39Z

@KutalVolkan go ahead! Which files are unsafe?

KutalVolkan · 2024-09-22T05:37:14Z

> @KutalVolkan go ahead! Which files are unsafe?

Hello Roman,

Here’s the link and the screenshot I mentioned regarding the unsafe files: Toxic Sentence Generator on Hugging Face.

KutalVolkan · 2024-09-22T06:21:49Z

Hello @romanlutz,

A few additional questions:

Should we create a PR for each converter individually, e.g., for the Malicious Question Generator, or should we wait until all the above attack modules from Project Moonshot are finished before submitting the PR?

Submitting separate PRs might allow for more focused reviews and quicker feedback on each converter, but I'll defer to your preference on how you'd like to handle it.

Regarding Violent Durian, I initially thought it would function more like a strategy inside the Red Teaming Orchestrator. Upon further review, I see that it operates more dynamically by convincing the LLM (prompt target) to take on a criminal persona. The setup involves a multi-turn agent that manipulates the LLM into gradually adopting the identity of a criminal (e.g., Zodiac Killer, Ted Bundy) and generating responses as if it were that persona.

This contrasts with a standard converter that mostly modifies the input prompt. In this case, Violent Durian seems to guide a multi-turn conversation, progressively influencing the LLM to respond unethically and act in alignment with the persona.

For example, I plan to integrate this behavior into the Red Teaming Orchestrator by dynamically selecting a criminal persona and applying it to the conversation objective in the YAML-based attack strategy, adapting the YAML to fit the Violent Durian use case.

If you have a different approach or best practices to suggest, I’d be happy to incorporate them. Looking forward to your thoughts 😀
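To make the converter-versus-strategy distinction described above concrete, here is a minimal sketch. It does not use PyRIT's or Moonshot's actual classes: `charswap_converter`, `run_multi_turn`, and the callback parameters `next_message`, `target`, and `objective_met` are hypothetical names chosen for illustration. The point is only that a converter is a stateless, single-pass transformation of one prompt, while a multi-turn strategy keeps conversation history and decides each next message from it.

```python
import random
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (message sent, reply received)


def charswap_converter(prompt: str, seed: int = 0) -> str:
    """Hypothetical converter: one stateless pass over a single prompt.

    Swaps one pair of adjacent inner characters in every word longer
    than three characters, leaving first and last characters in place.
    """
    rng = random.Random(seed)
    words = []
    for word in prompt.split(" "):
        if len(word) > 3:
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)


def run_multi_turn(
    objective: str,
    next_message: Callable[[str, List[Turn]], str],
    target: Callable[[str], str],
    objective_met: Callable[[str], bool],
    max_turns: int = 5,
) -> List[Turn]:
    """Hypothetical multi-turn loop: each message depends on the history.

    `next_message` stands in for the adversarial agent, `target` for the
    prompt target, and `objective_met` for a scorer. All three are
    assumptions of this sketch, not real library interfaces.
    """
    history: List[Turn] = []
    for _ in range(max_turns):
        message = next_message(objective, history)
        reply = target(message)
        history.append((message, reply))
        if objective_met(reply):
            break
    return history


if __name__ == "__main__":
    print(charswap_converter("testing prompt", seed=1))
    # Stub callbacks so the loop can run without any model behind it.
    turns = run_multi_turn(
        objective="placeholder objective",
        next_message=lambda obj, hist: f"{obj} (turn {len(hist) + 1})",
        target=lambda msg: f"echo: {msg}",
        objective_met=lambda reply: "turn 2" in reply,
    )
    print(len(turns))
```

In this framing, a persona-driven attack such as Violent Durian would live in the `next_message` role rather than in a converter, since it needs the running history to steer the conversation.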

romanlutz · 2024-09-23T20:38:16Z

Yes, individual PRs are preferable, unless you're reusing pieces. Even then it's probably better to have them one after the other.

Your idea to use it on the orchestrator level makes sense. Essentially, this would be a new custom attack strategy.

romanlutz · 2024-09-23T21:01:00Z

> > @KutalVolkan go ahead! Which files are unsafe?
>
> Hello Roman,
>
> Here’s the link and the screenshot I mentioned regarding the unsafe files: Toxic Sentence Generator on Hugging Face.

Good question...

I have not used them before, but this sounds suspicious. Maybe it's because they're binary?
I suppose we could go back to the paper and check how they generated these but that could involve a lot of work.
Otherwise, I'm inclined to skip. Don't want to be responsible for making your machine unsafe 😆

nina-msft · 2024-10-02T21:37:27Z

Marking this with good first issue. The remaining tasks of:

Insert Punctuation Attack
Job Role Generator
Toxic Sentence Generator

may be good first issues to tackle.

nina-msft · 2024-10-02T21:38:20Z

@visirion07 - are you still planning on taking a look at Textfooler and Textbugger? 😄

visirion07 · 2024-10-02T23:18:03Z

Yes @nina-msft. Sorry, I got held up with some other work. Taking this up as a high priority. Will post an ETA soon.

This was referenced Sep 25, 2024

FEAT: Malicious Question Generator #397

Merged

FEAT: Violent Durian Attack Strategy #398

Merged

FEAT: Charswap Attack #403

Merged

eugeniavkim mentioned this issue Oct 1, 2024

FEAT: Colloquial Wordswap Attack #406

Merged

KutalVolkan mentioned this issue Oct 2, 2024

FEAT: Homoglyph Attack #407

Merged

nina-msft added the converters (Related to PyRIT converters), enhancement (New feature or request), good first issue (Good for newcomers), and help wanted (Extra attention is needed) labels Oct 2, 2024

eugeniavkim mentioned this issue Oct 2, 2024

FEAT: Generalize Colloquial Wordswap Attack Converter #418

Open

nina-msft changed the title ~~FEAT add attack modules from moonshot~~ [Multiple Tasks] FEAT add attack modules from moonshot Oct 3, 2024

This was referenced Oct 3, 2024

FEAT Moonshot Attack Module: Insert Punctuation Attack #426

Open

FEAT Moonshot Attack Module: Job Role Generator #427

Open

FEAT Moonshot Attack Module: Toxic Sentence Generator #428

Open

ghost mentioned this issue Oct 26, 2024

FEAT: Job Role Generator attack module from Project Moonshot #506

Closed
