The official code to reproduce the results in the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
The code is divided into sub-packages:
1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization-based attacks such as HotFlip
3. ./Toxicity Classifier - a classifier that labels sentences as toxic/non-toxic
4. ./Data - data handling
5. ./Resources - resources for other categories
As shown in the figure below, we train a classifier to predict whether a sentence is toxic or non-toxic.
We attack this model with the white-box HotFlip algorithm and distill its knowledge into a second model, DistFlip, which is able to generate attacks in a black-box manner.
These attacks generalize well to the Google Perspective API (tested January 2019).
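A minimal sketch of the attack step that gets distilled, assuming a PyTorch character-level classifier. The function `hotflip_best_flip`, the toy tensors, and the demo shapes are illustrative assumptions, not the repository's actual API; they only show the first-order HotFlip scoring idea that the student model later imitates.

```python
# Illustrative sketch only; names and shapes are assumptions, not the repo's API.
import torch


def hotflip_best_flip(embedding_grad, embedding_matrix, token_ids):
    """First-order HotFlip scoring: estimate the loss change from replacing the
    character at each position with every vocabulary entry, and return the
    (position, replacement) pair with the largest estimated increase."""
    current = embedding_matrix[token_ids]                     # (seq_len, emb_dim)
    # score[i, v] ~ (e_v - e_{x_i}) . grad_i
    scores = embedding_grad @ embedding_matrix.T              # (seq_len, vocab)
    scores -= (current * embedding_grad).sum(-1, keepdim=True)
    position = int(scores.max(dim=1).values.argmax())
    replacement = int(scores[position].argmax())
    return position, replacement


if __name__ == "__main__":
    # Toy demo with random tensors standing in for a real classifier's gradients.
    vocab, emb_dim, seq_len = 50, 16, 20
    embedding_matrix = torch.randn(vocab, emb_dim)
    token_ids = torch.randint(0, vocab, (seq_len,))
    embedding_grad = torch.randn(seq_len, emb_dim)            # d(loss)/d(embeddings)
    pos, repl = hotflip_best_flip(embedding_grad, embedding_matrix, token_ids)
    # The distilled model (DistFlip) is trained to predict such flips directly
    # from text, so no gradient access is needed at attack time.
    print(f"flip position {pos} -> character id {repl}")
```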
We used the data from this Kaggle challenge by Jigsaw.
To flip data using HotFlip+, you can download the data from Google Drive and unzip it into: ./toxic_fool/resources/data
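If you prefer to do the unzip step from Python, here is a small sketch; the archive name `data.zip` is an assumption, so use whatever filename the Google Drive download produces.

```python
# Sketch: extract the downloaded archive into the path the repo expects.
import zipfile
from pathlib import Path

archive = Path("data.zip")                     # hypothetical downloaded file name
target = Path("./toxic_fool/resources/data")   # path expected by the repo
target.mkdir(parents=True, exist_ok=True)
with zipfile.ZipFile(archive) as zf:
    zf.extractall(target)
```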
The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green).