This is the code accompanying our submission to SemEval-2021 Task 5.
For the technical details and experimental results, please refer to our paper:
Lone Pine at SemEval-2021 Task 5: Fine-Grained Detection of Hate Speech Using BERToxic
Yakoob Khan, Weicheng Ma, Soroush Vosoughi
Dartmouth College
This paper describes our approach to the Toxic Spans Detection problem (SemEval-2021 Task 5). We propose BERToxic, a system that fine-tunes a pre-trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. Through experiments, we show that these two post-processing steps improve the performance of our model by 4.16% on the test set. We also studied the effects of data augmentation and ensemble modeling strategies on our system. Our system significantly outperformed the provided baseline and achieved an F1 score of 0.683, placing Lone Pine 17th out of 91 teams in the competition.
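The two post-processing steps described above can be illustrated with a minimal sketch. This is not the code from this repository: the function names (`fill_gaps`, `expand_to_words`), the token span representation, and the choice of whitespace-delimited words are all assumptions made for illustration.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def fill_gaps(token_spans: List[Span], toxic: List[bool]) -> Set[int]:
    """Step 1 (sketch): collect the offsets of toxic tokens and also mark
    the characters between two consecutive toxic tokens as toxic."""
    offsets: Set[int] = set()
    for i, ((start, end), is_toxic) in enumerate(zip(token_spans, toxic)):
        if not is_toxic:
            continue
        offsets.update(range(start, end))
        # If the next token is toxic too, fill the gap between them.
        if i + 1 < len(token_spans) and toxic[i + 1]:
            offsets.update(range(end, token_spans[i + 1][0]))
    return offsets


def expand_to_words(text: str, offsets: Set[int]) -> Set[int]:
    """Step 2 (sketch): if any character of a word is toxic, mark the whole
    word as toxic. Words are assumed to be whitespace-delimited here."""
    expanded = set(offsets)
    for match in re.finditer(r"\S+", text):
        word = range(match.start(), match.end())
        if any(i in offsets for i in word):
            expanded.update(word)
    return expanded


# Hypothetical example: only the subword "##pid" and the word "idiot"
# were predicted as toxic by the token classifier.
text = "you are a stupid idiot ok"
token_spans = [(0, 3), (4, 7), (8, 9), (10, 13), (13, 16), (17, 22), (23, 25)]
toxic = [False, False, False, False, True, True, False]

result = expand_to_words(text, fill_gaps(token_spans, toxic))
print(sorted(result))
```

In this example the gap-filling step adds the space between "stupid" and "idiot", and the word-level step extends the toxic label from "##pid" to the whole word "stupid", so the final offsets cover the substring "stupid idiot".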
Here's a link to a sample Google Colab Jupyter Notebook that utilizes this code for toxic spans detection.
To fine-tune the BERToxic system for toxic spans detection, run the following command:
python3 './train_bert.py' \
--model_type 'bert-base-cased' \
--train_dir '../data/tsd_train.csv' \
--dev_dir '../data/tsd_trial.csv' \
--test_dir '../data/tsd_test.csv' \
--epochs 2 \
--warm_up_steps 500 \
--learning_rate 5e-5 \
--weight_decay 0.01 \
--batch_size 16
All experiments were run on Google Colab Pro's High-RAM environment using a single P100 GPU. See requirements for the complete list of dependencies and their respective versions.
The mt-dnn code was obtained from Liu et al., and the task_organizers_code was obtained from Pavlopoulos et al.
The remaining code in this repository was developed by Yakoob Khan.
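For readers unfamiliar with how the task is scored, the sketch below shows a character-offset F1 as we understand the task definition: each post is scored by comparing the predicted and gold sets of toxic character offsets, and the per-post scores are averaged. It is an illustrative reimplementation, not the task_organizers_code itself, and the function names `span_f1` and `mean_span_f1` are hypothetical.

```python
from typing import Iterable, List


def span_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between two collections of toxic character offsets for one post.
    Assumption: two empty sets count as a perfect match, and exactly one
    empty set counts as a complete miss."""
    pred_set, gold_set = set(predicted), set(gold)
    if not gold_set:
        return 1.0 if not pred_set else 0.0
    if not pred_set:
        return 0.0
    overlap = len(pred_set & gold_set)
    return 2 * overlap / (len(pred_set) + len(gold_set))


def mean_span_f1(predictions: List[List[int]], golds: List[List[int]]) -> float:
    """Average the per-post F1 over all posts."""
    scores = [span_f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)


# Hypothetical usage: two posts, the second has no toxic span at all.
print(mean_span_f1([[1, 2, 3], []], [[2, 3, 4], []]))
```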
If you find this code or our paper useful, please consider citing:
@misc{khan2021lone,
title={Lone Pine at SemEval-2021 Task 5:
Fine-Grained Detection of Hate Speech Using BERToxic},
author={Yakoob Khan and Weicheng Ma and Soroush Vosoughi},
year={2021},
eprint={2104.03506},
archivePrefix={arXiv},
primaryClass={cs.CL}
}