Toxic Spans Detection (SemEval-2021 Task 5)

Introduction

This is the code accompanying our submission to SemEval-2021 Task 5.

For a detailed description of the technical details and experimental results, please refer to our paper:

Lone Pine at SemEval-2021 Task 5: Fine-Grained Detection of Hate Speech Using BERToxic

Yakoob Khan, Weicheng Ma, Soroush Vosoughi

Dartmouth College

Abstract

This paper describes our approach to the Toxic Spans Detection problem (SemEval-2021 Task 5). We propose BERToxic, a system that fine-tunes a pre trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assigning a toxic label to words that have at least one token labeled as toxic. Through experiments, we show that these two postprocessing steps improve the performance of our model by 4.16% on the test set. We also studied the effects of data augmentation and ensemble modeling strategies on our system. Our system significantly outperformed the provided baseline and achieved an F1 score of 0.683, placing Lone Pine in the 17th place out of 91 teams in the competition.

Usage

Here's a link to a sample Google Colab Jupyter Notebook that utilizes this code for toxic spans detection.

To fine-tune the BERToxic system for the toxic spans detection, enter the following:

python3 './train_bert.py' 
  --model_type 'bert-base-cased' \
  --train_dir '../data/tsd_train.csv' \
  --dev_dir '../data/tsd_trial.csv' \
  --test_dir '../data/tsd_test.csv' \
  --epochs 2 \
  --warm_up_steps 500 \
  --learning_rate 5e-5 \
  --weight_decay 0.01 \
  --batch_size 16

Dependencies

All experiments were ran on Google Colab Pro's High-RAM environment using a single P100 GPU. See requirements for complete list of all dependencies used and their respective versions.

Attribution

The mt-dnn code was obtained from Liu et al. and the task_organizers_code was obtained from Pavlopoulos et al..

The remaining code in this repository was developed by Yakoob Khan.

Citation

If you find this code or our paper useful, please consider citing:

@misc{khan2021lone,
  title={Lone Pine at SemEval-2021 Task 5: 
  Fine-Grained Detection of Hate Speech Using BERToxic}, 
  author={Yakoob Khan and Weicheng Ma and Soroush Vosoughi},
  year={2021},
  eprint={2104.03506},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
.vscode		.vscode
data		data
experiments		experiments
figures		figures
mt-dnn		mt-dnn
src		src
README.md		README.md
SemEval Paper.pdf		SemEval Paper.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Toxic Spans Detection (SemEval-2021 Task 5)

Introduction

Abstract

Usage

Dependencies

Attribution

Citation

References

About

Releases

Packages

Languages

Yakoob-Khan/Toxic-Spans-Detection

Folders and files

Latest commit

History

Repository files navigation

Toxic Spans Detection (SemEval-2021 Task 5)

Introduction

Abstract

Usage

Dependencies

Attribution

Citation

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages