This repository contains data on inappropriate language in online discussions, annotated through a combination of expert annotation, crowdsourcing, and ChatGPT-based methods.
ChatGPT_explicit: This subfolder contains annotations of explicit inappropriate language identified by ChatGPT.
ExplicitlyInappropriateLanguageInContext: This subfolder contains both crowd and expert annotations that mark instances of explicitly inappropriate language.
Includes scripts and code used for data processing, analysis, etc.
Holds the raw and processed data used for annotation and analysis. This includes input data in various formats and intermediate data sets generated during processing.
Contains files related to the LingoTurk platform, which was used for collecting annotations. This includes task configurations and instructions.
Includes statistical reports and summaries derived from the data set.
Contains detailed analyses of annotation results, including comparisons between different annotation methods, inter-annotator agreements, error analysis, and insights into annotation discrepancies.
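As an illustration of what an inter-annotator agreement comparison can look like, here is a minimal sketch that computes Cohen's kappa between two annotation sources. It is not the analysis code from this repository: the function name `cohen_kappa`, the label values, and the example lists are all hypothetical, and the repository's own analyses may use a different agreement measure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels.

    Hypothetical helper for illustration; not part of this repository.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both label sequences must be non-empty and of equal length.")

    n = len(labels_a)

    # Observed agreement: fraction of items on which both sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected chance agreement, from each source's own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If both sources always use the same single label, kappa is undefined;
    # this sketch returns 1.0 in that case by convention.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up labels for four comments (assumed label names, not real data).
expert = ["inappropriate", "inappropriate", "appropriate", "appropriate"]
crowd = ["inappropriate", "appropriate", "appropriate", "appropriate"]
print(cohen_kappa(expert, crowd))
```

The same function could be applied to any pair of sources, for example expert versus ChatGPT labels, as long as both lists refer to the same items in the same order.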
Researchers and developers interested in content moderation, natural language processing, and online discourse analysis can benefit from this data set and associated resources.
If you use this data set or findings from this repository in your research or projects, please consider citing this repository and our paper.
Citing the paper:
Citing the repository: https://github.com/cltl/InappropriateLanguageDetection
If you have any questions, please contact me at b[dot]barbarestani[at]vu[dot]nl.