Natural language processing project based on the one-million-posts dataset.
More than 3,000 user comments are written each day on www.derstandard.at. Moderators need to review these comments for several aspects, such as inappropriate language, discriminatory content, off-topic comments, and questions that need to be answered.
We provide machine learning models that detect potentially problematic comments to ease the moderators' daily work.
- Install pyenv.
- Install Python 3.8.5 via `pyenv install 3.8.5`.
- Run `make setup`.
For further instructions on how to run our code see SETUP.md.
The presentations are found in ./presentations/
Presentation file | Description |
---|---|
OneMillionPosts-GraduationEvent.pdf | Presentation of the graduation event from April 28, 2021 |
OneMillionPosts-Midterm.pdf | Midterm presentation of the project from April 12, 2021 |
OneMillionPosts-AnnotationComposition.pdf | Exploratory data analysis (EDA) for tickets #24 and #25 |
The models' code is found in `./modeling/` in this repo. They are pushed as `.py` files. See SETUP.md.
Model | Description |
---|---|
gbert Classifier | German BERT base |
Zero Shot Classifier | xlm-roberta-large-xnli |
XGBoost | XGBoost |
Logistic Regression | Logistic Regression |
Support Vector Classifier | Support Vector Classifier |
Random Forest Classifier | Random Forest Classifier |
Naive Bayes Classifier | Naive Bayes Classifier |
LightGBM | LightGBM (not considered for further modeling) |
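
To illustrate the kind of task the models in this table address, here is a minimal sketch of a Naive Bayes text classifier that flags comments. It is not the code in `./modeling/`: the class `TinyNaiveBayes`, the whitespace tokenizer, the labels `flag` and `ok`, and the English toy comments are all hypothetical and chosen only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Number of training comments per class, used for the class prior.
        self.class_counts = Counter(labels)
        # Word frequencies per class, used for the word likelihoods.
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Start from the log prior of the class.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                # Add the smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        # Return the class with the highest log posterior score.
        return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the one-million-posts dataset.
train_texts = [
    "you are an idiot",
    "what a stupid idiot",
    "thanks for the interesting article",
    "interesting point about the article",
]
train_labels = ["flag", "flag", "ok", "ok"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
```

Calling `model.predict(...)` on a new comment returns one of the training labels. A realistic setup would use German comments, a proper tokenizer, and one classifier per annotation category.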
The notebooks are found in `./notebooks/`. They are pushed as `.py` files. See SETUP.md.