This repository contains a package for solving the sentiment prediction problem using Python and TensorFlow.
The GoEmotions dataset was used for modelling and analysis. It is a human-annotated dataset of Reddit comments labelled with 27 emotions plus neutral. See the GoEmotions repository (https://github.com/google-research/goemotions) for more details.
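GoEmotions is also available through TensorFlow Datasets; the sketch below shows one way to pull it for a quick look (this repository may load the data differently):

```python
import tensorflow_datasets as tfds

# Load GoEmotions from TensorFlow Datasets (one possible source; this repository
# may obtain the data differently). Each example carries the raw comment text
# and one boolean flag per emotion label.
ds = tfds.load("goemotions", split="train")
for example in ds.take(1):
    print(example.keys())  # inspect the available feature keys
```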
This repository contains notebooks that train a classifier based on the BERT model; the classifier itself is defined in bert_model.py.
The models solve a multiclass, multilabel classification problem. See the notebook bert_model_v0.7.1.ipynb.
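To illustrate the multilabel setup (a sketch under assumptions, not the exact architecture in bert_model.py), a BERT encoder from TF Hub can be topped with a sigmoid output layer and trained with binary cross-entropy, so each emotion is predicted independently:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops required by the BERT preprocessor

NUM_LABELS = 28  # 27 emotions + neutral

# TF Hub handles for the preprocessor and the small BERT encoder; the exact
# versions here are assumptions and may differ from those used in this repository.
PREPROCESS_URL = "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3"
ENCODER_URL = "https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2"

text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name="text")
encoder_inputs = hub.KerasLayer(PREPROCESS_URL, name="preprocessing")(text_input)
encoder_outputs = hub.KerasLayer(ENCODER_URL, trainable=True, name="encoder")(encoder_inputs)
# Sigmoid (not softmax) head: the labels are not mutually exclusive.
probs = tf.keras.layers.Dense(NUM_LABELS, activation="sigmoid")(encoder_outputs["pooled_output"])

model = tf.keras.Model(text_input, probs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(3e-5),
    loss=tf.keras.losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.AUC(multi_label=True)],
)
```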
High-quality data is important for good modelling. standardize.py provides a text standardization layer for TensorFlow that cleans chat/comment-style English text. The layer is integrated into the model (and can be reused in other models), so the model works directly with raw text. The following transformations are applied (a simplified sketch of some of them is shown after the list):
- fix unicode chars (unidecode lib)
- replace contractions with full words
- fix stretched letters in words (e.g. soooooo, youuuuuuuu)
- replace chat language with full phrases (e.g. lol, asap)
- remove placeholders used in GoEmotions dataset (e.g. [NAME], [RELIGION])
- remove words containing numbers
- remove /r tags used in Reddit comments (GoEmotions source)
- remove all characters except for letters, some punctuation and hyphens
- replace duplicated punctuation with a single char
- properly set punctuation: no space before, one space after
- remove multiple spaces and trim
- convert to lowercase
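Below is a minimal sketch of how a subset of these steps can be expressed with TensorFlow string ops (this is not the actual standardize.py, and the regular expressions are simplified assumptions), so the cleaning runs inside the graph and the model can consume raw strings:

```python
import tensorflow as tf

def standardize(text: tf.Tensor) -> tf.Tensor:
    """Simplified sketch of a few cleaning steps; not the repository's standardize.py."""
    text = tf.strings.regex_replace(text, r"\[[A-Z]+\]", " ")      # GoEmotions placeholders, e.g. [NAME]
    text = tf.strings.regex_replace(text, r"/r/\w+", " ")          # Reddit /r tags
    text = tf.strings.lower(text)                                  # lowercase
    text = tf.strings.regex_replace(text, r"\S*\d+\S*", " ")       # words containing numbers
    text = tf.strings.regex_replace(text, r"[^a-z.,!?'\- ]", " ")  # keep letters, some punctuation, hyphens
    text = tf.strings.regex_replace(text, r"\.{2,}", ".")          # collapse duplicated punctuation
    text = tf.strings.regex_replace(text, r"!{2,}", "!")
    text = tf.strings.regex_replace(text, r"\?{2,}", "?")
    text = tf.strings.regex_replace(text, r" *([.,!?]) *", r"\1 ") # no space before, one space after punctuation
    text = tf.strings.regex_replace(text, r"\s{2,}", " ")          # collapse multiple spaces
    return tf.strings.strip(text)                                  # trim

# The function can be wrapped in a Keras layer so raw strings go straight into the model:
standardize_layer = tf.keras.layers.Lambda(standardize, name="standardize")
print(standardize(tf.constant(["Thanks, [NAME]!!!   See /r/aww"])))
```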
Various handy functions used for modelling and analysis can be found in utils.py.
You can use the notebooks as examples to develop your own models. See requirements.txt for the list of required libraries.
Using small_bert/bert_en_uncased_L-2_H-128_A-2, the classifier reached the following metrics.

Emotion prediction:
- F1-Score (micro): 0.5835
- F1-Score (macro): 0.5070

Sentiment prediction:
- F1-Score (micro): 0.7760
- F1-Score (macro): 0.7349
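For reference, micro- and macro-averaged F1 scores for a multilabel classifier can be computed from binarized predictions, e.g. with scikit-learn (a generic sketch; the 0.5 threshold and the variable names are assumptions, not this repository's evaluation code):

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true: binary indicator matrix (n_samples, n_labels); y_prob: predicted probabilities.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
y_pred = (y_prob >= 0.5).astype(int)  # the 0.5 decision threshold is an assumption

print("F1 (micro):", f1_score(y_true, y_pred, average="micro"))
print("F1 (macro):", f1_score(y_true, y_pred, average="macro"))
```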