This repo contains a suite of Python scripts for training your own ensemble model for sentiment analysis. The ensemble is made up of an SVM with Naive Bayes features (nbsvm), an RNN with LSTM gates (lstm), an RNN with GRU gates and a word vector embedding layer (gru), and Microsoft's gradient boosted tree, LightGBM (lgb).
```
python model_train.py --option_arguments
```

Trains the four individual models on the given training data and saves the model files to the given model directory.
The dataset is a CSV file with two columns, `label` and `text`, where `label` holds the sentiment tags.
label | text |
---|---|
-1 | I hate pie! |
1 | I love pie! |
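A file in this format can be produced with Python's `csv` module; the file name and rows below are illustrative, not from the repo:

```python
import csv

# Write a minimal training file in the expected two-column format.
rows = [
    ("label", "text"),
    ("-1", "I hate pie!"),
    ("1", "I love pie!"),
]
with open("training.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```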
- `--num_classes`: Number of classes in your dataset
- `--val_split`: Proportion of data to use for validation during training
- `--path_data`: Path to data in CSV format (e.g. `/home/data/my_data/training.csv`)
- `--path_embs`: Path to the word embeddings to be used in the model
- `--directory`: Directory to save model and log files to
- `--skip_nbsvm`: Do not train the nbsvm model
- `--skip_lstm`: Do not train the lstm model
- `--skip_gru`: Do not train the gru model
- `--skip_lgb`: Do not train the lgb model
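As a sketch of what `--val_split` controls (the function name and details here are illustrative; the repo's actual split logic may differ, e.g. stratified or unshuffled):

```python
import random

def split_rows(rows, val_split=0.2, seed=42):
    # Shuffle, then hold out the given proportion for validation.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_split)
    return rows[n_val:], rows[:n_val]

# A --val_split of 0.2 keeps 80% for training, 20% for validation.
train, val = split_rows(range(100), val_split=0.2)
print(len(train), len(val))  # 80 20
```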
LSTM
- `--emb_size_lstm`: Size of the LSTM embedding layer
- `--epochs_lstm`: Maximum number of LSTM training epochs
- `--recurrent_size_lstm`: Size of the LSTM recurrent layer
- `--batch_size_lstm`: Training batch size for LSTM
- `--max_feat_lstm`: Maximum number of word features for LSTM
GRU
- `--emb_size_gru`: Size of the GRU embedding layer
- `--epochs_gru`: Maximum number of GRU training epochs
- `--recurrent_size_gru`: Size of the GRU recurrent layer
- `--batch_size_gru`: Training batch size for GRU
- `--max_feat_gru`: Maximum number of word features for GRU
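As a rough guide to how the embedding and recurrent sizes affect model capacity, the classic parameter counts for the recurrent layers can be estimated (4 gate blocks for LSTM, 3 for GRU; exact counts vary with the framework's bias handling, e.g. Keras's `reset_after` GRU adds a second bias). The sizes below are illustrative:

```python
def recurrent_params(emb_size, recurrent_size, gates):
    # One input weight matrix, one recurrent weight matrix, and one
    # bias vector per gate block (4 blocks for LSTM, 3 for GRU).
    return gates * ((emb_size + recurrent_size + 1) * recurrent_size)

# E.g. 200-d embeddings with a recurrent size of 128:
lstm_params = recurrent_params(200, 128, gates=4)  # 168448
gru_params = recurrent_params(200, 128, gates=3)   # 126336
```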
```
python model_predict.py <file_path(type: csv)> <x_col_name> <weight_path> <out_file_path>
```

Uses the trained models to predict on the given dataset and saves the results as a CSV.
text |
---|
I dislike pie. |
- `file_path`: Path to data in CSV format (e.g. `/home/data/my_data/to_predict.csv`)
- `x_col_name`: Text column to predict on (e.g. `text` or `user_message`)
- `weight_path`: Where the models are saved (the `--directory` of the training script)
- `out_file_path`: Where to store the prediction results
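A quick pre-flight check (not part of the repo) that `x_col_name` actually exists in the input CSV can save a failed run:

```python
import csv

def check_text_column(file_path, x_col_name):
    # Read only the header row and confirm the requested column exists.
    with open(file_path, newline="") as f:
        header = next(csv.reader(f))
    return x_col_name in header

# Illustrative file matching the input format shown above:
with open("to_predict.csv", "w", newline="") as f:
    csv.writer(f).writerows([["text"], ["I dislike pie."]])

print(check_text_column("to_predict.csv", "text"))  # True
```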
- Clone the repo:

  ```
  git clone https://github.com/edwardcqian/sentiment_analysis_suite.git && cd sentiment_analysis_suite
  ```

- Install the requirements. For GPU:

  ```
  pip install -r requirements_gpu.txt
  ```

  For CPU:

  ```
  pip install -r requirements.txt
  ```

- Download the GloVe word vector file (WARNING: the file is 1.42 GB):

  ```
  wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
  ```

- Extract it to the root directory of the repo:

  ```
  unzip -j "glove.twitter.27B.zip" "glove.twitter.27B.200d.txt"
  ```
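Each line of the extracted `.txt` file is a token followed by its vector components, space-separated. A minimal loader sketch (function name is illustrative; the repo's own loading code may differ):

```python
def load_glove(path, vocab=None):
    # Parse a GloVe .txt file into {token: [float, ...]}.
    # Pass a vocab set to keep only the words you need and save memory.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            if vocab is None or token in vocab:
                embeddings[token] = [float(v) for v in values]
    return embeddings

# Tiny file in the same format, for illustration:
with open("mini_glove.txt", "w", encoding="utf-8") as f:
    f.write("pie 0.1 0.2\ncake 0.3 0.4\n")

vectors = load_glove("mini_glove.txt", vocab={"pie"})
print(vectors)  # {'pie': [0.1, 0.2]}
```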
Note: the word embeddings used here were trained on Twitter data; several different sets are available, and some may be better suited to your data.
Use the Kaggle movie review dataset to verify that the setup is correct.
- Download the Kaggle dataset here (a Kaggle account is required)
- Create a directory called `test` in the root directory of the repo and extract the `train.tsv` file into the `test` folder
- Set up the data:

  ```
  python movie_review_setup.py
  ```
- Train the models:

  ```
  python model_train.py --path_data test/train_data.csv --path_emb glove.twitter.27B.200d.txt --num_classes 5 --label Sentiment --text Phrase --max_feat_gru 15000 --epochs_lstm 20 --epochs_gru 20
  ```

- Predict:

  ```
  python model_predict.py test/test_data.csv Phrase Model test/pred.csv
  ```
- Check the results:

  ```
  python movie_review_accuracy.py
  ```
```
0.6794181724977573

             precision    recall  f1-score   support

          0       0.70      0.22      0.33       732
          1       0.58      0.54      0.56      2796
          2       0.72      0.86      0.79      7845
          3       0.63      0.57      0.60      3364
          4       0.66      0.31      0.43       869

avg / total       0.67      0.68      0.66     15606
```
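The avg / total row is the support-weighted mean of the per-class rows; recomputing it from the (rounded) table values lands within a rounding step of the reported 0.67 / 0.68 / 0.66:

```python
# Per-class rows from the report above: (precision, recall, f1, support)
report = {
    0: (0.70, 0.22, 0.33, 732),
    1: (0.58, 0.54, 0.56, 2796),
    2: (0.72, 0.86, 0.79, 7845),
    3: (0.63, 0.57, 0.60, 3364),
    4: (0.66, 0.31, 0.43, 869),
}
total_support = sum(row[3] for row in report.values())  # 15606

def weighted_avg(metric_index):
    # Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1).
    return sum(row[metric_index] * row[3] for row in report.values()) / total_support
```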