This repo contains a suite of Python scripts for training your own ensemble model for sentiment analysis. The ensemble is made up of an SVM with Naive Bayes features (nbsvm), an RNN with LSTM gates (lstm), an RNN with GRU gates and a word vector embedding layer (gru), and Microsoft's gradient boosted tree, LightGBM (lgb).
```
python model_train.py --option_arguments
```

Trains the four individual models on the given training data and saves the model files to the given model directory.
The dataset is a CSV file with two columns, `label` and `text`, where `label` holds the sentiment tags.
label | text |
---|---|
-1 | I hate pie! |
1 | I love pie! |
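A file in this format can be produced with Python's `csv` module; the file name and rows below are illustrative, not from the repo:

```python
import csv

# Write a minimal training file in the expected two-column format.
rows = [
    ("label", "text"),
    ("-1", "I hate pie!"),
    ("1", "I love pie!"),
]
with open("training.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```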
- `--num_classes`: Number of classes in your dataset
- `--val_split`: Proportion of data to use for validation during training
- `--path_data`: Path to data in CSV format (e.g. `/home/data/my_data/training.csv`)
- `--path_embs`: Path to the word embeddings to be used in the model
- `--directory`: Directory to save model and log files to
- `--skip_nbsvm`: Do not train the nbsvm model
- `--skip_lstm`: Do not train the lstm model
- `--skip_gru`: Do not train the gru model
- `--skip_lgb`: Do not train the lgb model
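As a sketch of what `--val_split` controls (the function name and details here are illustrative; the repo's actual split logic may differ, e.g. stratified or unshuffled):

```python
import random

def split_rows(rows, val_split=0.2, seed=42):
    # Shuffle, then hold out the given proportion for validation.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_split)
    return rows[n_val:], rows[:n_val]

# A --val_split of 0.2 keeps 80% for training, 20% for validation.
train, val = split_rows(range(100), val_split=0.2)
print(len(train), len(val))  # 80 20
```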
LSTM
- `--emb_size_lstm`: Size of the LSTM embedding layer
- `--epochs_lstm`: Maximum number of LSTM training epochs
- `--recurrent_size_lstm`: Size of the LSTM recurrent layer
- `--batch_size_lstm`: Training batch size for LSTM
- `--max_feat_lstm`: Maximum number of word features for LSTM
GRU
- `--emb_size_gru`: Size of the GRU embedding layer
- `--epochs_gru`: Maximum number of GRU training epochs
- `--recurrent_size_gru`: Size of the GRU recurrent layer
- `--batch_size_gru`: Training batch size for GRU
- `--max_feat_gru`: Maximum number of word features for GRU
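As a rough guide to how the embedding and recurrent sizes affect model capacity, the classic parameter counts for the recurrent layers can be estimated (4 gate blocks for LSTM, 3 for GRU; exact counts vary with the framework's bias handling, e.g. Keras's `reset_after` GRU adds a second bias). The sizes below are illustrative:

```python
def recurrent_params(emb_size, recurrent_size, gates):
    # One input weight matrix, one recurrent weight matrix, and one
    # bias vector per gate block (4 blocks for LSTM, 3 for GRU).
    return gates * ((emb_size + recurrent_size + 1) * recurrent_size)

# E.g. 200-d embeddings with a recurrent size of 128:
lstm_params = recurrent_params(200, 128, gates=4)  # 168448
gru_params = recurrent_params(200, 128, gates=3)   # 126336
```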
```
python model_predict.py <file_path(type: csv)> <x_col_name> <weight_path> <out_file_path>
```

Uses the trained models to predict on the given dataset and saves the results as a CSV.
text |
---|
I dislike pie. |
- `file_path`: Path to data in CSV format (e.g. `/home/data/my_data/to_predict.csv`)
- `x_col_name`: Text column to predict on (e.g. `text` or `user_message`)
- `weight_path`: Where the models are saved (the `--directory` of the training script)
- `out_file_path`: Where to store the prediction results
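A quick pre-flight check (not part of the repo) that `x_col_name` actually exists in the input CSV can save a failed run:

```python
import csv

def check_text_column(file_path, x_col_name):
    # Read only the header row and confirm the requested column exists.
    with open(file_path, newline="") as f:
        header = next(csv.reader(f))
    return x_col_name in header

# Illustrative file matching the input format shown above:
with open("to_predict.csv", "w", newline="") as f:
    csv.writer(f).writerows([["text"], ["I dislike pie."]])

print(check_text_column("to_predict.csv", "text"))  # True
```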
- Clone the repo:

  ```
  git clone https://github.com/edwardcqian/sentiment_analysis_suite.git && cd sentiment_analysis_suite
  ```

- Install the requirements. For GPU:

  ```
  pip install -r requirements_gpu.txt
  ```

  For CPU:

  ```
  pip install -r requirements.txt
  ```

- Download the GloVe word vector file (WARNING: the file is 1.42 GB):

  ```
  wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
  ```

- Extract it to the root directory of the repo:

  ```
  unzip -j "glove.twitter.27B.zip" "glove.twitter.27B.200d.txt"
  ```
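Each line of the extracted `.txt` file is a token followed by its vector components, space-separated. A minimal loader sketch (function name is illustrative; the repo's own loading code may differ):

```python
def load_glove(path, vocab=None):
    # Parse a GloVe .txt file into {token: [float, ...]}.
    # Pass a vocab set to keep only the words you need and save memory.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            if vocab is None or token in vocab:
                embeddings[token] = [float(v) for v in values]
    return embeddings

# Tiny file in the same format, for illustration:
with open("mini_glove.txt", "w", encoding="utf-8") as f:
    f.write("pie 0.1 0.2\ncake 0.3 0.4\n")

vectors = load_glove("mini_glove.txt", vocab={"pie"})
print(vectors)  # {'pie': [0.1, 0.2]}
```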
Note: the word embeddings used here were trained on Twitter data; several different sets are available, and some may be better suited to your data.
Use the Kaggle movie review dataset to verify that the setup is correct.
- Download the Kaggle dataset here (a Kaggle account is required)
- Create a directory called `test` in the root directory of the repo and extract the `train.tsv` file into the `test` folder
- Set up the data:

  ```
  python movie_review_setup.py
  ```
- Train the models:

  ```
  python model_train.py --path_data test/train_data.csv --path_emb glove.twitter.27B.200d.txt --num_classes 5 --label Sentiment --text Phrase --max_feat_gru 15000 --epochs_lstm 20 --epochs_gru 20
  ```

- Predict:

  ```
  python model_predict.py test/test_data.csv Phrase Model test/pred.csv
  ```
- Check the results:

  ```
  python movie_review_accuracy.py
  ```
```
0.6794181724977573

             precision    recall  f1-score   support

          0       0.70      0.22      0.33       732
          1       0.58      0.54      0.56      2796
          2       0.72      0.86      0.79      7845
          3       0.63      0.57      0.60      3364
          4       0.66      0.31      0.43       869

avg / total       0.67      0.68      0.66     15606
```
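The avg / total row is the support-weighted mean of the per-class rows; recomputing it from the (rounded) table values lands within a rounding step of the reported 0.67 / 0.68 / 0.66:

```python
# Per-class rows from the report above: (precision, recall, f1, support)
report = {
    0: (0.70, 0.22, 0.33, 732),
    1: (0.58, 0.54, 0.56, 2796),
    2: (0.72, 0.86, 0.79, 7845),
    3: (0.63, 0.57, 0.60, 3364),
    4: (0.66, 0.31, 0.43, 869),
}
total_support = sum(row[3] for row in report.values())  # 15606

def weighted_avg(metric_index):
    # Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1).
    return sum(row[metric_index] * row[3] for row in report.values()) / total_support
```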