Sentiment Analysis on IMDb Movie Reviews.
IMDb (Internet Movie Database) is an online database of information related to films, television programs, home videos, video games, and internet streams, including cast, production crew and personnel biographies, plot summaries, trivia, and fan reviews and ratings.
This is a Kaggle Competition: Bag of Words Meets Bags of Popcorn.
You can install the dependencies by running the following commands in the Anaconda Prompt:
# Theano
conda install mingw libpython
conda install mkl=2017.0.3
# Keras
pip install keras
We also need the NLTK (Natural Language Toolkit) package. It is already installed in Anaconda.
If it is not installed, you can install it by running the commands in Anaconda prompt:
# NLTK
conda install -c conda-forge nltk
After installing the NLTK package, we need to download the NLTK data.
import nltk
nltk.download()
The NLTK downloader window will pop up.
Choose to download all packages.
There are two datasets: 1) labeledTrainData and 2) testData.
These datasets were downloaded from the Kaggle competition Bag of Words Meets Bags of Popcorn.
labeledTrainData has 25,000 rows and 3 columns: id, sentiment, and review.
testData has 25,000 rows and only 2 columns: id and review. We have to predict the sentiment of these reviews.
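As a minimal sketch of how a file in this layout can be loaded, the example below reads two made-up rows with pandas. The sample rows are hypothetical, and `quoting=csv.QUOTE_NONE` is an assumption (it keeps double quotes inside reviews as plain text), not necessarily the setting used in this project.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the same tab-separated layout as labeledTrainData.tsv
# (hypothetical sample; the real file has 25,000 rows).
sample_tsv = (
    "id\tsentiment\treview\n"
    "1_8\t1\tA wonderful little film.\n"
    "2_3\t0\tDull and far too long.\n"
)

# For the real data, pass the path to the .tsv file instead of the StringIO object.
# QUOTE_NONE is an assumed setting: quotes inside reviews are treated as plain text.
train = pd.read_csv(
    io.StringIO(sample_tsv), header=0, delimiter="\t", quoting=csv.QUOTE_NONE
)

print(train.shape)
print(list(train.columns))
```

The same call, pointed at testData.tsv, would yield a frame with only the id and review columns.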
Sentiment analysis of the IMDb movie review datasets is done using two different machine learning algorithms:
- Random forest
- Recurrent Neural Network.
First, we trained the model using Random Forest. The score on Kaggle comes out to be 0.84176.
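The following is a rough sketch of a Random Forest text classifier, assuming bag-of-words features built with scikit-learn's `CountVectorizer`. The feature extraction, the tiny corpus, and the hyperparameters are illustrative assumptions, not the settings that produced the score above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus standing in for the 25,000 labeled reviews.
train_reviews = [
    "a wonderful and moving film",
    "great acting and a great story",
    "dull plot and terrible acting",
    "a boring, terrible waste of time",
]
train_labels = [1, 1, 0, 0]
test_reviews = ["wonderful story", "terrible and boring"]

# Bag of words: each review becomes a vector of word counts.
vectorizer = CountVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_reviews)
X_test = vectorizer.transform(test_reviews)  # reuse the training vocabulary

# n_estimators and random_state are illustrative choices.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, train_labels)

predictions = forest.predict(X_test)
print(predictions)
```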
We also trained LSTM and GRU recurrent neural networks using different preprocessing techniques, such as Porter stemming and stop-word removal. Training accuracy ranges from 91.57% to 92.76%, and the Kaggle score ranges from 0.86768 to 0.87896.
The highest Kaggle score, 0.87896, was obtained with the LSTM recurrent neural network, across all the algorithms and preprocessing techniques tried.
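To make the preprocessing options concrete, here is a small sketch of stop-word removal and Porter stemming. The `preprocess` function, its `method` values, and the short `STOP_WORDS` set are hypothetical names introduced for illustration; the project's own code may clean text differently.

```python
import re

from nltk.stem import PorterStemmer

# Hypothetical, deliberately tiny stop-word list so the example runs without
# downloading NLTK data; nltk.corpus.stopwords.words("english") is the fuller option.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "of", "this", "it", "to"}
stemmer = PorterStemmer()


def preprocess(review, method="neither"):
    """Lowercase, keep letters only, then apply the chosen method."""
    words = re.sub(r"[^a-zA-Z]", " ", review).lower().split()
    if method == "stop_words":
        words = [w for w in words if w not in STOP_WORDS]
    elif method == "porter":
        words = [stemmer.stem(w) for w in words]
    return " ".join(words)


review = "This movie was amazing, and the actors were great!"
print(preprocess(review, "stop_words"))
print(preprocess(review, "porter"))
```

Stop-word removal shortens the review, while stemming maps inflected forms such as "movie" and "movies" to a shared stem, which shrinks the vocabulary the network has to learn.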
- Change the path in read_csv() to the location of your labeledTrainData.tsv.
- Change the path in read_csv() to the location of your testData.tsv.
- Run the file.
- Change the paths of data_train and data_test in 'Sentiment Analysis using RNN' to the locations of the respective datasets.
- Run the file.
- First, it will ask which preprocessing method to use: Porter stemming, stop words, or neither. It will process the data accordingly.
- Then, it will ask which model to use: LSTM RNN or GRU RNN.
- Compile the model, and it will create a CSV file with the predicted sentiment of the test data.
- Now, to predict your own review, run 'Predict Class For IMDb Movie Review.py'.
- It will ask which model and which preprocessing method to use, and then it will predict the sentiment of the review.
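The choice between LSTM and GRU in the steps above could look roughly like the sketch below. The `build_model` helper, the layer sizes, and the vocabulary size are assumptions made for illustration; they are not taken from the project's script.

```python
import numpy as np
from keras.layers import GRU, LSTM, Dense, Embedding
from keras.models import Sequential


def build_model(kind="lstm", vocab_size=5000, embed_dim=32, units=64):
    """Embedding -> LSTM or GRU -> sigmoid output for binary sentiment."""
    recurrent = LSTM(units) if kind == "lstm" else GRU(units)
    model = Sequential([
        Embedding(vocab_size, embed_dim),
        recurrent,
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


model = build_model("gru")
# Two dummy reviews, each already encoded as 10 integer word indices.
dummy_batch = np.zeros((2, 10), dtype="int32")
probabilities = model.predict(dummy_batch, verbose=0)
print(probabilities.shape)
```

The sigmoid output is one probability per review, so an untrained model already returns values between 0 and 1; training on the labeled reviews is what makes them meaningful.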
It will create a CSV file of predicted data for Kaggle submission, containing two columns: id and sentiment.
id will be the "id" column from testData, and sentiment will be the predicted value from the model.
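A minimal sketch of writing a file in this two-column layout, using only the standard library, is shown below. The ids, the predictions, and the file name `submission.csv` mentioned in the comment are made up for illustration.

```python
import csv
import io

# Hypothetical ids and predictions standing in for the testData ids and model output.
test_ids = ["12311_10", "8348_2", "5828_4"]
predicted = [1, 0, 1]

# Written to an in-memory buffer here; for a real file, use
# open("submission.csv", "w", newline="") instead (file name is an assumption).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "sentiment"])        # header row
writer.writerows(zip(test_ids, predicted))  # one row per test review

submission = buffer.getvalue()
print(submission)
```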
To learn more about recurrent neural networks, check this course.
Read more about sentiment analysis using deep learning methods in this paper by Lei Zhang (LinkedIn Corporation), Shuai Wang, and Bing Liu.