By Parth Mistry
- The dataset contains 50,000 movie reviews for natural language processing or text analytics.
- This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets.
- We have a set of 25,000 highly polar movie reviews for training and 25,000 for testing.
- The task is to classify each review as positive or negative, using either classical classification or deep learning algorithms.
- Here we will be using Logistic Regression to classify the reviews.
Python Version: 3.7
Packages: pandas, numpy, sklearn, nltk, pickle
Dataset: IMDB Movie Reviews
- Transforming Documents to Feature Vectors
- Checking word relevancy using TF-IDF
- Calculating TF-IDF of each term
- Removing noisy data
- Tokenization of documents
- Transforming Text Data into TF-IDF Vectors
- Document Classification using Logistic Regression
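The cleaning, tokenization and TF-IDF steps listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the toy `docs` corpus and the `tokenize` and `tfidf` helpers are hypothetical, and the weighting assumed here is the textbook form, raw term count times ln(N / df). scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each vector by default, so its numbers would differ.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the IMDB reviews.
docs = [
    "<br />This movie was GREAT, really great!",
    "This movie was terrible.",
    "A great cast and a great story.",
]

def tokenize(text):
    """Remove noise (HTML tags, punctuation, case) and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags such as <br />
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf}, with tf = raw count and idf = ln(N / df)."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
```

Under this weighting, a term that occurs in only one review (such as "terrible") scores higher than a term shared by several reviews (such as "movie"), which is why TF-IDF is useful for checking word relevancy.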
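    from sklearn.linear_model import LogisticRegressionCV

    # Logistic regression with 5-fold cross-validation, scored on accuracy
    LogisticRegressionCV(cv=5,
                         scoring='accuracy',
                         random_state=0,
                         n_jobs=-1,
                         verbose=3,
                         max_iter=300)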
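As a rough sketch of how this classifier might be fitted on TF-IDF vectors, the example below uses the same `cv`, `scoring`, `random_state` and `max_iter` values as the configuration above. The `reviews`, `labels` and `new_reviews` data are hypothetical stand-ins for the IMDB dataset, and `n_jobs` and `verbose` are left out only to keep the run quiet.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV

# Hypothetical toy data; 1 = positive review, 0 = negative review.
reviews = [
    "a wonderful film with a great cast",
    "great story and wonderful acting",
    "i loved this brilliant movie",
    "a brilliant and moving story",
    "loved the acting, great film",
    "wonderful, moving and brilliant",
    "a boring film with an awful cast",
    "awful story and terrible acting",
    "i hated this boring movie",
    "a terrible and dull story",
    "hated the acting, awful film",
    "boring, dull and terrible",
]
labels = [1] * 6 + [0] * 6

# Transform the text into TF-IDF feature vectors.
vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(reviews)

# Each class needs at least 5 samples for stratified 5-fold cross-validation.
clf = LogisticRegressionCV(cv=5, scoring="accuracy", random_state=0, max_iter=300)
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction.
new_reviews = ["a wonderful and moving film", "boring and awful"]
predictions = clf.predict(vectorizer.transform(new_reviews))
```

The vectorizer is fitted once on the training text and reused for new reviews, so the training and prediction vectors share the same vocabulary.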
- Here, I trained Logistic Regression on the cleaned data, and it classified movie reviews with 89% accuracy.
Parth Mistry © 2020