Yelp Rating Prediction

The Yelp dataset is a subset of Yelp's businesses, reviews, and user data, made available for personal, educational, and academic purposes. The dataset contains around 5,200,000 reviews, 174,000 businesses, 100,000 pictures, 10 metropolitan areas, 100,000 tips by 1,300,000 users, 1.2 million business attributes such as hours, parking, availability, and ambience, and aggregated check-ins over time for each of the 174,000 businesses.

The goal of this project was to predict reviews' star ratings on Yelp using the review text. We built the following models that perform text analysis on review data to predict the star ratings.

Feature Selection

Basic without any Filtering
Stop Word Removal
Stemming using Snowball Stemmer

Machine Learning Algorithms used

Logistic Regression
Support Vector Machine
Naive Bayes

Data and Preprocessing

The "Yelp Dataset Challenge" dataset was selected for this research. The Yelp dataset has been published for study in photo classification, graph mining, and natural language processing and sentiment analysis. A Python script parses the reviews JSON data file. During parsing, only star ratings and review text are taken into consideration; all other information is ignored. The raw data is stored in three different dictionaries, for reviews, sentiments, and stars. In the data pre-processing phase, the entire text is converted to lowercase to reduce redundancy in subsequent feature selection. Several regular expressions are applied, followed by the removal of punctuation and white spaces from the review text. For the first, basic sentiment analysis, a simple rule is used: if the star rating is greater than 3, the value 1.0 is assigned, which is interpreted as a "Positive" sentiment; otherwise 0.0 is assigned, for a "Negative" sentiment.

Feature Selection

In the scope of this research, a unique feature set is built from the user text reviews. In addition, some variations of the process are implemented: (1) no pre-processing or changes; (2) removing English stop words (i.e. extremely common words) from the feature set using the stop word list available in the Natural Language Toolkit (NLTK) corpus; (3) stemming (i.e. reducing a word to its stem/root form) to remove repetitive features, using the Snowball Stemmer algorithm built into NLTK.
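The cleaning steps, the polarity rule, and the three feature variants described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual script: the JSON field names `text` and `stars`, the tiny `TOY_STOP_WORDS` set, and the `toy_stem` function are hypothetical stand-ins chosen so the example runs without any downloads.

```python
import json
import re
import string

# Hypothetical stand-in for a real stop word list (assumption for this sketch).
TOY_STOP_WORDS = {"the", "a", "an", "was", "and", "is", "it"}


def toy_stem(word):
    """Crude suffix stripper, only a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lowercase, drop punctuation, and collapse runs of white space."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def polarity(stars):
    """Map a star rating to 1.0 (positive, more than 3 stars) or 0.0."""
    return 1.0 if stars > 3 else 0.0


def tokens(text, stop_words=None, stem=None):
    """Tokenize cleaned text, optionally removing stop words and stemming."""
    words = clean_text(text).split()
    if stop_words is not None:
        words = [w for w in words if w not in stop_words]
    if stem is not None:
        words = [stem(w) for w in words]
    return words


def parse_review(json_line):
    """Keep only the review text and star rating from one JSON record."""
    record = json.loads(json_line)
    return record["text"], record["stars"]


line = '{"stars": 5, "text": "The tacos were AMAZING,   and the staff was friendly!"}'
text, stars = parse_review(line)
label = polarity(stars)                             # 1.0
basic = tokens(text)                                # variant (1)
no_stop = tokens(text, stop_words=TOY_STOP_WORDS)   # variant (2)
stemmed = tokens(text, stem=toy_stem)               # variant (3)
```

With NLTK installed, the stand-ins could be swapped for `set(nltk.corpus.stopwords.words("english"))` and `nltk.stem.snowball.SnowballStemmer("english").stem`, which is closer to what the text describes.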

Three different machine learning algorithms are implemented and examined: Naive Bayes, SVM and Logistic Regression.

Results

Naive Bayes

Multinomial Naive Bayes is evaluated on 100,000 instances. The results are reported with precision, recall, and F1-score metrics. First, the polarity of the reviews is evaluated (Fig. 2). Then the same methods are applied to 5 classes, which represent the 5 star ratings (Fig. 3). The results are relatively high for the 2-class polarity evaluation. However, a significant decrease is observed for 5 classes. This can be attributed to the fact that the lexicons of 4- and 5-star reviews are relatively close to each other, as are the lexicons of 1-, 2-, and 3-star reviews.
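To make the classifier and the reported metrics concrete, here is a small from-scratch sketch of multinomial Naive Bayes with Laplace smoothing, plus precision, recall, and F1 for one class. It is an assumption-laden illustration on four invented toy reviews; the project most likely relied on a library implementation, and the function names here are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {word for doc in docs for word in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    model = {}
    for label, n_docs in docs_per_class.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {
            word: math.log((word_counts[label][word] + alpha) / total)
            for word in vocab
        }
        model[label] = (log_prior, log_like)
    return model


def predict(model, doc):
    """Return the class with the highest posterior log-probability."""
    def score(label):
        log_prior, log_like = model[label]
        # Words never seen during training are skipped.
        return log_prior + sum(log_like[w] for w in doc if w in log_like)
    return max(model, key=score)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class given as `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented toy data: 1 = positive polarity, 0 = negative polarity.
docs = [["great", "food"], ["terrible", "service"],
        ["great", "service"], ["terrible", "food"]]
labels = [1, 0, 1, 0]
model = train_multinomial_nb(docs, labels)
prediction = predict(model, ["great"])  # 1
```

The same code handles the 5-class case unchanged if the labels are star ratings 1 to 5; per-class metrics then come from calling `precision_recall_f1` once per star value.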

Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. Given labeled training data, the algorithm outputs an optimal hyperplane that categorizes new incoming instances.
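As a rough illustration of the separating-hyperplane idea, the sketch below trains a linear SVM by subgradient descent on the L2-regularized hinge loss and classifies a point by the side of the hyperplane w.x + b = 0 it falls on. The training procedure, hyperparameters, and toy feature vectors (counts of a "positive" and a "negative" word) are all assumptions for this example, not the project's setup.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=500):
    """Subgradient descent on the regularized hinge loss; y in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            violated = t * (dot(w, x) + b) < 1
            # The L2 regularizer always shrinks w slightly toward zero.
            w = [wi * (1 - lr * lam) for wi in w]
            if violated:
                # Push the hyperplane away from a point inside the margin.
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1


# Invented toy features: [count of "great", count of "terrible"].
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
new_positive = classify(w, b, [5, 0])
new_negative = classify(w, b, [0, 5])
```

On this toy data the learned weight for "great" ends up positive and the weight for "terrible" negative, so new reviews are assigned by which word dominates.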

Logistic Regression

The outcome is measured with a categorical variable (in this case, two to five possible outcomes; it is dichotomous only in the two-class polarity setting). The goal was to find the best-fitting model to describe the relationship between the outcome of interest (the review's rating class) and a set of independent variables.
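The relationship between the features and the outcome can be sketched as follows: a sigmoid turns a weighted sum of features into a probability for the two-class case, and a softmax generalizes this to five star classes. The gradient-descent fitting loop, the hyperparameters, and the toy data are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Stochastic gradient descent on the log loss; y in {0, 1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - t  # gradient of the log loss w.r.t. the score
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    """Probability that x belongs to the positive class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Invented toy features: [count of "great", count of "terrible"].
X = [[2, 0], [3, 1], [0, 2], [1, 3]]
y = [1, 0 + 1, 0, 0]
w, b = fit_logistic(X, y)
p_positive = predict_proba(w, b, [5, 0])
p_negative = predict_proba(w, b, [0, 5])

# Five hypothetical class scores, one per star rating.
star_probs = softmax([1.0, 2.0, 3.0, 4.0, 5.0])
```

For the 5-star task, each class would get its own weight vector and the predicted rating would be the class with the largest softmax probability.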
