Open source models for NLP: https://huggingface.co/models. Import data from CSV file.
-
Data Loading and Exploration:
- The dataset contains Amazon product reviews, and only the first 500 rows are used.
- An initial exploratory data analysis (EDA) visualizes the distribution of review scores.
-
Sentiment Analysis with VADER:
- VADER (Valence Aware Dictionary and sEntiment Reasoner) is used to analyze the sentiment of each review.
- Each review is scored with four metrics:
  - Negative (neg): The proportion of the text that is negative.
  - Neutral (neu): The proportion of the text that is neutral.
  - Positive (pos): The proportion of the text that is positive.
  - Compound: A normalized score that summarizes the overall sentiment, ranging from -1 (most negative) to 1 (most positive).
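To make the compound score less of a black box, here is a minimal sketch of how a raw sum of word valences can be squashed into the -1 to 1 range. The formula and the constant alpha = 15 follow my understanding of VADER's reference implementation. The function name `normalize_compound` and the sample values are hypothetical and only for illustration.

```python
import math


def normalize_compound(raw_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    # Clamp to guard against floating-point drift at the extremes.
    return max(-1.0, min(1.0, score))


# Hypothetical summed valences for three reviews (not taken from the dataset).
for raw in (-4.0, 0.0, 4.0):
    print(raw, round(normalize_compound(raw), 4))
```

A sum of 0 stays at 0, and larger sums move toward -1 or 1 without ever exceeding them, which is why the compound score is convenient for comparing reviews of different lengths.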
-
Sentiment Analysis with RoBERTa:
- The project also uses a pretrained RoBERTa model, a transformer-based model, for more context-aware sentiment analysis.
- RoBERTa generates similar scores for each review:
  - roberta_neg: Probability of the text being negative.
  - roberta_neu: Probability of the text being neutral.
  - roberta_pos: Probability of the text being positive.
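A sequence-classification model usually returns raw scores (logits) rather than probabilities, and a softmax converts them. The sketch below shows that conversion using only the standard library. It assumes the model emits three logits in the order negative, neutral, positive. That order, the helper names `softmax` and `to_score_dict`, and the sample logits are assumptions, not details confirmed by this project.

```python
import math

# Column names used in this project for the three class probabilities.
LABELS = ("roberta_neg", "roberta_neu", "roberta_pos")


def softmax(logits):
    """Convert raw model outputs (logits) into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_score_dict(logits):
    """Attach the project's column names to the three probabilities."""
    return dict(zip(LABELS, softmax(logits)))


# Hypothetical logits for one review, assumed order: negative, neutral, positive.
example = to_score_dict([-1.2, 0.3, 2.5])
print(example)
```

The largest logit ends up with the largest probability, so in this toy example roberta_pos dominates.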
-
Comparison of Results:
- The project compares VADER and RoBERTa results, highlighting how each model scores the sentiment.
- Bar plots visualize the sentiment scores by review star rating.
This project effectively demonstrates how different sentiment analysis models can be applied to text data, providing insights into customer feedback by quantifying the sentiment of their reviews.
- Python: The primary programming language used for data manipulation, analysis, and visualization.
- Pandas: For data loading, cleaning, and manipulation.
- Matplotlib & Seaborn: For creating visualizations to explore and compare sentiment scores.
- VADER (Valence Aware Dictionary and sEntiment Reasoner): A lexicon and rule-based sentiment analysis tool that is specifically tuned for social media text. It provides sentiment scores directly for each review.
- Hugging Face Transformers Library: Used to run the RoBERTa model for sentiment analysis, allowing for a more context-aware understanding of the reviews.
-
Sentiment Analysis:
- VADER Sentiment Analysis:
  - Provides quick and easy sentiment scoring by evaluating each word in the text against a predefined lexicon.
  - Outputs four scores: positive, neutral, negative, and compound.
- RoBERTa Sentiment Analysis:
  - Uses a pre-trained transformer-based model that understands the context of words in the text.
  - Outputs probabilities for each sentiment class (positive, neutral, and negative).
-
Data Visualization:
- Bar Plots and Distribution Plots: Used to compare sentiment scores across different star ratings, providing insights into the relationship between the ratings and the sentiment scores.
- Side-by-Side Comparisons: Visualizations comparing VADER and RoBERTa results help show how different models interpret the sentiment of the same text.
- Data Loading: The first 500 rows of the Amazon reviews dataset are loaded for analysis.
- Preprocessing: Basic cleaning and preparation of the text data, if required.
- Sentiment Scoring:
  - VADER: Applied directly to the text data to obtain sentiment scores.
  - RoBERTa: Text data is tokenized and passed through the pre-trained RoBERTa model to obtain sentiment probabilities.
- Visualization and Analysis:
  - Sentiment scores are analyzed and visualized to understand their distribution and correlation with star ratings.
  - VADER and RoBERTa results are compared to highlight differences in sentiment scoring.
Step 0. Read in Data and NLTK (Natural Language Toolkit)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
plt.style.use('ggplot')
Load the data from the CSV file:
df = pd.read_csv('../20230709-Python-Sentiment-Analysis-Project-with-NLTK/Reviews.csv')
df = df.head(500)
print(df.shape)
Run this step and check the printed shape of the data.
Then check the table with:
df.head()
Quick EDA
df['Score'].value_counts().sort_index().plot(kind='bar', title='Count of Reviews by Stars', figsize=(8, 5))
NLTK can also generate part-of-speech (POS) tags for the review text.
Step 1. VADER Sentiment Score
Use NLTK's SentimentIntensityAnalyzer to get the neg/neu/pos scores of the reviews.
This step needs NLTK's vader_lexicon resource:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
sia = SentimentIntensityAnalyzer()
# Run the polarity score on the entire dataset
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)
Build a DataFrame of the neg/neu/pos scores and merge it with the original data:
vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')
Then check the table of neg/neu/pos scores.
Plot VADER results:
Note: the plot uses y='compound' (lowercase). Using y='Compound' raises an error because column names are case-sensitive and VADER returns its scores under lowercase keys (neg, neu, pos, compound).
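A bar plot of the compound score against the star rating typically shows one bar per rating, with the bar height being the mean score for that rating (this is seaborn's default for bar plots). The sketch below reproduces that aggregation with the standard library only. The data and the helper name `mean_by_rating` are hypothetical and not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (star rating, compound score) pairs, not taken from the dataset.
rows = [(1, -0.62), (1, -0.10), (3, 0.05), (5, 0.81), (5, 0.93)]


def mean_by_rating(pairs):
    """Group compound scores by star rating and average each group."""
    groups = defaultdict(list)
    for stars, compound in pairs:
        groups[stars].append(compound)
    # Sort by rating so the result reads like the x-axis of the bar plot.
    return {stars: mean(values) for stars, values in sorted(groups.items())}


summary = mean_by_rating(rows)
print(summary)
```

If the models behave sensibly, the mean compound score should rise with the star rating, which is the pattern to look for in the plot.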
Step 2. RoBERTa Pretrained Model
Step 3. Run the RoBERTa Model
Step 4. Review Examples: Identify positive 1-star and negative 5-star reviews, and look at examples where the model score and the review score differ the most. Start by checking the results table.
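One way to find these mismatches is to filter by star rating and pick the review with the highest score for the opposite sentiment. The sketch below does this on a few toy rows with the standard library only. The rows, their texts, and the helper name `most_extreme` are hypothetical. In the project, the rows would come from the merged results table.

```python
# Hypothetical rows standing in for the merged results table.
reviews = [
    {"Score": 1, "roberta_pos": 0.91, "roberta_neg": 0.03, "Text": "toy review A"},
    {"Score": 1, "roberta_pos": 0.02, "roberta_neg": 0.95, "Text": "toy review B"},
    {"Score": 5, "roberta_pos": 0.05, "roberta_neg": 0.88, "Text": "toy review C"},
    {"Score": 5, "roberta_pos": 0.97, "roberta_neg": 0.01, "Text": "toy review D"},
]


def most_extreme(rows, stars, column):
    """Return the review with the given star rating that scores highest in `column`."""
    candidates = [r for r in rows if r["Score"] == stars]
    return max(candidates, key=lambda r: r[column])


# A 1-star review the model considers positive, and a 5-star review it considers negative.
positive_one_star = most_extreme(reviews, stars=1, column="roberta_pos")
negative_five_star = most_extreme(reviews, stars=5, column="roberta_neg")
print(positive_one_star["Text"])
print(negative_five_star["Text"])
```

With pandas, the same selection could be done by filtering on Score and sorting by the score column with sort_values. Reading the selected texts often reveals sarcasm or mixed feedback that explains the mismatch.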
Extra: The Transformers Pipeline
Requirements:
punkt
averaged_perceptron_tagger
maxent_ne_chunker
words
vader_lexicon
Xformers
Note: This project was developed on a Lenovo IdeaPad Miix 520 2-in-1 running Windows 10, using an Azure Machine Learning workspace.