Kanangnut/Python-Sentiment-Analysis-with-NLTK-Transformers-Pipeline

Python Sentiment Analysis Project with the Natural Language Toolkit (NLTK) for Classifying Amazon Reviews

Credit

Open-source models for NLP: https://huggingface.co/models. The data is imported from a CSV file.

Project Overview:

  1. Data Loading and Exploration:

    • The dataset contains Amazon product reviews, and only the first 500 rows are used.
    • An initial exploratory data analysis (EDA) visualizes the distribution of review scores.
  2. Sentiment Analysis with VADER:

    • VADER (Valence Aware Dictionary and sEntiment Reasoner) is used to analyze the sentiment of each review.
    • Each review is scored with four metrics:
      • Negative (neg): Represents the proportion of the text that is negative.
      • Neutral (neu): Represents the proportion of the text that is neutral.
      • Positive (pos): Represents the proportion of the text that is positive.
      • Compound: A normalized score that summarizes the overall sentiment; ranges from -1 (most negative) to 1 (most positive).
  3. Sentiment Analysis with RoBERTa:

    • The project also uses a pretrained RoBERTa model, a transformer-based model, to provide a more context-aware sentiment analysis.
    • RoBERTa produces a comparable set of scores for each review:
      • roberta_neg: Probability of the text being negative.
      • roberta_neu: Probability of the text being neutral.
      • roberta_pos: Probability of the text being positive.
  4. Comparison of Results:

    • The project compares VADER and RoBERTa results, highlighting how each model scores the sentiment.
    • Visualization of the sentiment scores by review star ratings is done using bar plots.
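The compound score described in item 2 can be illustrated with a short sketch. It assumes the normalization used in the VADER reference implementation, x / sqrt(x^2 + alpha) with alpha = 15, applied to the summed word valences. The function name `normalize_compound` and the sample inputs are illustrative and not taken from this project.

```python
import math

def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into the range [-1, 1].

    Assumption: mirrors the normalization in the VADER reference
    implementation; alpha controls how quickly the score saturates.
    """
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, score))

# Illustrative inputs only: a neutral, a mildly positive, and a strongly negative sum.
for total in (0.0, 1.9, -8.0):
    print(total, round(normalize_compound(total), 4))
```

Because of the square root in the denominator, large valence sums approach but never exceed -1 or 1, which is why the compound score stays bounded regardless of review length.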

This project effectively demonstrates how different sentiment analysis models can be applied to text data, providing insights into customer feedback by quantifying the sentiment of their reviews.

Tools and Technical Aspects:

1. Tools Used:

  • Python: The primary programming language used for data manipulation, analysis, and visualization.
  • Pandas: For data loading, cleaning, and manipulation.
  • Matplotlib & Seaborn: For creating visualizations to explore and compare sentiment scores.
  • VADER (Valence Aware Dictionary and sEntiment Reasoner): A lexicon and rule-based sentiment analysis tool that is specifically tuned for social media text. It provides sentiment scores directly for each review.
  • Hugging Face Transformers Library: Used to implement the RoBERTa model for sentiment analysis, allowing for a more context-aware understanding of the reviews.

2. Technical Aspects:

  • Sentiment Analysis:

    • VADER Sentiment Analysis:
      • Provides quick and easy sentiment scoring by evaluating each word in the text against a predefined lexicon.
      • Outputs four scores: positive, neutral, negative, and compound.
    • RoBERTa Sentiment Analysis:
      • Uses a pre-trained transformer-based model that understands the context of words in the text.
      • Outputs probabilities for each sentiment class (positive, neutral, and negative).
  • Data Visualization:

    • Bar Plots and Distribution Plots: Used to compare sentiment scores across different star ratings, providing insights into the relationship between the ratings and the sentiment scores.
    • Side-by-Side Comparisons: Visualizations comparing VADER and RoBERTa results help show how the two models interpret the sentiment of the same text.
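The step from raw model output to per-class probabilities can be sketched with a softmax. This is a minimal illustration, assuming the model emits three logits in the order negative, neutral, positive; that order, the helper name `logits_to_scores`, and the sample logits are assumptions, not details confirmed by this project.

```python
import math

LABELS = ("roberta_neg", "roberta_neu", "roberta_pos")

def logits_to_scores(logits):
    """Convert three raw logits into probabilities that sum to 1 (softmax)."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per sentiment class")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(LABELS, exps)}

# Made-up logits for illustration, not real model output.
print(logits_to_scores([-1.2, 0.3, 2.5]))
```

Unlike the VADER neg/neu/pos proportions, these values are class probabilities, so the two sets of scores are comparable in shape but not in meaning.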

3. Process:

  • Data Loading: The first 500 rows of the Amazon reviews dataset are loaded for analysis.
  • Preprocessing: Basic cleaning and preparation of the text data, if required.
  • Sentiment Scoring:
    • VADER: Applied directly to the text data to obtain sentiment scores.
    • RoBERTa: Text data is tokenized and passed through the pre-trained RoBERTa model to obtain sentiment probabilities.
  • Visualization and Analysis:
    • Sentiment scores are analyzed and visualized to understand their distribution and correlation with star ratings.
    • VADER and RoBERTa results are compared to highlight differences in sentiment scoring.
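The relationship between sentiment scores and star ratings mentioned in the process above can be checked numerically before plotting. The sketch below uses a toy table with invented values; only the column names `Score` and `compound` follow this project.

```python
import pandas as pd

# Toy values invented for illustration; they are not results from this project.
scores = pd.DataFrame({
    "Score": [1, 1, 5, 5, 5],
    "compound": [-0.6, -0.2, 0.7, 0.9, 0.5],
})

# Average compound score for each star rating.
mean_by_star = scores.groupby("Score")["compound"].mean()
print(mean_by_star)

# Pearson correlation between star rating and compound score.
print(scores["Score"].corr(scores["compound"]))
```

A bar plot of sentiment by star rating shows the same per-rating averages graphically.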

Project Steps

Step 0. Read in the Data and Import NLTK (Natural Language Toolkit)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk

plt.style.use('ggplot')

Load the data from the CSV file:
df = pd.read_csv('../20230709-Python-Sentiment-Analysis-Project-with-NLTK/Reviews.csv')
df = df.head(500)
print(df.shape)
Run this step to check the result, which is the shape of the truncated DataFrame.
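As an alternative to reading the whole file and then truncating it, `pd.read_csv` accepts an `nrows` argument. The sketch below uses an in-memory stand-in for the CSV so it runs without the real file; the stand-in columns are a simplified assumption about the dataset.

```python
import io
import pandas as pd

# Stand-in for Reviews.csv so the sketch runs without the real file.
csv_text = "Id,Score,Text\n" + "".join(f"{i},5,review {i}\n" for i in range(1, 1001))

# Read everything, then truncate (the approach used above).
full_then_head = pd.read_csv(io.StringIO(csv_text)).head(500)

# Read only the first 500 data rows.
first_500 = pd.read_csv(io.StringIO(csv_text), nrows=500)

print(first_500.shape)
print(full_then_head.equals(first_500))
```

For a large reviews file, `nrows` avoids parsing rows that are discarded immediately afterwards.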

Then inspect the first rows of the table:
df.head()

Quick EDA
df['Score'].value_counts().sort_index().plot(kind='bar', title='Count of Reviews by Stars', figsize=(8, 5))

NLTK is used to generate part-of-speech (POS) tags.

NLTK is also used to chunk the tagged tokens into named entities.

Step 1. VADER Sentiment Scoring
Use NLTK's SentimentIntensityAnalyzer to get the neg/neu/pos scores of the reviews.

Set up the analyzer, which relies on the vader_lexicon resource:
from nltk.sentiment import SentimentIntensityAnalyzer
from tqdm.notebook import tqdm
sia = SentimentIntensityAnalyzer()

# Run the polarity score on the entire dataset
res = {}
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row['Text']
    myid = row['Id']
    res[myid] = sia.polarity_scores(text)

Build a DataFrame of the neg/neu/pos scores and merge it with the review data:

vaders = pd.DataFrame(res).T
vaders = vaders.reset_index().rename(columns={'index': 'Id'})
vaders = vaders.merge(df, how='left')
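The scoring loop and the three DataFrame lines above can be sketched as one function. This is an illustrative variant, not the project's code: `score_reviews` and `stub_scorer` are hypothetical names, and the stub returns dummy values so the sketch runs without NLTK.

```python
import pandas as pd

def score_reviews(df, scorer):
    """Score each review's Text and join the scores back onto df by Id."""
    res = {row["Id"]: scorer(row["Text"]) for _, row in df.iterrows()}
    # Review Ids become columns, so transpose to get one row per review Id.
    scores = pd.DataFrame(res).T
    scores = scores.reset_index().rename(columns={"index": "Id"})
    # Name the join key explicitly rather than relying on shared column names.
    return scores.merge(df, how="left", on="Id")

def stub_scorer(text):
    """Hypothetical stand-in that returns VADER-shaped keys with dummy values."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

reviews = pd.DataFrame({"Id": [1, 2], "Score": [5, 1], "Text": ["good", "bad"]})
print(score_reviews(reviews, stub_scorer).columns.tolist())
```

With NLTK and the vader_lexicon resource installed, `SentimentIntensityAnalyzer().polarity_scores` can be passed as `scorer` in place of the stub.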

Display the merged table to check the neg/neu/pos scores alongside the review data.

Plot VADER results:
Note: setting the y-axis to y='Compound' raised an error, so y='compound' is used instead. Column names are case-sensitive, and VADER returns this key in lowercase.
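One likely explanation for the y-axis error noted above, assuming it was a missing-column error: pandas column names are case-sensitive, and `polarity_scores` returns the key `compound` in lowercase. The sketch below reproduces this with a one-row table of dummy values.

```python
import pandas as pd

# One VADER-shaped result with dummy values; the keys mirror polarity_scores output.
vaders = pd.DataFrame([{"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6, "Score": 5}])

print("compound" in vaders.columns)
print("Compound" in vaders.columns)

try:
    vaders["Compound"]
except KeyError as err:
    print("KeyError:", err)

# Hypothetical alternative: rename the column if a capitalized name is preferred.
renamed = vaders.rename(columns={"compound": "Compound"})
print(renamed.columns.tolist())
```

If only the axis label should read "Compound", the column can stay lowercase and the label can be set on the returned Matplotlib Axes with `set_ylabel`.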

Step 2. RoBERTa Pretrained Model


Step 3. Run the RoBERTa Model

Check the input_ids first.


Compare scores between the models.
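To compare the two models side by side, both sets of scores need to sit in one record per review. A minimal sketch, assuming the VADER keys are given a `vader_` prefix; that prefix and the helper name `combine_scores` are assumptions, while the `roberta_*` names follow the overview above.

```python
def combine_scores(vader_result, roberta_result):
    """Merge both models' scores into one flat record for a single review."""
    # Prefix VADER keys so they cannot collide with the RoBERTa column names.
    combined = {f"vader_{key}": value for key, value in vader_result.items()}
    combined.update(roberta_result)
    return combined

# Dummy values for illustration only.
vader_result = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8}
roberta_result = {"roberta_neg": 0.1, "roberta_neu": 0.2, "roberta_pos": 0.7}
print(combine_scores(vader_result, roberta_result))
```

Collecting one such record per review Id gives a table whose columns can be plotted against each other.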

Step 4. Review Examples: Identify positive 1-star and negative 5-star reviews, that is, examples where the model's score and the review's star rating differ the most. Start by checking the results table.
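The lookup described in this step can be sketched with a filter and a sort. The rows below are invented for illustration; the column names `Score`, `Text`, `roberta_pos`, and `roberta_neg` follow this project, and the results table is assumed to hold one row per review.

```python
import pandas as pd

# Toy rows invented for illustration; real values come from the scored dataset.
results = pd.DataFrame({
    "Score": [1, 1, 5, 5],
    "Text": ["a", "b", "c", "d"],
    "roberta_pos": [0.90, 0.05, 0.95, 0.10],
    "roberta_neg": [0.05, 0.90, 0.02, 0.85],
})

# 1-star review that the model rates as most positive.
positive_one_star = (
    results.query("Score == 1").sort_values("roberta_pos", ascending=False)["Text"].iloc[0]
)

# 5-star review that the model rates as most negative.
negative_five_star = (
    results.query("Score == 5").sort_values("roberta_neg", ascending=False)["Text"].iloc[0]
)

print(positive_one_star, negative_five_star)
```

The same pattern works for the VADER columns by swapping in the corresponding positive and negative score columns.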


Extra: The Transformers Pipeline

Requirements:
punkt
averaged_perceptron_tagger
maxent_ne_chunker
words
vader_lexicon
Xformers

Note: This project was developed on a Lenovo IdeaPad Miix 520 2-in-1 running Windows 10, using an Azure Machine Learning workspace.
