After trying TensorFlow and PyTorch, I decided to add PySpark to the stack because it handles massive datasets well. The project uses PySpark for efficient data preprocessing and PyTorch for building and training the deep learning models.
This project performs sentiment analysis on Amazon Automotive product reviews. It classifies reviews as positive or negative based on the review text, demonstrating the use of distributed computing for natural language processing tasks. My goal is to implement and compare three deep learning models for the analysis:
- RNN LSTM
- CNN
- RNN LSTM + CNN
- Data preprocessing using PySpark
- Text feature extraction using TF-IDF
- Binary classification using Logistic Regression
- Model evaluation using Area Under ROC and Accuracy metrics
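The positive/negative labels for binary classification are typically derived from each review's star rating. As a sketch (assuming the dataset's 1-5 star `overall` field and a common thresholding convention, neither of which is fixed by this project):

```python
def star_to_label(overall):
    """Map a 1-5 star rating to a binary sentiment label.

    Reviews with 4-5 stars are treated as positive (1), 1-2 stars as
    negative (0), and neutral 3-star reviews are dropped (None).
    These thresholds are a common convention, not mandated by the data.
    """
    if overall >= 4.0:
        return 1
    if overall <= 2.0:
        return 0
    return None
```

Dropping the ambiguous 3-star reviews keeps the two classes cleanly separated, at the cost of a slightly smaller training set.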
While TensorFlow and PyTorch are excellent choices for the deep learning models themselves, I chose PySpark for this project for several reasons:
- Scalability: PySpark is designed to handle very large datasets that may not fit into the memory of a single machine. It can distribute data processing across a cluster of computers, making it ideal for big data scenarios.
- Integrated Analytics: PySpark provides a unified engine for large-scale data processing and machine learning. It allows us to perform data loading, preprocessing, model training, and evaluation all within the same framework.
- Simplified ML Pipeline: PySpark's MLlib offers a high-level API for building machine learning pipelines, making it easier to assemble and tune ML workflows.
- Performance for Large Datasets: For very large datasets, PySpark can outperform single-machine solutions by leveraging distributed computing resources.
- Business Reality: In many business scenarios, simple models that can process vast amounts of data quickly are more valuable than complex models that take longer to train and deploy.
- Easy Integration: If this project needs to be integrated into a larger data processing ecosystem (e.g., Hadoop ecosystem), PySpark makes this integration seamless.
While deep learning models (like those built with TensorFlow or PyTorch) could potentially achieve higher accuracy for this task, the PySpark solution offers a good balance of performance, scalability, and simplicity, especially when dealing with large-scale text data.
- Machine: MacBook with M3 Pro chip (MPS device used for GPU acceleration)
- Python version: 3.9
- Main libraries: PySpark 3.4.1, PyTorch 1.12+ (the MPS backend requires PyTorch 1.12 or later)
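Using the M3 Pro's GPU via MPS comes down to a standard device check with a CPU fallback; a small sketch:

```python
import torch

# Prefer Apple's Metal backend (MPS) when available, else fall back to CPU.
# Note: the MPS backend requires PyTorch 1.12 or later.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Models and tensors are then moved with .to(device) as usual.
```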
- Clone the repository
- Create a virtual environment:
conda create -n <environment name> python=3.9
conda activate <environment name>
- Install the required packages:
pip install -r requirements.txt
- Python 3.7+
- PySpark 3.4.1
- PyArrow 12.0.1
- Java 8 or later
- PyTorch 1.12+
amazon_reviews_sentiment/
│
├── data/
│ └── reviews_Automotive_5.json.gz
│
├── src/
│ ├── __init__.py
│ ├── data_processing.py
│ ├── model.py
│ └── utils.py
│
├── main.py
├── requirements.txt
└── README.md
- Make sure your reviews_Automotive_5.json.gz file is in the data/ directory.
- Run the main script:
python main.py
After training with early stopping (maximum 10 epochs, patience of 3):
- RNN LSTM (Trained for 8 epochs):
  - Train Loss: 0.0983, Train Acc: 0.9689
  - Val Loss: 0.4567, Val Acc: 0.8515
  - Test Loss: 0.3527, Test Accuracy: 0.8688
- CNN (Trained for 5 epochs):
  - Train Loss: 0.0071, Train Acc: 0.9998
  - Val Loss: 0.4333, Val Acc: 0.8704
  - Test Loss: 0.3312, Test Accuracy: 0.8718
- RNN LSTM + CNN (Trained for 5 epochs):
  - Train Loss: 0.0288, Train Acc: 0.9914
  - Val Loss: 0.6986, Val Acc: 0.8352
  - Test Loss: 0.3360, Test Accuracy: 0.8425
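For reference, the CNN family compared above can be sketched as a minimal 1-D convolutional text classifier. All sizes below (vocabulary, embedding dimension, filter counts, kernel widths) are illustrative assumptions, not the project's actual hyperparameters:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal 1-D convolutional text classifier (illustrative sizes)."""

    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per kernel width, each sliding over the sequence axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, embed_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling for each filter bank.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1)).squeeze(1)  # raw logits
```

The LSTM variant swaps the convolution/pooling stages for an `nn.LSTM` over the embedded sequence; the combined model chains the two.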
The training process took approximately 74.32 seconds to run.
The following plot shows the training and validation accuracy and loss for all three models over the training epochs:
- Accuracy (left plot):
  - All models show improvement in both training and validation accuracy over time.
  - The CNN model (green) shows the fastest increase in training accuracy, reaching near 100% quickly.
  - The RNN LSTM model (blue) shows a more gradual increase in accuracy for both training and validation.
  - The RNN LSTM + CNN model (red) shows a pattern similar to CNN but with slightly lower validation accuracy.
- Loss (right plot):
  - All models show a decrease in both training and validation loss over time.
  - The CNN model's training loss decreases very rapidly, almost reaching zero.
  - The RNN LSTM model shows a more gradual decrease in both training and validation loss.
  - The RNN LSTM + CNN model's loss decrease pattern is similar to CNN but with higher validation loss.
- Overfitting:
  - The gap between training and validation metrics (especially for CNN and RNN LSTM + CNN) suggests some degree of overfitting.
  - The RNN LSTM model seems to have the least overfitting, with training and validation metrics staying closer together.
- Early Stopping:
  - The plots show where early stopping occurred for each model, as indicated by the end of each line.
  - CNN and RNN LSTM + CNN stopped earlier than RNN LSTM, likely due to the early stopping mechanism detecting potential overfitting.
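The early stopping used here (patience of 3 on validation loss, maximum 10 epochs) can be sketched as a small helper. This is my illustration of the mechanism, not the project's exact implementation:

```python
class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum change that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, `stopper.step(val_loss)` is checked once per epoch, which is how the CNN and combined models could halt at epoch 5 while the LSTM ran to epoch 8.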
These visualizations support my numerical results and provide insights into the learning dynamics of each model throughout the training process.
- All models achieved good performance, with test accuracies around 84-87%.
- The CNN model showed the fastest convergence and best performance, reaching near-perfect training accuracy by the 5th epoch. However, the gap between training and validation accuracy suggests some overfitting.
- The RNN LSTM model showed more stable performance, with less overfitting than the CNN model.
- The combined RNN LSTM + CNN model didn't outperform the individual models, suggesting that the additional complexity might not be beneficial for this task.
- Early stopping was effective in preventing overfitting, with all models stopping before the maximum number of epochs.
- The CNN model achieved the highest test accuracy, followed closely by the RNN LSTM model.
Initially, I attempted to use TensorFlow for this project. However, I encountered persistent crashes related to protobuf compatibility issues on the M3 Mac. As a result, I switched to PyTorch, which provided better compatibility and performance on this machine.
It's worth noting that these issues might be specific to the M3 Mac architecture, and users with different hardware configurations might not encounter the same problems with TensorFlow.
- Fine-tune hyperparameters for each model to potentially improve performance.
- Experiment with pre-trained word embeddings like Word2Vec or GloVe.
- Implement cross-validation for more robust evaluation.
- Explore attention mechanisms to potentially improve the RNN LSTM model.
- Analyze misclassified examples to gain insights for further improvements.
- Investigate ways to reduce overfitting in the CNN model.
- Experiment with different architectures for the combined RNN LSTM + CNN model to see if its performance can be improved.