After trying TensorFlow and PyTorch, I decided to add PySpark to the stack because it handles massive datasets well. The project uses PySpark for efficient data preprocessing and PyTorch for building and training the deep learning models.
This project performs sentiment analysis on Amazon Automotive product reviews. It classifies reviews as positive or negative based on the review text, demonstrating the use of distributed computing for natural language processing tasks. My goal is to implement and compare three deep learning models for the analysis:
- RNN LSTM
- CNN
- RNN LSTM + CNN
- Data preprocessing using PySpark
- Text feature extraction using TF-IDF
- Binary classification using Logistic Regression
- Model evaluation using Area Under ROC and Accuracy metrics
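The positive/negative labels for binary classification are typically derived from each review's star rating. As a sketch (assuming the dataset's 1-5 star `overall` field and a common thresholding convention, neither of which is fixed by this project):

```python
def star_to_label(overall):
    """Map a 1-5 star rating to a binary sentiment label.

    Reviews with 4-5 stars are treated as positive (1), 1-2 stars as
    negative (0), and neutral 3-star reviews are dropped (None).
    These thresholds are a common convention, not mandated by the data.
    """
    if overall >= 4.0:
        return 1
    if overall <= 2.0:
        return 0
    return None
```

Dropping the ambiguous 3-star reviews keeps the two classes cleanly separated, at the cost of a slightly smaller training set.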
While TensorFlow and PyTorch are excellent choices for the deep learning models themselves, I chose PySpark for this project for several reasons:
- Scalability: PySpark is designed to handle very large datasets that may not fit into the memory of a single machine. It can distribute data processing across a cluster of computers, making it ideal for big data scenarios.
- Integrated Analytics: PySpark provides a unified engine for large-scale data processing and machine learning. It allows us to perform data loading, preprocessing, model training, and evaluation all within the same framework.
- Simplified ML Pipeline: PySpark's MLlib offers a high-level API for building machine learning pipelines, making it easier to assemble and tune ML workflows.
- Performance for Large Datasets: For very large datasets, PySpark can outperform single-machine solutions by leveraging distributed computing resources.
- Business Reality: In many business scenarios, simple models that can process vast amounts of data quickly are more valuable than complex models that take longer to train and deploy.
- Easy Integration: If this project needs to be integrated into a larger data processing ecosystem (e.g., Hadoop ecosystem), PySpark makes this integration seamless.
While deep learning models (like those built with TensorFlow or PyTorch) could potentially achieve higher accuracy for this task, the PySpark solution offers a good balance of performance, scalability, and simplicity, especially when dealing with large-scale text data.
- Machine: MacBook with M3 Pro chip (MPS device used for GPU acceleration)
- Python version: 3.9
- Main libraries: PySpark 3.4.1, PyTorch 1.12+ (the MPS backend requires PyTorch 1.12 or later)
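Using the M3 Pro's GPU via MPS comes down to a standard device check with a CPU fallback; a small sketch:

```python
import torch

# Prefer Apple's Metal backend (MPS) when available, else fall back to CPU.
# Note: the MPS backend requires PyTorch 1.12 or later.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Models and tensors are then moved with .to(device) as usual.
```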
- Clone the repository
- Create a virtual environment:
conda create -n <environment name> python=3.9
conda activate <environment name>
- Install the required packages:
pip install -r requirements.txt
- Python 3.7+
- PySpark 3.4.1
- PyArrow 12.0.1
- Java 8 or later
- PyTorch 1.12+
amazon_reviews_sentiment/
│
├── data/
│ └── reviews_Automotive_5.json.gz
│
├── src/
│ ├── __init__.py
│ ├── data_processing.py
│ ├── model.py
│ └── utils.py
│
├── main.py
├── requirements.txt
└── README.md
- Make sure your reviews_Automotive_5.json.gz file is in the data/ directory.
- Run the main script:
python main.py
After training with early stopping (maximum 10 epochs, patience of 3):
- RNN LSTM (Trained for 8 epochs):
  - Train Loss: 0.0983, Train Acc: 0.9689
  - Val Loss: 0.4567, Val Acc: 0.8515
  - Test Loss: 0.3527, Test Accuracy: 0.8688
- CNN (Trained for 5 epochs):
  - Train Loss: 0.0071, Train Acc: 0.9998
  - Val Loss: 0.4333, Val Acc: 0.8704
  - Test Loss: 0.3312, Test Accuracy: 0.8718
- RNN LSTM + CNN (Trained for 5 epochs):
  - Train Loss: 0.0288, Train Acc: 0.9914
  - Val Loss: 0.6986, Val Acc: 0.8352
  - Test Loss: 0.3360, Test Accuracy: 0.8425
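For reference, the CNN family compared above can be sketched as a minimal 1-D convolutional text classifier. All sizes below (vocabulary, embedding dimension, filter counts, kernel widths) are illustrative assumptions, not the project's actual hyperparameters:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Minimal 1-D convolutional text classifier (illustrative sizes)."""

    def __init__(self, vocab_size=10000, embed_dim=100, num_filters=64,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per kernel width, each sliding over the sequence axis.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, embed_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling for each filter bank.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1)).squeeze(1)  # raw logits
```

The LSTM variant swaps the convolution/pooling stages for an `nn.LSTM` over the embedded sequence; the combined model chains the two.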
The training process took approximately 74.32 seconds to run.
The following plot shows the training and validation accuracy and loss for all three models over the training epochs:
- Accuracy (left plot):
  - All models show improvement in both training and validation accuracy over time.
  - The CNN model (green) shows the fastest increase in training accuracy, reaching near 100% quickly.
  - The RNN LSTM model (blue) shows a more gradual increase in accuracy for both training and validation.
  - The RNN LSTM + CNN model (red) shows a pattern similar to CNN but with slightly lower validation accuracy.
- Loss (right plot):
  - All models show a decrease in both training and validation loss over time.
  - The CNN model's training loss decreases very rapidly, almost reaching zero.
  - The RNN LSTM model shows a more gradual decrease in both training and validation loss.
  - The RNN LSTM + CNN model's loss decrease pattern is similar to CNN but with higher validation loss.
- Overfitting:
  - The gap between training and validation metrics (especially for CNN and RNN LSTM + CNN) suggests some degree of overfitting.
  - The RNN LSTM model seems to have the least overfitting, with training and validation metrics staying closer together.
- Early Stopping:
  - The plots show where early stopping occurred for each model, as indicated by the end of each line.
  - CNN and RNN LSTM + CNN stopped earlier than RNN LSTM, likely due to the early stopping mechanism detecting potential overfitting.
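The early stopping used here (patience of 3 on validation loss, maximum 10 epochs) can be sketched as a small helper. This is my illustration of the mechanism, not the project's exact implementation:

```python
class EarlyStopping:
    """Stop training when validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum change that counts as progress
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, `stopper.step(val_loss)` is checked once per epoch, which is how the CNN and combined models could halt at epoch 5 while the LSTM ran to epoch 8.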
These visualizations support my numerical results and provide insights into the learning dynamics of each model throughout the training process.
- All models achieved good performance, with test accuracies around 84-87%.
- The CNN model showed the fastest convergence and best performance, reaching near-perfect training accuracy by the 5th epoch. However, the gap between training and validation accuracy suggests some overfitting.
- The RNN LSTM model showed more stable performance, with less overfitting than the CNN model.
- The combined RNN LSTM + CNN model didn't outperform the individual models, suggesting that the additional complexity might not be beneficial for this task.
- Early stopping was effective in preventing overfitting, with all models stopping before the maximum number of epochs.
- The CNN model achieved the highest test accuracy, followed closely by the RNN LSTM model.
Initially, I attempted to use TensorFlow for this project. However, I encountered persistent crashes related to protobuf compatibility issues on the M3 Mac. As a result, I switched to PyTorch, which provided better compatibility and performance on this machine.
It's worth noting that these issues might be specific to the M3 Mac architecture, and users with different hardware configurations might not encounter the same problems with TensorFlow.
- Fine-tune hyperparameters for each model to potentially improve performance.
- Experiment with pre-trained word embeddings like Word2Vec or GloVe.
- Implement cross-validation for more robust evaluation.
- Explore attention mechanisms to potentially improve the RNN LSTM model.
- Analyze misclassified examples to gain insights for further improvements.
- Investigate ways to reduce overfitting in the CNN model.
- Experiment with different architectures for the combined RNN LSTM + CNN model to see if its performance can be improved.