This project focuses on building a sentiment analysis model to predict the sentiment of customer reviews using DistilBERT. The pipeline involves scraping reviews, preprocessing the data, training a fine-tuned DistilBERT model, and deploying it through a Flask application. The workflow is automated using Apache Airflow.
-
DistilBERT is a smaller, faster, and more efficient version of BERT. It was developed through a process called knowledge distillation, where a smaller model (the student) is trained to reproduce the behavior of a larger model (the teacher) while retaining most of its performance.
-
DistilBERT retains 97% of BERT's performance on various NLP tasks while being 40% smaller and 60% faster.
-
BERT consists of 12 transformer layers, an embedding layer and a prediction layer.
-
During distillation, DistilBERT learns from BERT by mimicking its outputs (logits) and intermediate representations.
-
As a resilt, DistilBERT has 6 transformer layers (half of BERT's 12 layers) while maintaining similar functionality.
-
Data Collection:
- Scraped 20,000+ Amazon reviews using Selenium WebDriver.
- Stored the collected data securely in an AWS S3 bucket.
-
Data Preprocessing:
- Cleaned and processed reviews and ratings:
- Removed non-English Reviews
- Removed stopwords, special characters, and extra spaces.
- Performed lemmatization and stemming to normalize text.
- Encoded ratings ranging from 0-5 stars to labels negative(0), neutral(1) and positive(2)
- Stored the processed data in S3 for further use.
- Cleaned and processed reviews and ratings:
-
Data Tokenization and Preparation:
- Tokenized reviews using DistilBERT tokenizer.
- Divided the dataset into training, validation, and testing sets.
-
Model Training:
- Imported DistilBERT from Hugging Face Transformers.
- Fine-tuned the model on the dataset.
- Implemented early stopping to optimize training.
- Used customized class weights to handle class imbalance in training dataset.
-
Deployment:
-
Workflow Automation:
- Accuracy: Achieved an accuracy of 85% on the test dataset.
- F1-score: Achieved f1-scores of 0.9 for positive class and 0.7 for others on the test dataset.
- Enhanced multi-class classification performance by implementing class-weighted training, improving the Precision-Recall AUC for minority classes (e.g., 'neutral') by over 30%, resulting in more balanced and accurate predictions across all categories
To get started with this project, follow these steps:
-
Clone the repository:
git clone https://github.com/SathvikNayak123/sentiment-anyalysis
-
Install the necessary dependencies:
pip install -r requirements.txt
-
Add AWS S3 credentials in .env file
-
Run Airflow to execute scraping and training pipeline:
astro dev init astro dev start
-
Run app for prediction
python app.py