- Kapoor, Kartik (2462)
- Nham, Bryan (2494)
- Panyala, Sukrutha (8740)
- Vohra, Vedant (2889)
`amazon_reviews_grocery.tsv` / `amazon_reviews_grocery_100k.tsv`
Column filtering, mapping categorical values to numerical ones, text cleaning, stop-word removal, lemmatization, tokenization, and converting the text to vectors and word embeddings.
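As an illustration of the text-cleaning steps, here is a minimal sketch using the NLTK resources installed in the instructions below (the function name and exact cleaning rules are hypothetical; the real logic lives in `preprocess.py`):

```python
# Illustrative only -- the actual pipeline is implemented in preprocess.py.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_review(text):
    """Lower-case, strip non-letters, drop stop words, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [lemmatizer.lemmatize(t) for t in text.split()
            if t not in STOP_WORDS]

print(clean_review("The berries were fresh and delicious!"))
# e.g. ['berry', 'fresh', 'delicious']
```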
Bi-directional LSTM w/ Attention
- A model that only uses the review text as input
- A model that uses the review text as well as some of the other numerical/categorical features as input (`helpful_votes`, `total_votes`, `vine`, `verified_purchase`)
-> The additional features are added as a second Input layer, which is concatenated with the output of the LSTM just before the Dense layers (see the sketch below).
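A minimal Keras sketch of the two-input variant (layer sizes, sequence length, and the 5-class output are assumptions; the actual architecture and hyperparameters are in `train.py` / `hw3.ipynb`):

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMBED_DIM, SEQ_LEN, N_EXTRA = 20000, 100, 200, 4  # assumed sizes

text_in = layers.Input(shape=(SEQ_LEN,), name="review_text")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(text_in)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Attention()([x, x])          # self-attention over the LSTM outputs
x = layers.GlobalAveragePooling1D()(x)  # pool to a fixed-size vector

# Second Input layer for helpful_votes, total_votes, vine, verified_purchase,
# concatenated just before the Dense layers.
extra_in = layers.Input(shape=(N_EXTRA,), name="extra_features")
x = layers.Concatenate()([x, extra_in])

x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(5, activation="softmax")(x)  # assuming 5 star-rating classes

model = Model(inputs=[text_in, extra_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```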
Accuracy metric + confusion matrix. Compared with a fine-tuned BERT model (refer to the Model comparison section for details).
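As a reference for how these metrics are produced, here is a self-contained sketch assuming scikit-learn (the labels and predictions below are stand-ins; the actual evaluation code is in `test.py`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in ratings; in practice y_pred comes from the trained models.
y_true = np.array([5, 4, 1, 5, 3, 5])
y_pred = np.array([5, 4, 2, 5, 3, 4])

print("Accuracy:", accuracy_score(y_true, y_pred))  # 4/6 correct
print(confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5]))
```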
- Install dependencies
pip3 install -r requirements.txt
- Install NLTK data
python3 -m nltk.downloader stopwords wordnet omw-1.4
- Run the PySpark pre-processing job to generate datasets + embeddings for training and testing (stored in `data/`)
spark-submit preprocess.py <absolute_file_path>
(`<absolute_file_path>` can also be an HDFS path, e.g. `hdfs://10.0.1.111:9000/amazon_reviews_grocery_100k.tsv`)
- Train models (optional, as trained models are already present in `models/`; skip to the next step)
python3 train.py
- Evaluate models
python3 test.py
We have evaluated this model against BERT, a state-of-the-art Transformer-based model, trained via transfer learning and fine-tuned on our dataset.
This approach doesn't require any pre-processing of the text, since BERT does its own tokenization and embedding.
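For illustration, this is what "BERT does its own tokenization" looks like in practice, assuming the Hugging Face `transformers` library (the library choice is an assumption; see `hw3.ipynb` for the actual fine-tuning code):

```python
# Raw review text in, model-ready token ids out -- no manual stop-word
# removal or lemmatization required.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Great coffee, arrived quickly!",
                padding="max_length", truncation=True, max_length=128)
print(len(enc["input_ids"]))  # 128: padded/truncated to a fixed length
```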
The complete code for our Bi-directional LSTM model and the BERT model, as well as the evaluation of both models, can be found in the notebook hw3.ipynb
(or access it here: https://github.com/JediRhymeTrix/COSC-6339-HW3/blob/master/hw3.ipynb)