
SC4002 Natural Language Processing Group Assignment - Sentiment Classification

All TensorBoard logs and model checkpoints can be found here and here.

Table of Contents

  • Introduction
  • Setup Instructions
  • Project Structure
  • Code Used for Each Part
  • Running TensorBoard
  • Reproducing Results with Checkpoints
  • Directory Layout

Introduction

This project is a group assignment for the SC4002 Natural Language Processing course, focusing on sentiment classification using various machine learning models. The implementation includes RNNs, LSTMs, GRUs, CNNs, and Transformers, along with techniques for handling out-of-vocabulary (OOV) words. The dataset used is the Rotten Tomatoes movie review dataset.
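
part0.ipynb handles the dataset download. As an illustration, here is a minimal sketch of one common way to fetch the Rotten Tomatoes dataset, assuming the Hugging Face datasets library (the notebook may obtain the data differently):

    from datasets import load_dataset

    # Rotten Tomatoes movie reviews: binary sentiment labels
    # (0 = negative, 1 = positive) with train/validation/test splits.
    dataset = load_dataset("rotten_tomatoes")
    print(dataset["train"][0])  # e.g. {'text': '...', 'label': 1}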

Setup Instructions

  1. Create a Python Virtual Environment and Activate It

    Navigate to the root directory of the project and execute the following commands:

    python -m venv .venv
    source .venv/bin/activate  # On Linux/macOS

    Note for Windows users:

    .\.venv\Scripts\activate
  2. Install Required Packages

    Install all necessary packages using the requirements.txt file:

    pip install -r requirements.txt
  3. Run Jupyter Notebook

    We utilized Jupyter Notebooks to explain and display our outputs interactively. To learn how to run a Jupyter Notebook, refer to the official documentation.

Project Structure

Jupyter Notebooks

  • part0.ipynb: Downloading the dataset.
  • part1.ipynb: Answers Part 1 questions (preparing word embeddings and mitigating OOV).
  • part2.ipynb: Answers Part 2 questions (RNN model).
  • part3a.ipynb: Answers Part 3a questions (RNN model with trainable embeddings).
  • part3b_generate_embedding.ipynb: Generates a new embedding matrix with mitigated OOV.
  • part3b.ipynb: Answers Part 3b questions (RNN model with trainable embeddings and mitigated OOV).
  • part3c_biLSTM.ipynb: Answers Part 3c question on biLSTM.
  • part3c_biGRU.ipynb: Answers Part 3c question on biGRU.
  • part3d.ipynb: Answers Part 3d questions (CNN model).
  • part3e.ipynb: Answers Part 3e questions (Transformer models).
  • part3f.ipynb: Answers Part 3f questions (model comparison).

Utilities

  • utils/analytics.py: Contains code for loading TensorBoard log files and uploading data to Weights & Biases (WandB).
  • utils/text.py: Provides functions for tokenizing and preprocessing text data, as well as computing average context embeddings for OOV words (see the sketch after this list).
  • utils/train.py: Includes training routines for the models, handling logic such as early stopping and logging.
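
To illustrate the contextual-average idea mentioned above, here is a minimal sketch: an OOV token is assigned the mean of the embeddings of the in-vocabulary words around it in the same sentence. The function name and fallback behaviour are illustrative assumptions; see utils/text.py for the actual implementation.

    import numpy as np

    def oov_embedding_from_context(tokens, oov_pos, index_from_word, embedding_matrix):
        """Sketch: embed an OOV token as the mean of its in-vocabulary context."""
        context = [
            embedding_matrix[index_from_word[tok]]
            for pos, tok in enumerate(tokens)
            if pos != oov_pos and tok in index_from_word
        ]
        if not context:
            # No known context words: fall back to a zero vector.
            return np.zeros(embedding_matrix.shape[1], dtype=embedding_matrix.dtype)
        return np.mean(context, axis=0)

Vectors produced this way are the idea behind the OOV rows of models/embedding_matrix_oov.npy.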

Models

  • models/RNN.py: Implementation of the RNN/biLSTM/biGRU model with PyTorch Lightning wrappers for training, validation, and testing (a minimal sketch follows this list).
  • models/CNN.py: Implementation of the CNN model with PyTorch Lightning wrappers.
  • models/MetaModel.py: Implementation of the ensemble meta model.
  • models/embedding_matrix.npy: Embedding matrix based on GoogleNews300 Word2Vec.
  • models/index_from_word.json: A mapping from words to their corresponding indices in the embedding matrix.
  • models/embedding_matrix_oov.npy: Embedding matrix based on GoogleNews300 Word2Vec with OOV words filled with contextual average.
  • models/index_from_word_oov.json: A mapping from words to their corresponding indices in the embedding matrix with OOV words filled with contextual average.
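
To give a feel for the structure, here is a minimal sketch of a PyTorch Lightning wrapper around a simple RNN sentiment classifier that loads a pretrained embedding matrix. The class name, hyperparameters, and training details are illustrative assumptions, not the actual contents of models/RNN.py:

    import numpy as np
    import torch
    import torch.nn as nn
    import pytorch_lightning as pl

    class SentimentRNN(pl.LightningModule):  # illustrative name only
        def __init__(self, embedding_path="models/embedding_matrix.npy",
                     hidden_dim=128, freeze_embeddings=True, lr=1e-3):
            super().__init__()
            # Initialise the embedding layer from the saved matrix.
            weights = torch.from_numpy(np.load(embedding_path)).float()
            self.embedding = nn.Embedding.from_pretrained(weights, freeze=freeze_embeddings)
            self.rnn = nn.RNN(weights.size(1), hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 1)
            self.loss_fn = nn.BCEWithLogitsLoss()
            self.lr = lr

        def forward(self, token_ids):
            _, hidden = self.rnn(self.embedding(token_ids))
            return self.classifier(hidden[-1]).squeeze(-1)  # logits

        def training_step(self, batch, batch_idx):
            token_ids, labels = batch
            loss = self.loss_fn(self(token_ids), labels.float())
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)

A Lightning Trainer then drives training, validation, and testing, with early stopping and logging handled by the routines in utils/train.py.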

Additional Scripts

  • part3e_transformers.py: A pipeline script to train, evaluate, and test Transformer models for Part 3e (a sketch of such an entry point follows this list).
    • Example Usage: python part3e_transformers.py --model roberta
  • part3e_ensemble.py: A script to train and evaluate ensemble models for Part 3e.
  • scripts/xxx.py: Scripts for training certain models from the terminal instead of a Jupyter Notebook (optional; you may run the corresponding Jupyter Notebook directly to train the model instead).
    • Example Usage: python scripts/train_cnn.py
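
For context, here is a minimal sketch of what an entry point like part3e_transformers.py can look like; the model-name mapping and the omitted training loop are assumptions, not the script's actual contents:

    import argparse
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Hypothetical mapping from CLI names to Hugging Face checkpoints;
    # see part3e_transformers.py for the real options.
    MODEL_CHECKPOINTS = {
        "roberta": "roberta-base",
        "bert": "bert-base-uncased",
    }

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--model", choices=list(MODEL_CHECKPOINTS), required=True)
        args = parser.parse_args()

        checkpoint = MODEL_CHECKPOINTS[args.model]
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
        # ... tokenize the dataset splits, then train, evaluate, and test ...

    if __name__ == "__main__":
        main()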

Code Used for Each Part

Part      Files and Scripts Used
Part 1    utils/text.py, part1.ipynb
Part 2    utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part2.ipynb
Part 3a   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3a.ipynb
Part 3b   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3b.ipynb
Part 3c   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3c_biLSTM.ipynb, part3c_biGRU.ipynb
Part 3d   utils/text.py, utils/train.py, utils/analytics.py, models/CNN.py, part3d.ipynb
Part 3e   utils/text.py, utils/train.py, utils/analytics.py, part3e_transformers.py, part3e_ensemble.py, models/MetaModel.py, part3e.ipynb
Part 3f   part3f.ipynb

Running TensorBoard

To view the model training and validation graphs in TensorBoard, follow these instructions:

  1. Download the logs

    Download our TensorBoard logs here.

  2. Launch TensorBoard

    Place the TensorBoard logs under the tb_logs/ directory. To view these logs, run the following command:

    tensorboard --logdir=tb_logs/rnn # for rnn
    tensorboard --logdir=tb_logs/rnn_trainable_embeddings_and_contextual_oov # for rnn_trainable_embeddings_and_contextual_oov
    tensorboard --logdir=tb_logs/cnn # for cnn
    # ...
  3. Open TensorBoard in Your Browser

    After running the command, TensorBoard will display a URL (usually http://localhost:6006/). Open this link in your browser to access the training and validation metrics, including loss and accuracy for each model.
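
Besides the web UI, the logged scalars can also be read programmatically, which is roughly what utils/analytics.py does when loading TensorBoard log files. Here is a minimal sketch using TensorBoard's event-file reader; the run directory and scalar tag are illustrative:

    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    # Point at a single run directory (path is illustrative).
    acc = EventAccumulator("tb_logs/rnn/version_0")
    acc.Reload()  # parse the event files from disk

    print(acc.Tags()["scalars"])           # available scalar tags
    for event in acc.Scalars("val_loss"):  # tag name is an assumption
        print(event.step, event.value)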


Reproducing Results with Checkpoints

To reproduce the results using the provided checkpoints, you can follow these instructions:

  1. Download Checkpoints

    Download our checkpoints here. The checkpoints are saved together with the TensorBoard logs, and only the top 10 model checkpoints are provided. Alternatively, you may download the best model checkpoints that we have collated here.

  2. Run the Jupyter Notebook

    Open the respective Jupyter Notebook, run the cells after the model training portion, and view the results.
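
For reference, loading a Lightning checkpoint outside the notebooks looks roughly like this; the class name and paths are placeholders, so substitute the actual class from models/RNN.py and your downloaded checkpoint:

    import pytorch_lightning as pl
    from models.RNN import SentimentRNN  # placeholder: use the real class name

    # Restore the weights and hyperparameters saved in the checkpoint.
    model = SentimentRNN.load_from_checkpoint("checkpoints/best.ckpt")

    test_dataloader = ...  # build the test dataloader as in the notebooks
    trainer = pl.Trainer(accelerator="auto", devices=1, logger=False)
    trainer.test(model, dataloaders=test_dataloader)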

Directory Layout

Your directory should look like this:

root_dir/
├── .venv/                          
├── best_model_predictions/         
├── models/                         
│   ├── CNN.py                      
│   ├── RNN.py                      
│   ├── MetaModel.py                
│   ├── embedding_matrix.npy        
│   ├── embedding_matrix_oov.npy    
│   ├── index_from_word.json        
│   ├── index_from_word_oov.json    
│   ├── word2vec-google-news-300    
│   └── word2vec-google-news-300.vectors.npy
├── scripts/                        
│   ├── train_bigru.py              
│   ├── train_cnn.py                
│   ├── train_lstm.py               
│   ├── train_rnn_part_2.py         
│   ├── train_rnn_part_3a.py        
│   └── ...        
├── tb_logs/                        
│   ├── rnn/                        
│   ├── rnn_trainable_embeddings/                        
│   ├── rnn_trainable_embeddings_and_contextual_oov/                        
│   ├── cnn/                        
│   ├── bigru/                        
│   ├── bilstm/                        
│   ├── transformers/                        
│   └── ...                        
├── utils/                          
│   ├── analytics.py                
│   ├── text.py                     
│   └── train.py                    
├── .gitignore                      
├── requirements.txt                
├── README.md                       
├── part0.ipynb                 
├── part1.ipynb                 
├── part2.ipynb                 
├── part3a.ipynb                
├── part3b_generate_embedding.ipynb
├── part3b.ipynb                
├── part3c_biGRU.ipynb          
├── part3c_biLSTM.ipynb         
├── part3d.ipynb                
├── part3e.ipynb                
├── part3e_ensemble.py          
├── part3e_transformers.py      
└── part3f.ipynb
