
SC4002 Natural Language Processing Group Assignment - Sentiment Classification

All TensorBoard logs and model checkpoints can be found here and here.

Table of Contents

  • Introduction
  • Setup Instructions
  • Project Structure
  • Code Used for Each Part
  • Running TensorBoard
  • Reproducing Results with Checkpoints
  • Directory Layout

Introduction

This project is a group assignment for the SC4002 Natural Language Processing course, focusing on sentiment classification using various machine learning models. The implementation includes RNNs, LSTMs, GRUs, CNNs, and Transformers, along with techniques for handling out-of-vocabulary (OOV) words. The dataset used is the Rotten Tomatoes movie review dataset.
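
part0.ipynb handles the dataset download. As an illustration, here is a minimal sketch of one common way to fetch the Rotten Tomatoes dataset, assuming the Hugging Face datasets library (the notebook may obtain the data differently):

    from datasets import load_dataset

    # Rotten Tomatoes movie reviews: binary sentiment labels
    # (0 = negative, 1 = positive) with train/validation/test splits.
    dataset = load_dataset("rotten_tomatoes")
    print(dataset["train"][0])  # e.g. {'text': '...', 'label': 1}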

Setup Instructions

  1. Create a Python Virtual Environment and Activate It

    Navigate to the root directory of the project and execute the following commands:

    python -m venv .venv
    source .venv/bin/activate  # On Linux/macOS

    Note for Windows users:

    .\.venv\Scripts\activate
  2. Install Required Packages

    Install all necessary packages using the requirements.txt file:

    pip install -r requirements.txt
  3. Run Jupyter Notebook

    We utilized Jupyter Notebooks to explain and display our outputs interactively. To learn how to run a Jupyter Notebook, refer to the official documentation.

Project Structure

Jupyter Notebooks

  • part0.ipynb: Downloading the dataset.
  • part1.ipynb: Answers Part 1 questions (preparing word embeddings and mitigating OOV).
  • part2.ipynb: Answers Part 2 questions (RNN model).
  • part3a.ipynb: Answers Part 3a questions (RNN model with trainable embeddings).
  • part3b_generate_embedding.ipynb: Generates a new embedding matrix with mitigated OOV.
  • part3b.ipynb: Answers Part 3b questions (RNN model with trainable embeddings and mitigated OOV).
  • part3c_biLSTM.ipynb: Answers Part 3c question on biLSTM.
  • part3c_biGRU.ipynb: Answers Part 3c question on biGRU.
  • part3d.ipynb: Answers Part 3d questions (CNN model).
  • part3e.ipynb: Answers Part 3e questions (Transformer models).
  • part3f.ipynb: Answers Part 3f questions (model comparison).

Utilities

  • utils/analytics.py: Contains code for loading TensorBoard log files and uploading data to Weights & Biases (WandB).
  • utils/text.py: Provides functions for tokenizing and preprocessing text data, as well as computing average context embeddings for OOV words (see the sketch after this list).
  • utils/train.py: Includes training routines for the models, handling logic such as early stopping and logging.
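
To illustrate the contextual-average idea mentioned above, here is a minimal sketch: an OOV token is assigned the mean of the embeddings of the in-vocabulary words around it in the same sentence. The function name and fallback behaviour are illustrative assumptions; see utils/text.py for the actual implementation.

    import numpy as np

    def oov_embedding_from_context(tokens, oov_pos, index_from_word, embedding_matrix):
        """Sketch: embed an OOV token as the mean of its in-vocabulary context."""
        context = [
            embedding_matrix[index_from_word[tok]]
            for pos, tok in enumerate(tokens)
            if pos != oov_pos and tok in index_from_word
        ]
        if not context:
            # No known context words: fall back to a zero vector.
            return np.zeros(embedding_matrix.shape[1], dtype=embedding_matrix.dtype)
        return np.mean(context, axis=0)

Vectors produced this way are the idea behind the OOV rows of models/embedding_matrix_oov.npy.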

Models

  • models/RNN.py: Implementation of the RNN/biLSTM/biGRU model with PyTorch Lightning wrappers for training, validation, and testing (a minimal sketch follows this list).
  • models/CNN.py: Implementation of the CNN model with PyTorch Lightning wrappers.
  • models/MetaModel.py: Implementation of the ensemble meta model.
  • models/embedding_matrix.npy: Embedding matrix based on GoogleNews300 Word2Vec.
  • models/index_from_word.json: A mapping from words to their corresponding indices in the embedding matrix.
  • models/embedding_matrix_oov.npy: Embedding matrix based on GoogleNews300 Word2Vec with OOV words filled with contextual average.
  • models/index_from_word_oov.json: A mapping from words to their corresponding indices in the embedding matrix with OOV words filled with contextual average.
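
To give a feel for the structure, here is a minimal sketch of a PyTorch Lightning wrapper around a simple RNN sentiment classifier that loads a pretrained embedding matrix. The class name, hyperparameters, and training details are illustrative assumptions, not the actual contents of models/RNN.py:

    import numpy as np
    import torch
    import torch.nn as nn
    import pytorch_lightning as pl

    class SentimentRNN(pl.LightningModule):  # illustrative name only
        def __init__(self, embedding_path="models/embedding_matrix.npy",
                     hidden_dim=128, freeze_embeddings=True, lr=1e-3):
            super().__init__()
            # Initialise the embedding layer from the saved matrix.
            weights = torch.from_numpy(np.load(embedding_path)).float()
            self.embedding = nn.Embedding.from_pretrained(weights, freeze=freeze_embeddings)
            self.rnn = nn.RNN(weights.size(1), hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 1)
            self.loss_fn = nn.BCEWithLogitsLoss()
            self.lr = lr

        def forward(self, token_ids):
            _, hidden = self.rnn(self.embedding(token_ids))
            return self.classifier(hidden[-1]).squeeze(-1)  # logits

        def training_step(self, batch, batch_idx):
            token_ids, labels = batch
            loss = self.loss_fn(self(token_ids), labels.float())
            self.log("train_loss", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.lr)

A Lightning Trainer then drives training, validation, and testing, with early stopping and logging handled by the routines in utils/train.py.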

Additional Scripts

  • part3e_transformers.py: A pipeline script to train, evaluate, and test Transformer models for Part 3e (a sketch of such an entry point follows this list).
    • Example Usage: python part3e_transformers.py --model roberta
  • part3e_ensemble.py: A script to train and evaluate ensemble models for Part 3e.
  • scripts/xxx.py: Scripts for training certain models from the terminal instead of a Jupyter Notebook (optional; you may run the corresponding Jupyter Notebook directly to train the model instead).
    • Example Usage: python scripts/train_cnn.py
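
For context, here is a minimal sketch of what an entry point like part3e_transformers.py can look like; the model-name mapping and the omitted training loop are assumptions, not the script's actual contents:

    import argparse
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Hypothetical mapping from CLI names to Hugging Face checkpoints;
    # see part3e_transformers.py for the real options.
    MODEL_CHECKPOINTS = {
        "roberta": "roberta-base",
        "bert": "bert-base-uncased",
    }

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--model", choices=list(MODEL_CHECKPOINTS), required=True)
        args = parser.parse_args()

        checkpoint = MODEL_CHECKPOINTS[args.model]
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
        # ... tokenize the dataset splits, then train, evaluate, and test ...

    if __name__ == "__main__":
        main()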

Code Used for Each Part

Part      Files and Scripts Used
Part 1    utils/text.py, part1.ipynb
Part 2    utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part2.ipynb
Part 3a   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3a.ipynb
Part 3b   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3b.ipynb
Part 3c   utils/text.py, utils/train.py, utils/analytics.py, models/RNN.py, part3c_biLSTM.ipynb, part3c_biGRU.ipynb
Part 3d   utils/text.py, utils/train.py, utils/analytics.py, models/CNN.py, part3d.ipynb
Part 3e   utils/text.py, utils/train.py, utils/analytics.py, part3e_transformers.py, part3e_ensemble.py, models/MetaModel.py, part3e.ipynb
Part 3f   part3f.ipynb

Running TensorBoard

To view the model training and validation graphs in TensorBoard, follow these instructions:

  1. Download the logs

    Download our TensorBoard logs here.

  2. Launch TensorBoard

    Place the TensorBoard logs under the tb_logs/ directory. To view these logs, run the following command:

    tensorboard --logdir=tb_logs/rnn # for rnn
    tensorboard --logdir=tb_logs/rnn_trainable_embeddings_and_contextual_oov # for rnn_trainable_embeddings_and_contextual_oov
    tensorboard --logdir=tb_logs/cnn # for cnn
    # ...
  3. Open TensorBoard in Your Browser

    After running the command, TensorBoard will display a URL (usually http://localhost:6006/). Open this link in your browser to access the training and validation metrics, including loss and accuracy for each model.
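
Besides the web UI, the logged scalars can also be read programmatically, which is roughly what utils/analytics.py does when loading TensorBoard log files. Here is a minimal sketch using TensorBoard's event-file reader; the run directory and scalar tag are illustrative:

    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    # Point at a single run directory (path is illustrative).
    acc = EventAccumulator("tb_logs/rnn/version_0")
    acc.Reload()  # parse the event files from disk

    print(acc.Tags()["scalars"])           # available scalar tags
    for event in acc.Scalars("val_loss"):  # tag name is an assumption
        print(event.step, event.value)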


Reproducing Results with Checkpoints

To reproduce the results using the provided checkpoints, you can follow these instructions:

  1. Download Checkpoints

    Download our checkpoints here. The checkpoints are saved together with the TensorBoard logs, and only the top 10 model checkpoints are provided. Alternatively, you may download the best model checkpoints that we have collated here.

  2. Run the Jupyter Notebook

    Open the respective Jupyter Notebook, run the cells after the model training portion, and view the results.
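
For reference, loading a Lightning checkpoint outside the notebooks looks roughly like this; the class name and paths are placeholders, so substitute the actual class from models/RNN.py and your downloaded checkpoint:

    import pytorch_lightning as pl
    from models.RNN import SentimentRNN  # placeholder: use the real class name

    # Restore the weights and hyperparameters saved in the checkpoint.
    model = SentimentRNN.load_from_checkpoint("checkpoints/best.ckpt")

    test_dataloader = ...  # build the test dataloader as in the notebooks
    trainer = pl.Trainer(accelerator="auto", devices=1, logger=False)
    trainer.test(model, dataloaders=test_dataloader)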

Directory Layout

Your directory should look like this:

root_dir/
├── .venv/                          
├── best_model_predictions/         
├── models/                         
│   ├── CNN.py                      
│   ├── RNN.py                      
│   ├── MetaModel.py                
│   ├── embedding_matrix.npy        
│   ├── embedding_matrix_oov.npy    
│   ├── index_from_word.json        
│   ├── index_from_word_oov.json    
│   ├── word2vec-google-news-300    
│   └── word2vec-google-news-300.vectors.npy
├── scripts/                        
│   ├── train_bigru.py              
│   ├── train_cnn.py                
│   ├── train_lstm.py               
│   ├── train_rnn_part_2.py         
│   ├── train_rnn_part_3a.py        
│   └── ...        
├── tb_logs/                        
│   ├── rnn/                        
│   ├── rnn_trainable_embeddings/                        
│   ├── rnn_trainable_embeddings_and_contextual_oov/                        
│   ├── cnn/                        
│   ├── bigru/                        
│   ├── bilstm/                        
│   ├── transformers/                        
│   └── ...                        
├── utils/                          
│   ├── analytics.py                
│   ├── text.py                     
│   └── train.py                    
├── .gitignore                      
├── requirements.txt                
├── README.md                       
├── part0.ipynb                 
├── part1.ipynb                 
├── part2.ipynb                 
├── part3a.ipynb                
├── part3b_generate_embedding.ipynb
├── part3b.ipynb                
├── part3c_biGRU.ipynb          
├── part3c_biLSTM.ipynb         
├── part3d.ipynb                
├── part3e.ipynb                
├── part3e_ensemble.py          
├── part3e_transformers.py      
└── part3f.ipynb
