
Multi-Label Genre and Form Classification of Book Reviews

This repository contains code and resources for fine-tuning a BERT-based model for multi-label genre and form classification of book reviews. It uses language models from Nasjonalbiblioteket (the Norwegian National Library) and a dataset drawn from the open API of Biblioteksentralen. The dataset is highly imbalanced.

Overview

  • Multi-Label Classification: Each book review can be assigned multiple genre and form labels.
  • Fine-Tuning BERT: The model is fine-tuned using a chosen BERT-based language model.
  • Evaluation: The model is evaluated using metrics such as the macro-averaged F1 score.

Resources

Getting Started

1. Install Python (Mac)

Install pyenv:

brew install pyenv

Install xz (if using M1 or M2 Mac):

brew install xz

Install Python (3.12.* is the highest supported version):

pyenv install 3.12.7     

Switch to Python version:

pyenv global 3.12.7     

Verify the Python version:

python --version  

2. Set Up the Virtual Environment

In the root folder of the project, start by creating a virtual environment for managing dependencies:

python -m venv env

Activate the virtual environment:

source env/bin/activate

Install requirements:

pip install -r requirements.txt

3. Install JupyterLab Desktop

https://github.com/jupyterlab/jupyterlab-desktop

Open the project in JupyterLab and select the newly created virtual environment (upper-right corner).

4. Create Dataset

The dataset contains metadata including reviews and associated genre and form labels. Since the dataset is highly imbalanced, techniques such as oversampling, undersampling, or data augmentation may be applied to improve the performance of the model.

Run the create_dataset.ipynb notebook to create the dataset.
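
The notebook defines the actual pipeline against the Biblioteksentralen API. As a rough, self-contained illustration of the data preparation, the genre and form labels attached to each review are typically encoded as multi-hot vectors before training; the column names and example rows below are made up:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up example rows; the real metadata comes from the Biblioteksentralen API.
df = pd.DataFrame({
    "review": ["En spennende krim fra Oslo ...", "En varm roman om oppvekst ..."],
    "labels": [["Krim", "Spenning"], ["Roman"]],
})

# Encode the genre/form label lists as multi-hot vectors (floats for the BCE loss).
mlb = MultiLabelBinarizer()
label_vectors = mlb.fit_transform(df["labels"]).astype(float)

print(mlb.classes_)   # the label vocabulary
print(label_vectors)  # one multi-hot row per review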

5. Describe Dataset

Run the describe_dataset.ipynb notebook to explore and visualize the dataset distribution.
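
As a rough illustration of what the notebook inspects, the label distribution can be visualized by exploding the per-review label lists and counting; the df dataframe and its labels column are assumptions carried over from the previous step:

import matplotlib.pyplot as plt

# df is assumed to have one row per review, with a "labels" column
# containing the list of genre/form labels for that review.
label_counts = df["labels"].explode().value_counts()
print(label_counts.head(10))  # most frequent genre/form labels

label_counts.plot(kind="bar", figsize=(12, 4))
plt.title("Genre and form label distribution")
plt.ylabel("Number of reviews")
plt.tight_layout()
plt.show()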

6. Fine-Tuning

Choose and fine-tune a model by running the fine_tune_model.ipynb notebook (a minimal training sketch follows the list below). This notebook will:

  • load the dataset.
  • process the data for multi-label classification.
  • handle data imbalance using appropriate techniques.
  • fine-tune the model on the prepared dataset.
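
A minimal fine-tuning sketch, assuming the Hugging Face transformers and datasets libraries from requirements.txt, the NbAiLab/nb-bert-base checkpoint, and that texts (review strings) and label_vectors (multi-hot label vectors) come from the dataset step; the notebook's actual code may differ:

from datasets import Dataset, Sequence, Value
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "NbAiLab/nb-bert-base"  # or "NbAiLab/nb-bert-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# texts: list of review strings, label_vectors: list of multi-hot vectors,
# both assumed to come from the create_dataset step.
dataset = Dataset.from_dict({"text": texts, "labels": label_vectors})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)
# The multi-label BCE loss expects float32 label vectors.
dataset = dataset.cast_column("labels", Sequence(Value("float32")))

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(label_vectors[0]),
    problem_type="multi_label_classification",  # sigmoid per label + BCE loss
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="checkpoints",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()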

7. Classification

Once the model has been fine-tuned, you can use the genre_classification.ipynb notebook to classify new book reviews by genre and form (a minimal inference sketch follows the list below). This notebook allows you to:

  • load the fine-tuned model and checkpoint.
  • input book reviews for genre classification.
  • output the predicted genre and form labels for the reviews.
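
A minimal inference sketch, assuming a checkpoint directory written by the fine-tuning step (the path checkpoints/checkpoint-500 and the 0.5 probability threshold are placeholders) and the mlb label binarizer from the dataset step for mapping indices back to label names:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "checkpoints/checkpoint-500"  # placeholder; use your own checkpoint path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

review = "En stemningsfull kriminalroman fra Nord-Norge ..."
inputs = tokenizer(review, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

# Keep every label whose independent sigmoid probability clears the threshold;
# mlb is the (assumed) MultiLabelBinarizer fitted in the dataset step.
predicted = [mlb.classes_[i] for i, p in enumerate(probs) if p > 0.5]
print(predicted)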

8. Evaluation and F1 Macro Score

Model performance is evaluated using several metrics, including the macro-averaged F1 score, which is particularly suited to imbalanced datasets like this one.
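
For reference, the macro-averaged F1 is the unweighted mean of the per-label F1 scores, so rare genres count as much as frequent ones. A minimal example with scikit-learn, using made-up multi-hot predictions for three reviews and four labels:

import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Per-label F1 scores are 1.0, 1.0, 0.0 and 1.0, so the macro average is 0.75.
print(f1_score(y_true, y_pred, average="macro", zero_division=0))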

After training the NB-BERT-base model for one epoch, the F1 macro score was 0.83.

After training the NB-BERT-large model for one epoch, the F1 macro score was 0.89.
