
Document Classification using Large Language Models

This repository contains the code for the project Document Classification under the Woo (the Dutch Open Government Act). The goal of this research is to find an effective method to classify the documents that the Municipality of Amsterdam publishes. We evaluate three LLMs for this task: Llama, Mistral and GEITje.


The research consists of four main parts:

  1. Truncation experiment. The documents are too long to feed to the models as-is, so a truncation experiment is run to find the best threshold at which to shorten them (see the sketch after this list).
  2. In-context learning experiment. To evaluate how well the LLMs perform the task out of the box, an in-context learning experiment is performed, comparing the performance of a zero-shot prompt to that of a few-shot prompt.
  3. Fine-tuning experiment. We compare the in-context learning results to those of the LLMs after they have been fine-tuned on the task.
  4. Baselines experiment. To evaluate whether the LLMs are worth the effort, we compare them to simple baselines, such as Naïve Bayes and Linear SVM.
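
As an illustration of part 1, the sketch below shortens a document to a fixed token budget with a HuggingFace tokenizer. The Mistral model name and the 512-token budget are assumptions for the example; the experiment itself searches over several thresholds.

from transformers import AutoTokenizer

MAX_TOKENS = 512  # hypothetical threshold, not the one selected by the experiment
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def truncate_document(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Keep the first max_tokens tokens of a document and decode back to text."""
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

shortened = truncate_document("A very long OCR'd document ...")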

Folder Structure

  • src: All source code files specific to this project.
  • output: Folder to save data and predictions. It has the same structure as the Blobfuse folder.

Installation

  1. Clone this repository:
git clone https://github.com/Amsterdam-Internships/document-classification-using-large-language-models.git
  2. Install all dependencies:

Here is an overview of the requirements:

pip install torch
pip install datasets==2.19.1
pip install transformers==4.40.2
pip install trl
pip install accelerate
pip install sentencepiece
pip install jupyter
pip install protobuf 
pip install bitsandbytes
pip install wandb==0.13.3
pip install tensorboardX

Additionally, we have included a requirements.txt file listing all libraries and their versions.
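
All dependencies can also be installed in one step from that file:

pip install -r requirements.txt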

The code has been tested with Python 3.9 on Windows.

How it works

All code can be run from the notebooks, which are numbered in the intended order of use.

  • 0FileOverview -> An overview of all the data files, with their columns explained. Does not include code.
  • 1load_txt -> Loads the text from the OCR files. Note that there is a separate notebook to load the files from Blobfuse, because of its complicated folder structure.
  • 2clean_data -> Removes messy documents and duplicates, and splits the data into subsets.
  • 3TokenizeText -> Tokenizes the documents using either the Mistral or the Llama tokenizer.
  • 4FinetuningDataFormatting -> Formats the data frame with documents into a dataset that is pushed to HuggingFace.
  • 5Finetuning -> Fine-tunes the LLMs on the HuggingFace dataset (illustrative sketches of notebooks 5-7 follow this list).
  • 6GetPredictions -> Runs the experiments for the LLMs (in-context learning and fine-tuning).
  • 7baseline -> Trains the baselines and runs the experiments.
  • 8plot -> Notebook with plots.
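
A minimal sketch of what notebooks 4 and 5 do: format (document, label) pairs into a dataset and fine-tune with trl's SFTTrainer. The model name, labels, prompt template and hub repository id are illustrative assumptions, and the exact SFTTrainer arguments differ between trl versions.

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# Hypothetical (document, label) pairs; the real data comes from the Woo documents.
records = [
    {"document": "Besluit over de subsidieaanvraag ...", "label": "besluit"},
    {"document": "Raadsbrief over de begroting ...", "label": "raadsbrief"},
]

def to_text(example):
    # Turn a pair into a single training string, using an assumed prompt template.
    example["text"] = f"Classify this document.\n\n{example['document']}\n\nCategory: {example['label']}"
    return example

dataset = Dataset.from_list(records).map(to_text)
# dataset.push_to_hub("your-org/woo-documents")  # as in notebook 4; repo id is a placeholder

model_name = "mistralai/Mistral-7B-v0.1"  # illustrative
trainer = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(model_name),
    tokenizer=AutoTokenizer.from_pretrained(model_name),
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    args=TrainingArguments(output_dir="output/finetune", num_train_epochs=1),
)
trainer.train()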
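
For the in-context learning runs in notebook 6, the difference between a zero-shot and a few-shot prompt boils down to whether labelled examples are prepended. The templates and category names below are assumptions, not the prompts used in the notebook.

# Zero-shot: the model only gets the task description and the document.
ZERO_SHOT = (
    "Classify the following document into one of these categories: "
    "besluit, raadsbrief, rapport.\n\nDocument: {document}\n\nCategory:"
)

# Few-shot: labelled examples are prepended before the target document.
FEW_SHOT = (
    "Classify documents into one of these categories: besluit, raadsbrief, rapport.\n\n"
    "Document: Besluit over de subsidieaanvraag ...\nCategory: besluit\n\n"
    "Document: {document}\n\nCategory:"
)

prompt = FEW_SHOT.format(document="...")  # feed this string to the LLM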
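
And the baselines from notebook 7 can be approximated in a few lines of scikit-learn (not in the pinned requirements above); the TF-IDF features and toy data are assumptions for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data standing in for the real document subsets.
train_texts = ["Besluit over de subsidieaanvraag ...", "Raadsbrief over de begroting ..."]
train_labels = ["besluit", "raadsbrief"]

for clf in (MultinomialNB(), LinearSVC()):
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    print(type(clf).__name__, pipeline.predict(["Brief over de begroting ..."]))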

Stand-alone notebooks:

Contributing

Feel free to help out! Open an issue, submit a PR or contact us.

HuggingFace

The fine-tuned models, the dataset with the complete documents and the conversation dataset used to fine-tune the models are published on HuggingFace.
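
A published model can then be loaded directly with transformers; the repository id below is a placeholder, not the actual model name.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/woo-classifier"  # placeholder; use the actual HuggingFace repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)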

Acknowledgements

This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.

License

This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).
