This repository contains the code for the project Document Classification under the Woo (Open Government Act). The goal of this research is to find an effective method for classifying the documents that the Municipality of Amsterdam publishes. We evaluate three LLMs for this task: Llama, Mistral, and GEITje.
The research consists of four main parts:
- Truncation experiment. The documents are too long to feed to the models in full, so a truncation experiment is run to find the best threshold at which to shorten them (a minimal truncation sketch follows this list).
- In-context learning experiment. To evaluate how well the LLMs perform this task out of the box, an in-context learning experiment is performed. We compare the performance of a zero-shot prompt to that of a few-shot prompt (see the prompting sketch after this list).
- Fine-tuning experiment. We compare the in-context learning results to those of the LLMs after they have been fine-tuned on the task.
- Baselines experiment. To evaluate whether the LLMs are worth the extra effort, we compare them to simple baselines, such as Naïve Bayes and Linear SVM.
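
For illustration, truncating a document to a fixed token budget with a HuggingFace tokenizer could look like the sketch below. The checkpoint and the 512-token threshold are placeholders, not the values selected in the experiment:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and threshold; the experiment sweeps several thresholds.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def truncate_document(text: str, max_tokens: int = 512) -> str:
    """Keep only the first `max_tokens` tokens of a document."""
    token_ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(token_ids, skip_special_tokens=True)
```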
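Likewise, the difference between the two prompt styles comes down to whether labelled example documents are prepended to the prompt. A minimal sketch, with hypothetical category labels:

```python
# Hypothetical category labels, for illustration only.
LABELS = ["besluit", "rapport", "raadsbrief"]

def zero_shot_prompt(document: str) -> str:
    # The model sees only the task description and the target document.
    return (
        f"Classify the following document into one of these categories: "
        f"{', '.join(LABELS)}.\n\nDocument:\n{document}\n\nCategory:"
    )

def few_shot_prompt(document: str, examples: list[tuple[str, str]]) -> str:
    # Labelled (document, category) pairs are prepended before the target document.
    shots = "\n\n".join(
        f"Document:\n{doc}\nCategory: {label}" for doc, label in examples
    )
    return (
        f"Classify each document into one of these categories: "
        f"{', '.join(LABELS)}.\n\n{shots}\n\nDocument:\n{document}\nCategory:"
    )
```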
The project is structured as follows:

- `local_data`: Folder with data; its structure matches the folder on Azure.
- `local_data/data_files`: Folder with demo files.
- `local_data/predictionsFinal`: Folder structure for saving the predictions. The predictions themselves are saved on Azure.
- `notebooks`: Jupyter notebooks / tutorials.
- `PredictionAnalysis`: Jupyter notebooks about prediction analysis only.
- `src`: All source code files specific to this project.
- `output`: Folder to save data and predictions. Same structure as the Blobfuse folder.
- Clone this repository:
```
git clone https://github.com/AmsterdamInternships/document-classification-using-large-language-models.git
```
- Install all dependencies:
Here is an overview of the requirements:

```
pip install torch
pip install datasets==2.19.1
pip install transformers==4.40.2
pip install trl
pip install accelerate
pip install sentencepiece
pip install jupyter
pip install protobuf
pip install bitsandbytes
pip install wandb==0.13.3
pip install tensorboardX
```
Additionally, we have included a requirements.txt file listing all libraries and their versions.
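With that file, everything can be installed in one step (assuming it sits at the repository root):

```
pip install -r requirements.txt
```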
The code has been tested with Python 3.9 on Windows.
All code can be run from the notebooks, which are numbered in their order of use.

- `0FileOverview` -> An overview of all the data files with their columns explained. Does not include code.
- `1load_txt` -> Loads the text from the OCR files. Note that there is a separate notebook to load the files from Blobfuse, because of the complicated folder structure.
- `2clean_data` -> Removes messy documents and duplicates. Splits the data into subsets.
- `3TokenizeText` -> Tokenizes the documents using either the Mistral or the Llama tokenizer.
- `4FinetuningDataFormatting` -> Formats the data frame with documents into a dataset that is pushed to HuggingFace.
- `5Finetuning` -> Fine-tunes the LLMs on the HuggingFace dataset (a minimal training sketch follows this overview).
- `6GetPredictions` -> Runs the experiments for the LLMs (in-context and fine-tuned).
- `7baseline` -> Trains the baselines and runs the experiments (see the baseline sketch after this overview).
- `8plot` -> Notebook with plots.
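
The fine-tuning step is built on `trl`. A minimal sketch of supervised fine-tuning with `SFTTrainer`, assuming a `trl` release contemporary with the pinned `transformers` version; the dataset name, text field, and hyperparameters below are placeholders, not the settings used in `5Finetuning`:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# All names below are placeholders: the real dataset and hyperparameters
# live in the 4FinetuningDataFormatting and 5Finetuning notebooks.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("your-org/your-woo-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # column with the formatted training prompts
    max_seq_length=1024,
    args=TrainingArguments(output_dir="finetuned", per_device_train_batch_size=1),
)
trainer.train()
```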
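For the baselines, a TF-IDF pipeline with scikit-learn (not listed in the requirements above, so assumed to be installed separately) is the classic setup; the training data here is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data, for illustration only.
train_texts = ["eerste voorbeelddocument ...", "tweede voorbeelddocument ..."]
train_labels = ["besluit", "rapport"]

# Fit each baseline on TF-IDF features and classify a new document.
for clf in (MultinomialNB(), LinearSVC()):
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    print(clf.__class__.__name__, pipeline.predict(["nieuw document ..."]))
```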
Stand-alone notebooks:
- `EDA`
- `load_txt_azure` -> Azure version of `1load_txt.ipynb`. The folders on Blobfuse are messily structured.
- `RepairMistralPredictions` -> Repairs mistakes made by Mistral; these fixes are specific to Mistral.
Feel free to help out! Open an issue, submit a PR or contact us.
The fine-tuned models, the dataset with the complete documents and the conversation dataset used to fine-tune the models are published on HuggingFace.
This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.
This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).