This repository contains the code for the project Document Classification under the Woo (Open Government Act). The goal of this research is to find an effective method for classifying the documents that the Municipality of Amsterdam publishes. We evaluate three LLMs for this task: Llama, Mistral, and GEITje.
The research consists of four main parts:
- Truncation experiment. The documents are too long to feed to the models in full, so a truncation experiment is run to find the best threshold at which to shorten them (a minimal truncation sketch follows this list).
- In-context learning experiment. To evaluate how well the LLMs perform this task out of the box, an in-context learning experiment is performed. We compare the performance of a zero-shot prompt to that of a few-shot prompt (see the prompting sketch after this list).
- Fine-tuning experiment. We compare the in-context learning results to those of the LLMs after they have been fine-tuned on the task.
- Baselines experiment. To evaluate whether the LLMs are worth the extra effort, we compare them to simple baselines, such as Naïve Bayes and Linear SVM.
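
For illustration, truncating a document to a fixed token budget with a HuggingFace tokenizer could look like the sketch below. The checkpoint and the 512-token threshold are placeholders, not the values selected in the experiment:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint and threshold; the experiment sweeps several thresholds.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def truncate_document(text: str, max_tokens: int = 512) -> str:
    """Keep only the first `max_tokens` tokens of a document."""
    token_ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(token_ids, skip_special_tokens=True)
```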
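Likewise, the difference between the two prompt styles comes down to whether labelled example documents are prepended to the prompt. A minimal sketch, with hypothetical category labels:

```python
# Hypothetical category labels, for illustration only.
LABELS = ["besluit", "rapport", "raadsbrief"]

def zero_shot_prompt(document: str) -> str:
    # The model sees only the task description and the target document.
    return (
        f"Classify the following document into one of these categories: "
        f"{', '.join(LABELS)}.\n\nDocument:\n{document}\n\nCategory:"
    )

def few_shot_prompt(document: str, examples: list[tuple[str, str]]) -> str:
    # Labelled (document, category) pairs are prepended before the target document.
    shots = "\n\n".join(
        f"Document:\n{doc}\nCategory: {label}" for doc, label in examples
    )
    return (
        f"Classify each document into one of these categories: "
        f"{', '.join(LABELS)}.\n\n{shots}\n\nDocument:\n{document}\nCategory:"
    )
```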
The project is structured as follows:

- `local_data`: Folder with data; its structure matches the folder on Azure.
- `local_data/data_files`: Folder with demo files.
- `local_data/predictionsFinal`: Folder structure for saving the predictions. The predictions themselves are saved on Azure.
- `notebooks`: Jupyter notebooks / tutorials.
- `PredictionAnalysis`: Jupyter notebooks about prediction analysis only.
- `src`: All source code files specific to this project.
- `output`: Folder to save data and predictions. Same structure as the Blobfuse folder.
- Clone this repository:
```
git clone https://github.com/AmsterdamInternships/document-classification-using-large-language-models.git
```
- Install all dependencies:
Here is an overview of the requirements:

```
pip install torch
pip install datasets==2.19.1
pip install transformers==4.40.2
pip install trl
pip install accelerate
pip install sentencepiece
pip install jupyter
pip install protobuf
pip install bitsandbytes
pip install wandb==0.13.3
pip install tensorboardX
```
Additionally, we have included a requirements.txt file listing all libraries and their versions.
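With that file, everything can be installed in one step (assuming it sits at the repository root):

```
pip install -r requirements.txt
```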
The code has been tested with Python 3.9 on Windows.
All code can be run from the notebooks, which are numbered in their order of use.

- `0FileOverview` -> An overview of all the data files with their columns explained. Does not include code.
- `1load_txt` -> Loads the text from the OCR files. Note that there is a separate notebook to load the files from Blobfuse, because of the complicated folder structure.
- `2clean_data` -> Removes messy documents and duplicates. Splits the data into subsets.
- `3TokenizeText` -> Tokenizes the documents using either the Mistral or the Llama tokenizer.
- `4FinetuningDataFormatting` -> Formats the data frame with documents into a dataset that is pushed to HuggingFace.
- `5Finetuning` -> Fine-tunes the LLMs on the HuggingFace dataset (a minimal training sketch follows this overview).
- `6GetPredictions` -> Runs the experiments for the LLMs (in-context and fine-tuned).
- `7baseline` -> Trains the baselines and runs the experiments (see the baseline sketch after this overview).
- `8plot` -> Notebook with plots.
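
The fine-tuning step is built on `trl`. A minimal sketch of supervised fine-tuning with `SFTTrainer`, assuming a `trl` release contemporary with the pinned `transformers` version; the dataset name, text field, and hyperparameters below are placeholders, not the settings used in `5Finetuning`:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# All names below are placeholders: the real dataset and hyperparameters
# live in the 4FinetuningDataFormatting and 5Finetuning notebooks.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = load_dataset("your-org/your-woo-dataset", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # column with the formatted training prompts
    max_seq_length=1024,
    args=TrainingArguments(output_dir="finetuned", per_device_train_batch_size=1),
)
trainer.train()
```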
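For the baselines, a TF-IDF pipeline with scikit-learn (not listed in the requirements above, so assumed to be installed separately) is the classic setup; the training data here is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data, for illustration only.
train_texts = ["eerste voorbeelddocument ...", "tweede voorbeelddocument ..."]
train_labels = ["besluit", "rapport"]

# Fit each baseline on TF-IDF features and classify a new document.
for clf in (MultinomialNB(), LinearSVC()):
    pipeline = make_pipeline(TfidfVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    print(clf.__class__.__name__, pipeline.predict(["nieuw document ..."]))
```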
Stand-alone notebooks:
- `EDA`
- `load_txt_azure` -> Azure version of `1load_txt.ipynb`. The folders on Blobfuse are messily structured.
- `RepairMistralPredictions` -> Repairs mistakes made by Mistral; these fixes are specific to Mistral.
Feel free to help out! Open an issue, submit a PR or contact us.
The fine-tuned models, the dataset with the complete documents and the conversation dataset used to fine-tune the models are published on HuggingFace.
This repository was created in collaboration with Amsterdam Intelligence for the City of Amsterdam.
This project is licensed under the terms of the European Union Public License 1.2 (EUPL-1.2).