This repository hosts the implementation of the Master's Thesis titled "Domain-Agnostic Approaches to Competency Extraction via Large Language Models" by Bertil Braun, submitted to the Karlsruhe Institute of Technology (KIT). The thesis develops an innovative system utilizing Large Language Models (LLMs) to extract competencies from a variety of document types, improving upon existing methods that struggle with unstructured data across diverse domains.
The competency extraction system is built on a multi-phase approach that includes selecting, fine-tuning, and evaluating LLMs across different types of documents to create accurate competency profiles. This section provides an overview of each major component of the system:
The process begins with the collection of input documents, primarily papers written by authors from various domains. These documents are fetched and preprocessed in `src/logic/papers.py`, which includes the extraction of relevant text and metadata.
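The preprocessing step can be pictured as normalizing the fetched fields into a simple record before extraction. The `Paper` dataclass and `preprocess` function below are a hypothetical sketch; the actual fields and logic live in `src/logic/papers.py` and may differ.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Sketch of a preprocessed input document (hypothetical field names)."""
    title: str
    abstract: str
    full_text: str


def preprocess(raw_title: str, raw_abstract: str, raw_text: str) -> Paper:
    """Collapse whitespace in the fetched fields before competency extraction."""
    def clean(s: str) -> str:
        return ' '.join(s.split())

    return Paper(clean(raw_title), clean(raw_abstract), clean(raw_text))
```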
Extracted summaries are then processed to identify and extract competency profiles. This extraction is performed by LLMs designed to pull relevant competencies from the documents. Each document's competencies are initially profiled individually in `src/extraction` using three different extraction methods.
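At its core, an extraction method of this kind builds a prompt from the document text and parses the LLM's response into a list of competencies. The prompt wording and parsing below are an illustrative assumption, not the repository's actual prompts (which are in `src/extraction`):

```python
def build_extraction_prompt(document_text: str) -> str:
    # Hypothetical prompt shape; the real prompts differ per extraction method.
    return (
        'Extract the professional competencies demonstrated in the following '
        'document as a bullet list, one competency per line:\n\n'
        + document_text
    )


def parse_competencies(llm_response: str) -> list[str]:
    """Parse a bullet-list LLM response into a list of competency strings."""
    return [
        line.lstrip('-* ').strip()
        for line in llm_response.splitlines()
        if line.strip().startswith(('-', '*'))
    ]
```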
The system employs an advanced fine-tuning methodology, Direct Preference Optimization (DPO), to adapt the LLMs to the specific task of competency extraction. The fine-tuning process, located in `src/finetuning`, optimizes the models to enhance their performance across various document types and domains by learning from synthetic data. It is designed to run on the BW-UniCluster, a high-performance computing cluster; SLURM scripts for the setup and fine-tuning process are also located in `src/finetuning`. For more information on the fine-tuning process, refer to `src/finetuning/README.md`.
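DPO trains on preference pairs: for the same prompt, a response the model should prefer and one it should not. As a rough sketch of how a ranking of candidate competency profiles could be turned into such pairs (the names and the pairing scheme here are assumptions, not the repository's actual data pipeline):

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example: the model learns to prefer `chosen` over `rejected`."""
    prompt: str
    chosen: str    # higher-ranked competency profile
    rejected: str  # lower-ranked competency profile


def to_dpo_pairs(prompt: str, ranked_profiles: list[str]) -> list[PreferencePair]:
    """Expand a best-to-worst ranking of candidate profiles into pairwise preferences."""
    return [
        PreferencePair(prompt, better, worse)
        for i, better in enumerate(ranked_profiles)
        for worse in ranked_profiles[i + 1:]
    ]
```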
To validate the accuracy of the extracted profiles, the system incorporates both expert and automatic evaluation mechanisms. These evaluations compare the competencies extracted by the system against benchmarks set by human experts and automated systems to ensure reliability and accuracy. The evaluation scripts are found in `src/scripts/automatic_evaluation_correlation_analysis.py`.
For easy interpretation and analysis, the system generates visualizations and structured reports of the competency profiles. This functionality, designed to help users quickly understand and utilize the extracted data, is handled by templates in `src/templates` and generated by `src/logic/display.py`.
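Conceptually, report generation fills a template with a profile's competencies. The sketch below renders a profile as Markdown; the function name and data shape are assumptions for illustration, while the actual templates and rendering live in `src/templates` and `src/logic/display.py`.

```python
def render_profile_markdown(author: str, competencies: dict[str, str]) -> str:
    """Render a competency profile (competency -> short description) as Markdown."""
    lines = [f'# Competency Profile: {author}', '']
    for competency, description in competencies.items():
        lines.append(f'- **{competency}**: {description}')
    return '\n'.join(lines)
```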
To set up the system, follow these steps:

```bash
# Clone the repository
git clone https://github.com/BertilBraun/Master-Thesis.git Master-Thesis

# Navigate to the project directory
cd Master-Thesis

# Install the required dependencies
pip install -r requirements.txt
```
The system requires Python 3.11 or higher, as the codebase relies on modern type-annotation features.
Furthermore, to use the system, you need to add API keys for OpenAI and JsonBin.io to the `src/defines.py` file.
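The configuration might look roughly like the following; the exact variable names in `src/defines.py` may differ, so treat this as a placeholder sketch:

```python
# src/defines.py (sketch) -- variable names are assumptions, check the actual file.
OPENAI_API_KEY = 'sk-...'        # your OpenAI API key
JSONBIN_API_KEY = '$2a$10$...'   # your JsonBin.io master key
```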
For the fine-tuning process, you need access to the BW-UniCluster. Other high-performance computing clusters can be used as well, but the SLURM scripts might need to be adjusted accordingly. For a detailed guide on how to set up the fine-tuning process, refer to `src/finetuning/README.md`.
Detailed documentation of the system and its components is available within the repository. This includes the full text of the Master's Thesis, located at `documentation/Master_Thesis.pdf`, as well as additional supporting materials and references.
This project is licensed under the MIT License, which allows for extensive reuse and modification in academic and commercial projects.