My KIT Masters-Thesis on Competence Extraction using LLMs

BertilBraun/Master-Thesis

Domain-Agnostic Approaches to Competency Extraction via Large Language Models

Overview

This repository hosts the implementation of the Master's thesis "Domain-Agnostic Approaches to Competency Extraction via Large Language Models" by Bertil Braun, submitted to the Karlsruhe Institute of Technology (KIT). The thesis develops a system that uses Large Language Models (LLMs) to extract competencies from a variety of document types, improving on existing methods that struggle with unstructured data across diverse domains.

System Description

The competency extraction system is built on a multi-phase approach that includes selecting, fine-tuning, and evaluating LLMs across different types of documents to create accurate competency profiles. This section provides an overview of each major component of the system:

[Figure: System Overview]

Data Processing and Summarization

The process begins by collecting the input documents, which are mostly papers by authors from various domains. These documents are fetched and preprocessed in src/logic/papers.py, which extracts the relevant text and metadata.

Competency Extraction

The extracted summaries are then passed to LLMs, which identify the relevant competencies and build competency profiles. Each document is first profiled individually in src/extraction using three different extraction methods.
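To make the per-document profiling step more concrete, here is a minimal sketch of how it could be organized. All names (Competency, Profile, profile_documents, keyword_extractor) are hypothetical and are not taken from src/extraction; the keyword matcher is only a stand-in for an LLM call so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Competency:
    name: str         # short label for the competency
    description: str  # brief justification for the label


@dataclass(frozen=True)
class Profile:
    document_id: str
    method: str  # which extraction method produced this profile
    competencies: tuple[Competency, ...]


# An extractor maps one document summary to a list of competencies.
Extractor = Callable[[str], list[Competency]]


def keyword_extractor(summary: str) -> list[Competency]:
    """Placeholder for an LLM call: naive keyword matching (assumption)."""
    known = {
        "neural network": "Deep Learning",
        "slurm": "High-Performance Computing",
    }
    text = summary.lower()
    return [
        Competency(name=label, description=f"Summary mentions '{keyword}'")
        for keyword, label in known.items()
        if keyword in text
    ]


def profile_documents(
    summaries: dict[str, str],
    extractors: dict[str, Extractor],
) -> list[Profile]:
    """Build one profile per (document, extraction method) pair."""
    return [
        Profile(doc_id, method, tuple(extract(summary)))
        for doc_id, summary in summaries.items()
        for method, extract in extractors.items()
    ]
```

Passing several entries in the extractors dictionary yields one profile per method for every document, which keeps the outputs of the different extraction methods separate and comparable.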

Model Fine-Tuning

The system uses Direct Preference Optimization (DPO) to adapt the LLMs to the task of competency extraction. The fine-tuning code, located in src/finetuning, trains the models on synthetic data to improve their performance across document types and domains. It is designed to run on the BW-UniCluster, a high-performance computing cluster, and the SLURM scripts for setup and fine-tuning are also located in src/finetuning. For more information on the fine-tuning process, refer to src/finetuning/README.md.
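For readers unfamiliar with DPO, the following sketch computes the standard DPO loss for a single preference pair from summed log-probabilities. It only illustrates the objective and is not the training code in src/finetuning; the function name, its parameters, and the default beta of 0.1 are assumptions made for this example.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair of responses."""
    # Implicit rewards: beta-scaled log-ratios between policy and reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a numerically stable form for both signs of the margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, and grows when it favors the rejected one.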

Evaluation Framework

To validate the accuracy of the extracted profiles, the system combines expert and automatic evaluation. The extracted competencies are compared against benchmarks set by human experts and by automated systems to check their reliability and accuracy. The evaluation script is located at src/scripts/automatic_evaluation_correlation_analysis.py.

Visualization and Reporting

The system generates visualizations and structured reports of the competency profiles so that users can quickly understand and use the extracted data. The reports are generated by src/logic/display.py from the templates in src/templates.
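As an illustration of template-based reporting, the following sketch fills a small HTML template with a competency profile using only the standard library. The template text and the render_profile function are hypothetical and do not reflect the actual contents of src/templates or src/logic/display.py.

```python
from html import escape
from string import Template

# Hypothetical template; the real templates live in src/templates.
PROFILE_TEMPLATE = Template(
    "<section>\n<h2>$author</h2>\n<ul>\n$items\n</ul>\n</section>"
)


def render_profile(author: str, competencies: list[str]) -> str:
    """Render one competency profile as an HTML fragment."""
    # Escape user-provided text so that it cannot break the markup.
    items = "\n".join(f"<li>{escape(name)}</li>" for name in competencies)
    return PROFILE_TEMPLATE.substitute(author=escape(author), items=items)
```

Keeping the markup in a template and the data handling in a function mirrors the split between templates and display logic described above.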

Installation and Setup

To set up the system, follow these steps:

# Clone the repository
git clone https://github.com/BertilBraun/Master-Thesis.git Master-Thesis
# Navigate to the project directory
cd Master-Thesis
# Install the required dependencies
pip install -r requirements.txt

The system requires Python 3.11 or higher because of the type annotations used throughout the code.
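If you are unsure which interpreter is active, a quick check like the following can be run before installing the dependencies. It is a convenience sketch and not part of the repository; the names MIN_VERSION and python_is_supported are made up for this example.

```python
import sys

MIN_VERSION = (3, 11)  # minimum version stated in this README


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if not python_is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        raise SystemExit(f"Python 3.11 or higher is required, found {found}")
```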

To use the system, you also need to add API keys for OpenAI and JsonBin.io to the src/defines.py file.
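If you prefer not to store the keys in a tracked file, one option is to have the entries in src/defines.py read them from environment variables. The sketch below shows such a helper. The variable names OPENAI_API_KEY and JSONBIN_API_KEY and the function load_api_key are assumptions for illustration, and the names that src/defines.py actually expects may differ.

```python
import os
from collections.abc import Mapping


def load_api_key(env_var: str, environ: Mapping[str, str] = os.environ) -> str:
    """Read an API key from the environment and fail loudly if it is missing."""
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Missing API key: set the {env_var} environment variable")
    return key


# Hypothetical usage inside a settings module:
# OPENAI_API_KEY = load_api_key("OPENAI_API_KEY")
# JSONBIN_API_KEY = load_api_key("JSONBIN_API_KEY")
```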

For the fine-tuning process, you need to have access to the BW-UniCluster. Other high-performance computing clusters can be used as well, but the SLURM scripts might need to be adjusted accordingly. For a detailed guide on how to set up the fine-tuning process, refer to src/finetuning/README.md.

Documentation

Detailed documentation of the system and its components is available within the repository. This includes the full text of the Master's Thesis located under documentation/Master_Thesis.pdf, as well as additional supporting materials and references.

License

This project is licensed under the MIT License, which allows for extensive reuse and modification in academic and commercial projects.
