This repository hosts the implementation of the Master's Thesis titled "Domain-Agnostic Approaches to Competency Extraction via Large Language Models" by Bertil Braun, submitted to the Karlsruhe Institute of Technology (KIT). The thesis develops an innovative system utilizing Large Language Models (LLMs) to extract competencies from a variety of document types, improving upon existing methods that struggle with unstructured data across diverse domains.
The competency extraction system is built on a multi-phase approach that includes selecting, fine-tuning, and evaluating LLMs across different types of documents to create accurate competency profiles. This section provides an overview of each major component of the system:
The process begins with the collection of input documents, primarily papers written by authors from various domains. These documents are fetched and preprocessed in `src/logic/papers.py`, which includes the extraction of relevant text and metadata.
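The preprocessing step can be pictured as normalizing the fetched fields into a simple record before extraction. The `Paper` dataclass and `preprocess` function below are a hypothetical sketch; the actual fields and logic live in `src/logic/papers.py` and may differ.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Sketch of a preprocessed input document (hypothetical field names)."""
    title: str
    abstract: str
    full_text: str


def preprocess(raw_title: str, raw_abstract: str, raw_text: str) -> Paper:
    """Collapse whitespace in the fetched fields before competency extraction."""
    def clean(s: str) -> str:
        return ' '.join(s.split())

    return Paper(clean(raw_title), clean(raw_abstract), clean(raw_text))
```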
Extracted summaries are then processed to identify and extract competency profiles. This extraction is performed by LLMs designed to pull relevant competencies from the documents. Each document's competencies are initially profiled individually in `src/extraction` using three different extraction methods.
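At its core, an extraction method of this kind builds a prompt from the document text and parses the LLM's response into a list of competencies. The prompt wording and parsing below are an illustrative assumption, not the repository's actual prompts (which are in `src/extraction`):

```python
def build_extraction_prompt(document_text: str) -> str:
    # Hypothetical prompt shape; the real prompts differ per extraction method.
    return (
        'Extract the professional competencies demonstrated in the following '
        'document as a bullet list, one competency per line:\n\n'
        + document_text
    )


def parse_competencies(llm_response: str) -> list[str]:
    """Parse a bullet-list LLM response into a list of competency strings."""
    return [
        line.lstrip('-* ').strip()
        for line in llm_response.splitlines()
        if line.strip().startswith(('-', '*'))
    ]
```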
The system employs an advanced fine-tuning methodology, Direct Preference Optimization (DPO), to adapt the LLMs to the specific task of competency extraction. The fine-tuning process, located in `src/finetuning`, optimizes the models to enhance their performance across various document types and domains by learning from synthetic data. It is designed to run on the BW-UniCluster, a high-performance computing cluster; SLURM scripts for the setup and fine-tuning process are also located in `src/finetuning`. For more information on the fine-tuning process, refer to `src/finetuning/README.md`.
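DPO trains on preference pairs: for the same prompt, a response the model should prefer and one it should not. As a rough sketch of how a ranking of candidate competency profiles could be turned into such pairs (the names and the pairing scheme here are assumptions, not the repository's actual data pipeline):

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example: the model learns to prefer `chosen` over `rejected`."""
    prompt: str
    chosen: str    # higher-ranked competency profile
    rejected: str  # lower-ranked competency profile


def to_dpo_pairs(prompt: str, ranked_profiles: list[str]) -> list[PreferencePair]:
    """Expand a best-to-worst ranking of candidate profiles into pairwise preferences."""
    return [
        PreferencePair(prompt, better, worse)
        for i, better in enumerate(ranked_profiles)
        for worse in ranked_profiles[i + 1:]
    ]
```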
To validate the accuracy of the extracted profiles, the system incorporates both expert and automatic evaluation mechanisms. These evaluations compare the competencies extracted by the system against benchmarks set by human experts and automated systems to ensure reliability and accuracy. The evaluation scripts are found in `src/scripts/automatic_evaluation_correlation_analysis.py`.
For easy interpretation and analysis, the system generates visualizations and structured reports of the competency profiles. This functionality, designed to help users quickly understand and utilize the extracted data, is handled by templates in `src/templates` and generated by `src/logic/display.py`.
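Conceptually, report generation fills a template with a profile's competencies. The sketch below renders a profile as Markdown; the function name and data shape are assumptions for illustration, while the actual templates and rendering live in `src/templates` and `src/logic/display.py`.

```python
def render_profile_markdown(author: str, competencies: dict[str, str]) -> str:
    """Render a competency profile (competency -> short description) as Markdown."""
    lines = [f'# Competency Profile: {author}', '']
    for competency, description in competencies.items():
        lines.append(f'- **{competency}**: {description}')
    return '\n'.join(lines)
```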
To set up the system, follow these steps:

```bash
# Clone the repository
git clone https://github.com/BertilBraun/Master-Thesis.git Master-Thesis

# Navigate to the project directory
cd Master-Thesis

# Install the required dependencies
pip install -r requirements.txt
```
The system requires Python 3.11 or higher, as the codebase relies on modern type-annotation features.
Furthermore, to use the system, you need to add API keys for OpenAI and JsonBin.io to the `src/defines.py` file.
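The configuration might look roughly like the following; the exact variable names in `src/defines.py` may differ, so treat this as a placeholder sketch:

```python
# src/defines.py (sketch) -- variable names are assumptions, check the actual file.
OPENAI_API_KEY = 'sk-...'        # your OpenAI API key
JSONBIN_API_KEY = '$2a$10$...'   # your JsonBin.io master key
```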
For the fine-tuning process, you need access to the BW-UniCluster. Other high-performance computing clusters can be used as well, but the SLURM scripts might need to be adjusted accordingly. For a detailed guide on how to set up the fine-tuning process, refer to `src/finetuning/README.md`.
Detailed documentation of the system and its components is available within the repository. This includes the full text of the Master's Thesis, located at `documentation/Master_Thesis.pdf`, as well as additional supporting materials and references.
This project is licensed under the MIT License, which allows for extensive reuse and modification in academic and commercial projects.