In this legal tech research project, I combine methods from natural language processing (NLP), network theory, and machine learning to analyze German legal texts. The project has two principal components: the generation of a domain-specific knowledge graph, and the application of this graph within a Retrieval-Augmented Generation (RAG) system to enhance language model responses.
- Knowledge Graph Creation: The first phase processes legal texts into a knowledge graph. The graph captures key concepts and their interconnections, giving a detailed map of the concepts in the legal texts.
- Context-Enriched RAG System: The ultimate goal of the project is to use the knowledge graph in a RAG system. The graph enriches the context supplied to the language model so that responses are grounded in the domain-specific legal material, with the aim of more accurate and context-aware answers in legal text analysis.

Features:
- Semantic text splitting for in-depth analysis of legal texts.
- Interactive knowledge graph construction from legal documents.
- Integration with OpenAI's GPT-3.5 and GPT-4 for cutting-edge NLP capabilities.
- Network graph analysis and visualization for elucidating textual relationships.
- Use of vector space modeling for analyzing text and graph components.
- Contextual analysis techniques, including TF-IDF scoring and cosine similarity, for enhanced understanding.
- Application of the knowledge graph in a RAG system for context-rich language model responses.
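To make the TF-IDF and cosine-similarity item above concrete, here is a minimal standard-library sketch. It is not taken from the project's scripts: the function names, the whitespace tokenizer, and the smoothed IDF formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times smoothed inverse document frequency.
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example chunks, not taken from the project's data.
chunks = [
    "der mieter schuldet die miete",
    "der vermieter kann die miete erhöhen",
    "das gericht entscheidet über die klage",
]
vectors = tfidf_vectors(chunks)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

The first two chunks share several terms, so `sim_related` comes out higher than `sim_unrelated`. A term that appears in every chunk (here `die`) gets the lowest IDF weight and contributes least to the score.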
- Clone the Repository:
  git clone https://github.com/TilmanLudewigtHaufe/Graph_Augmented_RAG.git
  cd Graph_Augmented_RAG
- Set Up Environment:
  - Ensure Python 3.8+ is installed.
  - Install required packages:
    pip install -r requirements.txt
- Environment Variables:
  - Create a .env file in the root directory.
  - Add OPENAI_API_KEY=your_key_here.
- Data Preparation:
  - Place your text data (.txt) in the data_input directory.
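How the scripts read the .env file is not shown in this README; a library such as python-dotenv is a common choice. As an illustration of the environment-variable step above, the following sketch parses a .env file with the standard library only. The helper names `load_env_file` and `require_api_key` are hypothetical and do not come from the project.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines into os.environ without overriding existing values.

    Returns a dict of the pairs found in the file.
    """
    loaded = {}
    if not os.path.exists(path):
        return loaded
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded


def require_api_key(name="OPENAI_API_KEY"):
    """Return the variable's value, or fail early with a readable message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return value
```

Calling `load_env_file()` followed by `require_api_key()` at the start of a script surfaces a missing key before any request is sent to the API.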
Before generating the knowledge graph or running the graph-augmented RAG script, you need to choose and set the splitter.
The splitter determines how the text data is divided into distinct concepts for the knowledge graph. In the graph-augmented RAG script, it also breaks user queries down into manageable chunks for processing.
You can set the splitter in the KG_Creation.py and Graph_RAG_advanced_RAG.py scripts.
The choice of splitter can significantly affect the performance and results of the system, so choose one that suits your data and use case. The respective code sections provide three splitters with different approaches (ranging from simple to complex and custom), which can be adjusted to your needs.
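To illustrate what a splitter does, here is a sketch of two simple strategies: fixed-size character windows with overlap, and grouping of whole sentences. These are hypothetical stand-ins and not the three splitters shipped in the scripts; the function names and default sizes are assumptions.

```python
import re


def split_fixed(text, chunk_size=200, overlap=50):
    """Simple splitter: fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def split_sentences(text, max_chars=300):
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Naive boundary: whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The naive sentence boundary also splits after abbreviations that are frequent in German legal texts, such as "Abs." or "Nr.", which is one reason a custom splitter can be worth the extra effort for this domain.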
- Creating the Knowledge Graph:
  - Run the KG_Creation.py script to generate the knowledge graph from your text data:
    python KG_Creation.py
  - This process analyzes the text, constructs a graph of interconnected concepts, and saves the graph data.
- Visualizing the Knowledge Graph:
  - After running KG_Creation.py, open docs/index.html in a web browser.
  - This file contains an interactive visualization of the knowledge graph, allowing you to explore the relationships and structures within your data.
- Executing the Main Script:
  - To engage with the graph-augmented retrieval and generation capabilities, run the Graph_RAG_advanced_RAG.py script:
    python Graph_RAG_advanced_RAG.py
  - This script uses the knowledge graph to enhance text retrieval and generation, providing deeper insights and context for user queries.
- Interacting with the System:
  - Enter queries at the prompt within the Graph_RAG_advanced_RAG.py script interface.
  - Use commands like quit, q, or exit to end the interactive session.
- Analyzing Outputs:
  - The application provides responses based on the knowledge graph and the augmented retrieval mechanism, offering a rich and contextual understanding of the query topics.
  - Check the output on the console and any generated files in the data_output directory for detailed insights.
- Ensure that the KG_Creation.py script is executed before running Graph_RAG_advanced_RAG.py, as the latter depends on the knowledge graph generated by the former.
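The ordering requirement above can be checked before starting a session. The sketch below tests whether expected output files exist. It uses docs/index.html as the default signal because this README says that file is available after KG_Creation.py has run; which files Graph_RAG_advanced_RAG.py actually reads is not listed here, so treat the default path and the helper names as assumptions and adjust them to your setup.

```python
from pathlib import Path


def missing_graph_artifacts(paths=("docs/index.html",)):
    """Return the expected artifacts that do not exist yet."""
    return [str(p) for p in paths if not Path(p).exists()]


def ensure_graph_ready(paths=("docs/index.html",)):
    """Raise a readable error if the knowledge graph has not been generated."""
    missing = missing_graph_artifacts(paths)
    if missing:
        raise RuntimeError(
            "Run `python KG_Creation.py` first; missing: " + ", ".join(missing)
        )
```

Calling `ensure_graph_ready()` from the repository root before launching the RAG script turns a confusing downstream failure into an explicit instruction.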
This project is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This allows for non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
For more details, see the full text of the license.