RAG Core library

This repository contains the core of the STACKIT RAG template. It consists of the following Python packages:

  • rag-core-api (section 1)
  • admin-api-lib (section 2)
  • extractor-api-lib (section 3)
  • rag-core-lib (section 4)

With the exception of the RAG Core Lib, all of these packages contain an API definition and are easy to adjust for your specific use case. Each package defines its replaceable parts (see 1.3, 2.3 and 3.3 Replaceable Parts), the expected types, and offers a brief description.

ⓘ INFO: If you replace parts, it is important to keep the name of the component; otherwise the replacement logic will not work.
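
As an illustration: the packages wire their components together with dependency_injector containers, so replacing a part typically means overriding the provider of the same name. A minimal sketch, assuming the default dependency_container exposes providers under the names listed in the tables below; the container import path and MyEmbedder are placeholders, not part of the library:

```python
from dependency_injector import providers

from rag_core_api.dependency_container import DependencyContainer  # assumed import path
from my_project.embeddings import MyEmbedder                       # your own implementation

container = DependencyContainer()

# The provider must keep the name "embedder" (see section 1.3),
# otherwise the replacement logic will not pick it up.
container.embedder.override(providers.Singleton(MyEmbedder))
```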

This repository also contains a Dockerfile that is used to ensure proper linting and testing of the packages.

For an example of how to use the packages, please consult the use case example repository.

1. RAG Core API

The rag-core-api contains a default implementation of a RAG. For a default use case, no adjustments should be required.

The following endpoints are provided by the backend:

  • /chat/{session_id}: The endpoint for chatting.
  • /evaluate: Will start the evaluation of the RAG using the provided question-answer pairs.
  • /information_pieces/remove: Endpoint to remove documents from the vector database.
  • /information_pieces/upload: Endpoint to upload documents into the vector database. These documents need to have been parsed already. For simplicity, a LangChain Document-like format is used.

1.1 Requirements

All required Python libraries can be found in the pyproject.toml file. In addition to Python libraries, the following system packages are required:

build-essential
make

1.2 Endpoints

/chat/{session_id}

This endpoint is used for chatting.
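
A minimal sketch of calling this endpoint with Python's requests; the host, port and request body shown here are assumptions, the exact schema is defined by the rag-core-api API definition:

```python
# Illustrative only: host, port and the body shape are assumptions; the exact
# request schema is defined by the rag-core-api OpenAPI definition.
import requests

session_id = "my-session-1"
response = requests.post(
    f"http://localhost:8080/chat/{session_id}",
    json={"message": "What does the uploaded handbook say about vacation days?"},
    timeout=60,
)
response.raise_for_status()
print(response.json())
```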

/evaluate

Will start the evaluation of the RAG using the provided question-answer pairs. The file containing the dataset can be set via the RAGAS_DATASET_FILENAME environment variable; the default is test_data.json. The path can be either absolute or relative to the current working directory. By default, OpenAI is used for the evaluation. If you want to use the same LLM class for the evaluation as for the chat, set the environment variable RAGAS_USE_OPENAI to false and adjust the RAGAS_MODEL environment variable to the model name of your choice.
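
For illustration, the relevant environment variables could be set as follows (shown as Python for consistency; in a deployment they would normally be set in the container or Helm configuration):

```python
import os

# Illustrative values only.
os.environ["RAGAS_DATASET_FILENAME"] = "eval/my_qa_pairs.json"  # defaults to test_data.json
os.environ["RAGAS_USE_OPENAI"] = "false"  # evaluate with the chat LLM class instead of OpenAI
os.environ["RAGAS_MODEL"] = "my-model-name"
```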

📝 NOTE: Due to quality problems with open-source LLMs, it is recommended to use OpenAI for the evaluation.

/information_pieces/remove

Endpoint to remove documents from the vector database.

/information_pieces/upload

Endpoint to upload documents into the vector database. These documents need to have been parsed already. For simplicity, a LangChain Document-like format is used (see the sketch after the list below). Uploaded documents are required to contain the following metadata:

  • a document_url that points to a download link for the source document.
  • All documents of the type IMAGE additionally require the base64-encoded content of the image in the base64_image key.
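
A minimal sketch of such an upload, assuming a JSON list of LangChain-Document-like pieces with page_content and metadata fields; consult the rag-core-api API definition for the exact schema:

```python
# Illustrative only: the field names mirror LangChain's Document
# (page_content + metadata); the exact request schema is defined by the
# rag-core-api OpenAPI definition.
import base64
import requests

text_piece = {
    "page_content": "Chapter 1: Introduction ...",
    "metadata": {
        "document_url": "https://my-bucket.example.com/handbook.pdf",
        "type": "TEXT",
    },
}

with open("diagram.png", "rb") as image_file:
    image_piece = {
        "page_content": "Architecture diagram of the system.",
        "metadata": {
            "document_url": "https://my-bucket.example.com/handbook.pdf",
            "type": "IMAGE",
            "base64_image": base64.b64encode(image_file.read()).decode("utf-8"),
        },
    }

requests.post(
    "http://localhost:8080/information_pieces/upload",
    json=[text_piece, image_piece],
    timeout=60,
).raise_for_status()
```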

1.3 Replaceable parts

| Name | Type | Default | Notes |
|---|---|---|---|
| embedder | rag_core_api.embeddings.embedder.Embedder | Depends on your settings. Can be rag_core_api.impl.embeddings.langchain_community_embedder.LangchainCommunityEmbedder or rag_core_api.impl.embeddings.stackit_embedder.StackitEmbedder | Selected by EmbedderClassTypeSettings.embedder_type. |
| vector_database | rag_core_api.vector_databases.vector_database.VectorDatabase | rag_core_api.impl.vector_databases.qdrant_database.QdrantDatabase | |
| reranker | rag_core_api.reranking.reranker.Reranker | rag_core_api.impl.reranking.flashrank_reranker.FlashrankReranker | Used in the composed_retriever. |
| composed_retriever | rag_core_api.retriever.retriever.Retriever | rag_core_api.impl.retriever.composite_retriever.CompositeRetriever | Handles retrieval, re-ranking, etc. |
| large_language_model | langchain_core.language_models.llms.BaseLLM | langchain_community.llms.vllm.VLLMOpenAI, langchain_community.llms.Ollama or langchain_community.llms.FakeListLLM | The LLM used for all LLM tasks. The default depends on the value of rag_core_lib.impl.settings.rag_class_types_settings.RAGClassTypeSettings.llm_type. The FakeListLLM is used for testing. |
| prompt | str | rag_core_api.prompt_templates.answer_generation_prompt.ANSWER_GENERATION_PROMPT | The prompt used for answering the question. |
| rephrasing_prompt | str | rag_core_api.prompt_templates.question_rephrasing_prompt.ANSWER_REPHRASING_PROMPT | The prompt used for rephrasing the question. Both the rephrased and the original question are used for retrieval of the documents. |
| langfuse_manager | rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager | rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager | Retrieves additional settings, as well as the prompt, from Langfuse if available. |
| answer_generation_chain | rag_core_lib.chains.async_chain.AsyncChain[rag_core_api.impl.graph.graph_state.graph_state.AnswerGraphState, str] | rag_core_api.impl.answer_generation_chains.answer_generation_chain.AnswerGenerationChain | LangChain chain used for answering the question. Part of the chat_graph. |
| rephrasing_chain | rag_core_lib.chains.async_chain.AsyncChain[rag_core_api.impl.graph.graph_state.graph_state.AnswerGraphState, str] | rag_core_api.impl.answer_generation_chains.rephrasing_chain.RephrasingChain | LangChain chain used for rephrasing the question. Part of the chat_graph. |
| chat_graph | rag_core_api.graph.graph_base.GraphBase | rag_core_api.impl.graph.chat_graph.DefaultChatGraph | LangGraph graph that contains the entire logic for question answering. |
| traced_chat_graph | rag_core_lib.chains.async_chain.AsyncChain[Any, Any] | rag_core_lib.impl.tracers.langfuse_traced_chain.LangfuseTracedGraph | Wraps around the chat_graph and adds Langfuse tracing. |
| evaluator | rag_core_api.impl.evaluator.langfuse_ragas_evaluator.LangfuseRagasEvaluator | rag_core_api.impl.evaluator.langfuse_ragas_evaluator.LangfuseRagasEvaluator | The evaluator used in the /evaluate endpoint. |
| chat_endpoint | rag_core_api.api_endpoints.chat.Chat | rag_core_api.impl.api_endpoints.default_chat.DefaultChat | Implementation of the chat endpoint. The default implementation just calls the traced_chat_graph. |
| ragas_llm | langchain_core.language_models.chat_models.BaseChatModel | langchain_openai.ChatOpenAI or langchain_ollama.ChatOllama | The LLM used for the Ragas evaluation. |

2. Admin API Lib

The admin-api-lib contains all components required for file management in RAG systems and handles all document lifecycle operations. It also includes a pre-configured default dependency_container that should fit most use cases.

The following endpoints are provided by the admin-api-lib:

  • /delete_document/{identification}: Deletes the file from storage (if applicable) and vector database. The identification can be retrieved from the /all_documents_status endpoint.
  • /document_reference/{identification}: Returns the document.
  • /all_documents_status: Returns the identification and status of all available sources.
  • /upload_documents: Endpoint to upload files.
  • /load_confluence: Endpoint to load a Confluence space.

2.1 Requirements

All required Python libraries can be found in the pyproject.toml file. In addition to Python libraries, the following system packages are required:

build-essential
make

2.2 Endpoints

/delete_document/{identification}

Will delete the document from the connected storage system and send a request to the backend to delete all related documents from the vector database.
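
A minimal sketch, assuming the admin API is reachable locally and that the endpoint accepts HTTP DELETE; both are assumptions, so check the admin-api-lib API definition:

```python
# Illustrative only: host, port and HTTP method are assumptions; the
# identification value comes from /all_documents_status.
import requests

identification = "handbook.pdf"
requests.delete(
    f"http://localhost:8081/delete_document/{identification}",
    timeout=30,
).raise_for_status()
```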

/document_reference/{identification}

Will return the source document stored in the connected storage system.

ⓘ INFO: Confluence pages are not stored in the connected storage system. They are only stored in the vector database and can't be retrieved using this endpoint.

/all_documents_status

Will return a list of all sources for the chat and their current status.
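
A minimal sketch of querying the status, with host and port as assumptions:

```python
# Illustrative only: host and port are assumptions.
import requests

status = requests.get("http://localhost:8081/all_documents_status", timeout=30)
status.raise_for_status()
print(status.json())  # identification and status of every known source
```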

/upload_documents

Files can be uploaded here. This endpoint will process the document in the background and extract information using the document-extractor. The extracted information will be summarized using an LLM. The summary, as well as the unrefined extracted document, will be uploaded to the rag-core-api.
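
A minimal sketch of such an upload, assuming a standard multipart form upload; the host, port and field name are assumptions, so check the admin-api-lib API definition for the exact contract:

```python
# Illustrative only: host, port and the multipart field name are assumptions.
import requests

with open("handbook.pdf", "rb") as pdf_file:
    response = requests.post(
        "http://localhost:8081/upload_documents",
        files={"file": ("handbook.pdf", pdf_file, "application/pdf")},
        timeout=120,
    )
response.raise_for_status()
# Extraction and summarization run in the background; poll /all_documents_status
# to follow the processing status.
```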

/load_confluence

Loads all the content of a Confluence space using the document-extractor. The extracted information will be summarized using an LLM. The summary, as well as the unrefined extracted documents, will be uploaded to the rag-core-api.

2.3 Replaceable parts

| Name | Type | Default | Notes |
|---|---|---|---|
| file_service | admin_api_lib.file_services.file_service.FileService | admin_api_lib.impl.file_services.s3_service.S3Service | Handles operations on the connected storage. |
| large_language_model | langchain_core.language_models.llms.BaseLLM | langchain_community.llms.vllm.VLLMOpenAI or langchain_community.llms.Ollama | The LLM used for all LLM tasks. The default depends on the value of rag_core_lib.impl.settings.rag_class_types_settings.RAGClassTypeSettings.llm_type. |
| key_value_store | admin_api_lib.impl.key_db.file_status_key_value_store.FileStatusKeyValueStore | admin_api_lib.impl.key_db.file_status_key_value_store.FileStatusKeyValueStore | Used for storing the available sources and their current state. |
| chunker | admin_api_lib.impl.chunker.chunker.Chunker | admin_api_lib.impl.chunker.text_chunker.TextChunker | Used for splitting the documents into manageable chunks. |
| document_extractor | admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi | admin_api_lib.extractor_api_client.openapi_client.api.extractor_api.ExtractorApi | Needs to be replaced if adjustments to the extractor-api are made. |
| rag_api | admin_api_lib.rag_backend_client.openapi_client.api.rag_api.RagApi | admin_api_lib.rag_backend_client.openapi_client.api.rag_api.RagApi | Needs to be replaced if changes to the /information_pieces/remove or /information_pieces/upload endpoints of the rag-core-api are made. |
| summarizer_prompt | str | admin_api_lib.prompt_templates.summarize_prompt.SUMMARIZE_PROMPT | The prompt used for the summarization. |
| langfuse_manager | rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager | rag_core_lib.impl.langfuse_manager.langfuse_manager.LangfuseManager | Retrieves additional settings, as well as the prompt, from Langfuse if available. |
| summarizer | admin_api_lib.summarizer.summarizer.Summarizer | admin_api_lib.impl.summarizer.langchain_summarizer.LangchainSummarizer | Creates the summaries. |
| untraced_information_enhancer | admin_api_lib.information_enhancer.information_enhancer.InformationEnhancer | admin_api_lib.impl.information_enhancer.general_enhancer.GeneralEnhancer | Uses the summarizer to enhance the extracted documents. |
| information_enhancer | rag_core_lib.chains.async_chain.AsyncChain[Any, Any] | rag_core_lib.impl.tracers.langfuse_traced_chain.LangfuseTracedGraph | Wraps around the untraced_information_enhancer and adds Langfuse tracing. |
| document_deleter | admin_api_lib.api_endpoints.document_deleter.DocumentDeleter | admin_api_lib.impl.api_endpoints.default_document_deleter.DefaultDocumentDeleter | Handles deletion of sources. |
| documents_status_retriever | admin_api_lib.api_endpoints.documents_status_retriever.DocumentsStatusRetriever | admin_api_lib.impl.api_endpoints.default_documents_status_retriever.DefaultDocumentsStatusRetriever | Handles returning the status of the sources. |
| confluence_loader | admin_api_lib.api_endpoints.confluence_loader.ConfluenceLoader | admin_api_lib.impl.api_endpoints.default_confluence_loader.DefaultConfluenceLoader | Handles data loading and extraction from Confluence. |
| document_reference_retriever | admin_api_lib.api_endpoints.document_reference_retriever.DocumentReferenceRetriever | admin_api_lib.impl.api_endpoints.default_document_reference_retriever.DefaultDocumentReferenceRetriever | Handles returning files from the connected storage. |
| document_uploader | admin_api_lib.api_endpoints.document_uploader.DocumentUploader | admin_api_lib.impl.api_endpoints.default_document_uploader.DefaultDocumentUploader | Handles upload and extraction of files. |

3. Extractor API Lib

The extractor-api-lib contains components that provide document parsing capabilities for various file formats. It also includes a pre-configured default dependency_container that is a good starting point for most use cases. This API should not be exposed via ingress and should only be used internally.

The following endpoints are provided by the extractor-api-lib:

  • /extract_from_file: This endpoint extracts the information from files.
  • /extract_from_confluence: This endpoint extracts the information from a Confluence space.

3.1 Requirements

All required Python libraries can be found in the pyproject.toml file. In addition to Python libraries, the following system packages are required:

build-essential
make
ffmpeg
poppler-utils
tesseract-ocr
tesseract-ocr-deu
tesseract-ocr-eng

3.2 Endpoints

/extract_from_file

This endpoint will extract information from PDF, PPTX, Word and XML files. It will load the files from the connected storage. The following types of information will be extracted:

  • TEXT: plain text
  • TABLE: data in tabular form found in the document

/extract_from_confluence

This endpoint will extract the information from a Confluence space. The following types of information will be extracted:

  • TEXT: plain text

3.3 Replaceable parts

| Name | Type | Default | Notes |
|---|---|---|---|
| file_service | extractor_api_lib.file_services.file_service.FileService | extractor_api_lib.file_services.s3_service.S3Service | Handles operations on the connected storage. |
| database_converter | extractor_api_lib.document_parser.table_converters.dataframe_converter.DataframeConverter | extractor_api_lib.document_parser.table_converters.dataframe2markdown.DataFrame2Markdown | Converts extracted tables from pandas.DataFrame to Markdown. If you want the tables to have another format, this needs to be adjusted. |
| pdf_extractor | extractor_api_lib.document_parser.information_extractor.InformationExtractor | extractor_api_lib.document_parser.pdf_extractor.PDFExtractor | Extractor used for extracting information from PDF documents. |
| ms_docs_extractor | extractor_api_lib.document_parser.information_extractor.InformationExtractor | extractor_api_lib.document_parser.ms_docs_extractor.MSDocsExtractor | Extractor used for extracting information from Microsoft documents like *.docx, etc. |
| xml_extractor | extractor_api_lib.document_parser.information_extractor.InformationExtractor | extractor_api_lib.document_parser.xml_extractor.XMLExtractor | Extractor used for extracting content from XML documents. |
| all_extractors | dependency_injector.providers.List[extractor_api_lib.document_parser.information_extractor.InformationExtractor] | dependency_injector.providers.List(pdf_extractor, ms_docs_extractor, xml_extractor) | List of all available extractors. If you add a new type of extractor, you have to add it to this list. |
| general_extractor | extractor_api_lib.document_parser.information_extractor.InformationExtractor | extractor_api_lib.document_parser.general_extractor.GeneralExtractor | Combines multiple extractors and decides which one to use for the given file format. |
| file_extractor | extractor_api_lib.api_endpoints.file_extractor.FileExtractor | extractor_api_lib.impl.api_endpoints.default_file_extractor.DefaultFileExtractor | Implementation of the /extract_from_file endpoint. Uses the general_extractor. |
| confluence_extractor | extractor_api_lib.api_endpoints.confluence_extractor.ConfluenceExtractor | extractor_api_lib.impl.api_endpoints.default_confluence_extractor.DefaultConfluenceExtractor | Implementation of the /extract_from_confluence endpoint. |

4. RAG Core Lib

The rag-core-lib contains components of the rag-core-api that are also useful for other services and are therefore packaged in a way that makes them easy to reuse. Examples of included components:

  • tracing for LangChain chains using Langfuse
  • settings for multiple LLMs and Langfuse
  • factory for LLMs
  • ContentType enum of the documents
  • ...

4.1 Requirements

All required Python libraries can be found in the pyproject.toml file. In addition to Python libraries, the following system packages are required:

build-essential
make
