upd: docker deployment + chain links + DocumentedRunnable
AlexisVLRT committed Mar 6, 2024
1 parent 0eeef24 commit 9490ae2
Showing 33 changed files with 947 additions and 919 deletions.
53 changes: 48 additions & 5 deletions README.md
@@ -1,15 +1,19 @@
# skaff-rag-accelerator

This is a starter kit to prototype locally, deploy on any cloud, and industrialize a Retrieval-Augmented Generation (RAG) service.

This is a starter kit to deploy a modularizable RAG locally or on the cloud (or across multiple clouds)

## Features

- A configurable RAG setup based around Langchain ([Check out the configuration cookbook here](https://artefactory.github.io/skaff-rag-accelerator/cookbook/))
- `RAG` and `RagConfig` python classes to help you set things up
- `RAG` and `RagConfig` python classes that manage components (vector store, LLM, retriever, ...); see the sketch after this list
- A REST API based on Langserve + FastAPI to provide easy access to the RAG as a web backend
- Optional API plugins for secure user authentication, session management, ...
- A `Chain links` primitive that facilitates chain building and allows documentation generation
- A Streamlit demo app to serve as a basic working frontend
- `Dockerfiles` and `docker-compose` to make deployments easier and more flexible
- A document loader for the RAG
- Optional plugins for secure user authentication and session management
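
To give a feel for how these pieces fit together, here is a minimal Python sketch of building the chain, mirroring the `backend/main.py` snippet further down in this diff. The `backend.rag` import path, the config path, and the invocation keys are assumptions (the keys follow the `QuestionWithHistory` model added in this commit), so adjust them to your checkout:

```python
from pathlib import Path

from backend.rag import RAG  # assumed import path; adjust to wherever RAG is defined

# Build the RAG from the YAML config (LLM, vector store, embedding model, ...)
rag = RAG(config=Path("backend") / "config.yaml")

# get_chain() wraps the full question-answering chain (a DocumentedRunnable in this commit)
chain = rag.get_chain()

# Query it like any LangChain runnable; input keys follow the QuestionWithHistory model below
print(chain.invoke({"question": "What does this starter kit provide?", "chat_history": ""}))
```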


## Quickstart

@@ -25,14 +29,38 @@ Duration: ~15 minutes.
- A few GB of disk space to host the models
- Tested with python 3.11 (may work with other versions)

### Run using docker compose

If you have Docker installed and running, you can run the whole RAG app with it. [Otherwise, skip to the "Run directly" section](#run-directly)

Start the LLM server:
```python
```shell
ollama run tinyllama
```

Start the service:
```shell
docker compose up -d
```

Make sure both the frontend and the backend are alive:
```shell
docker ps
```
You should see two containers with status `Up X minutes`.

Go to http://localhost:9000/ to query your RAG.

### Run directly

Start the LLM server:
```shell
ollama run tinyllama
```

In a fresh virtual environment, install the dependencies:
```shell
pip install -r requirements.txt
pip install -r requirements-dev.txt
```

You will need to set some env vars, either in a .env file at the project root, or just by exporting them like so:
@@ -51,6 +79,8 @@ Start the frontend demo
python -m streamlit run frontend/front.py
```

### Querying and loading the RAG

You should then be able to log in and chat with the bot:

![](docs/login_and_chat.gif)
@@ -75,3 +105,16 @@ Or serve them locally:
mkdocs serve
```
Then go to http://localhost:8000/


## Architecture

The whole goal of this repo is to decouple the "computing and LLM querying" part from the "rendering a user interface" part. We do this with a typical 3-tier architecture.

![](docs/3t_architecture.png)

- The [frontend](frontend.md) is the end-user-facing part. It reaches out to the backend **ONLY** through the REST API (a hypothetical client call is sketched after this section). We provide a frontend demo here for convenience, but ultimately it could live in a completely different repo and be written in a completely different language.
- The [backend](backend/backend.md) provides a REST API that abstracts the RAG functionalities. It handles calls to LLMs, tracks conversations and users, manages state using a database, and much more. To get the gist of the backend, look at the documentation of the API: http://0.0.0.0:8000/docs. It can be extended by plugins.
- The [database](database.md) is only accessed by the backend and persists the state of the RAG application. The same plugins that extend the functionalities of the backend can also extend the data model of the DB.

The structure of the repo mirrors this architecture.
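
To make the frontend-to-backend contract concrete, here is a hypothetical client call in Python. It assumes the backend is served by Langserve on port 8000 with its default `/invoke` route, that no authentication plugin is active, and that the payload follows the `QuestionWithHistory` schema added in this commit; adjust all of these to your deployment:

```python
import requests

# Hypothetical backend URL; Langserve's add_routes exposes /invoke, /batch and /stream by default.
BACKEND_URL = "http://localhost:8000"

# Input keys follow the QuestionWithHistory model added in this commit.
payload = {"input": {"question": "What does the accelerator do?", "chat_history": ""}}

response = requests.post(f"{BACKEND_URL}/invoke", json=payload, timeout=60)
response.raise_for_status()

# Langserve wraps the chain result in an "output" field.
print(response.json()["output"])
```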
24 changes: 24 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Run as non-root user for security
RUN useradd -m user
RUN chown -R user:user /app
USER user


ENV PORT=8000
ENV ADMIN_MODE=0
ENV PYTHONPATH=.
ENV DATABASE_URL=sqlite:///db/rag.sqlite3


EXPOSE $PORT

COPY . ./backend

CMD python -m uvicorn backend.main:app --host 0.0.0.0 --port $PORT
3 changes: 3 additions & 0 deletions backend/api_plugins/__init__.py
@@ -0,0 +1,3 @@
from backend.api_plugins.insecure_authentication.insecure_authentication import insecure_authentication_routes
from backend.api_plugins.secure_authentication.secure_authentication import authentication_routes
from backend.api_plugins.sessions.sessions import session_routes
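
These plugins are plain functions that register extra routes on the FastAPI app. Below is a minimal wiring sketch mirroring the lines this commit removes from `backend/main.py`; whether `session_routes` still takes the `authentication` dependency is taken from that removed code and may differ in later versions:

```python
from fastapi import FastAPI

from backend.api_plugins import authentication_routes, session_routes

app = FastAPI(title="RAG Accelerator", description="A RAG-based question answering API")

# Register the optional plugins: secure authentication first, then sessions tied to it.
auth = authentication_routes(app)
session_routes(app, authentication=auth)
```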
2 changes: 0 additions & 2 deletions backend/config.py
@@ -22,8 +22,6 @@ class VectorStoreConfig:
source: VectorStore | str
source_config: dict

retriever_search_type: str
retriever_config: dict
insertion_mode: str # "None", "full", "incremental"

@dataclass
5 changes: 1 addition & 4 deletions backend/config.yaml
@@ -3,6 +3,7 @@ LLMConfig: &LLMConfig
source_config:
model: tinyllama
temperature: 0
# base_url: http://host.docker.internal:11434 # Uncomment this line if you are running the RAG through Docker Compose

VectorStoreConfig: &VectorStoreConfig
source: Chroma
@@ -11,10 +12,6 @@ VectorStoreConfig: &VectorStoreConfig
collection_metadata:
hnsw:space: cosine

retriever_search_type: similarity_score_threshold
retriever_config:
k: 20
score_threshold: 0.5
insertion_mode: null

EmbeddingModelConfig: &EmbeddingModelConfig
4 changes: 1 addition & 3 deletions backend/main.py
@@ -9,7 +9,7 @@
# Initialize a RAG as described in the config.yaml file
# https://artefactory.github.io/skaff-rag-accelerator/backend/rag_ragconfig/
rag = RAG(config=Path(__file__).parent / "config.yaml")
chain = rag.get_chain(memory=True)
chain = rag.get_chain()


# Create a minimal RAG server based on langserve
@@ -19,6 +19,4 @@
title="RAG Accelerator",
description="A RAG-based question answering API",
)
auth = authentication_routes(app)
session_routes(app, authentication=auth)
add_routes(app, chain)
54 changes: 0 additions & 54 deletions backend/rag_components/chain.py

This file was deleted.

@@ -0,0 +1,27 @@
"""This chain answers the provided question based on documents it retreives and the conversation history"""
from langchain_core.retrievers import BaseRetriever
from pydantic import BaseModel
from backend.rag_components.chain_links.rag_basic import rag_basic
from backend.rag_components.chain_links.condense_question import condense_question

from backend.rag_components.chain_links.documented_runnable import DocumentedRunnable
from backend.rag_components.chain_links.retrieve_and_format_docs import fetch_docs_chain


class QuestionWithHistory(BaseModel):
question: str
chat_history: str


class Response(BaseModel):
response: str


def answer_question_from_docs_and_history_chain(llm, retriever: BaseRetriever) -> DocumentedRunnable:
reformulate_question = condense_question(llm)
answer_question = rag_basic(llm, retriever)

chain = reformulate_question | answer_question
typed_chain = chain.with_types(input_type=QuestionWithHistory, output_type=Response)

return DocumentedRunnable(typed_chain, chain_name="Answer question from docs and history", user_doc=__doc__)
38 changes: 38 additions & 0 deletions backend/rag_components/chain_links/condense_question.py
@@ -0,0 +1,38 @@
"""This chain condenses the chat history and the question into one standalone question."""
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel

from backend.rag_components.chain_links.documented_runnable import DocumentedRunnable


class QuestionWithChatHistory(BaseModel):
question: str
chat_history: str


class StandaloneQuestion(BaseModel):
standalone_question: str


prompt = """
<s>[INST] <<SYS>>
Given the conversation history and the following question, can you rephrase the user's question in its original language so that it is self-sufficient. You are presented with a conversation that may contain some spelling mistakes and grammatical errors, but your goal is to understand the underlying question. Make sure to avoid the use of unclear pronouns.
If the question is already self-sufficient, return the original question. If it seems the user is authorizing the chatbot to answer without specific context, make sure to reflect that in the rephrased question.
<</SYS>>
Chat history: {chat_history}
Question: {question}
[/INST]
""" # noqa: E501


def condense_question(llm) -> DocumentedRunnable:
condense_question_prompt = PromptTemplate.from_template(prompt) # chat_history, question

standalone_question = condense_question_prompt | llm | StrOutputParser()

typed_chain = standalone_question.with_types(input_type=QuestionWithChatHistory, output_type=StandaloneQuestion)
return DocumentedRunnable(typed_chain, chain_name="Condense question and history", prompt=prompt, user_doc=__doc__)