upd: docker deployment + chain links + DocumentedRunnable
AlexisVLRT committed Mar 6, 2024
1 parent 0eeef24 commit 9490ae2
Showing 33 changed files with 947 additions and 919 deletions.
53 changes: 48 additions & 5 deletions README.md
@@ -1,15 +1,19 @@
# skaff-rag-accelerator

This is a starter kit to prototype locally, deploy on any cloud, and industrialize a Retrieval-Augmented Generation (RAG) service.

This is a starter kit to deploy a modularizable RAG locally or on the cloud (or across multiple clouds)

## Features

- A configurable RAG setup based around Langchain ([Check out the configuration cookbook here](https://artefactory.github.io/skaff-rag-accelerator/cookbook/))
- `RAG` and `RagConfig` python classes to help you set things up
- `RAG` and `RagConfig` python classes that manage components (vector store, LLM, retriever, ...); see the sketch after this list
- A REST API based on Langserve + FastAPI to provide easy access to the RAG as a web backend
- Optional API plugins for secure user authentication, session management, ...
- A `Chain links` primitive that facilitates chain building and allows documentation generation
- A Streamlit demo app to serve as a basic working frontend
- `Dockerfiles` and `docker-compose` to make deployments easier and more flexible
- A document loader for the RAG
- Optional plugins for secure user authentication and session management
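
To give a feel for how these pieces fit together, here is a minimal Python sketch of building the chain, mirroring the `backend/main.py` snippet further down in this diff. The `backend.rag` import path, the config path, and the invocation keys are assumptions (the keys follow the `QuestionWithHistory` model added in this commit), so adjust them to your checkout:

```python
from pathlib import Path

from backend.rag import RAG  # assumed import path; adjust to wherever RAG is defined

# Build the RAG from the YAML config (LLM, vector store, embedding model, ...)
rag = RAG(config=Path("backend") / "config.yaml")

# get_chain() wraps the full question-answering chain (a DocumentedRunnable in this commit)
chain = rag.get_chain()

# Query it like any LangChain runnable; input keys follow the QuestionWithHistory model below
print(chain.invoke({"question": "What does this starter kit provide?", "chat_history": ""}))
```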


## Quickstart

@@ -25,14 +29,38 @@ Duration: ~15 minutes.
- A few GB of disk space to host the models
- Tested with python 3.11 (may work with other versions)

### Run using docker compose

If you have Docker installed and running, you can run the whole RAG app with it. [Otherwise, skip to the "Run directly" section](#run-directly)

Start the LLM server:
```python
```shell
ollama run tinyllama
```

Start the service:
```shell
docker compose up -d
```

Make sure both the frontend and the backend are alive:
```shell
docker ps
```
You should see two containers with status `Up X minutes`.

Go to http://localhost:9000/ to query your RAG.

### Run directly

Start the LLM server:
```shell
ollama run tinyllama
```

In a fresh virtual environment, install the dependencies:
```shell
pip install -r requirements.txt
pip install -r requirements-dev.txt
```

You will need to set some env vars, either in a .env file at the project root, or just by exporting them like so:
@@ -51,6 +79,8 @@ Start the frontend demo
python -m streamlit run frontend/front.py
```

### Querying and loading the RAG

You should then be able to log in and chat with the bot:

![](docs/login_and_chat.gif)
@@ -75,3 +105,16 @@ Or serve them locally:
mkdocs serve
```
Then go to http://localhost:8000/


## Architecture

The whole goal of this repo is to decouple the "computing and LLM querying" part from the "rendering a user interface" part. We do this with a typical 3-tier architecture.

![](docs/3t_architecture.png)

- The [frontend](frontend.md) is the end-user-facing part. It reaches out to the backend **ONLY** through the REST API (a hypothetical client call is sketched after this section). We provide a frontend demo here for convenience, but ultimately it could live in a completely different repo and be written in a completely different language.
- The [backend](backend/backend.md) provides a REST API that abstracts the RAG functionalities. It handles calls to LLMs, tracks conversations and users, manages state using a database, and much more. To get the gist of the backend, look at the documentation of the API: http://0.0.0.0:8000/docs. It can be extended by plugins.
- The [database](database.md) is only accessed by the backend and persists the state of the RAG application. The same plugins that extend the functionalities of the backend can also extend the data model of the DB.

The structure of the repo mirrors this architecture.
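
To make the frontend-to-backend contract concrete, here is a hypothetical client call in Python. It assumes the backend is served by Langserve on port 8000 with its default `/invoke` route, that no authentication plugin is active, and that the payload follows the `QuestionWithHistory` schema added in this commit; adjust all of these to your deployment:

```python
import requests

# Hypothetical backend URL; Langserve's add_routes exposes /invoke, /batch and /stream by default.
BACKEND_URL = "http://localhost:8000"

# Input keys follow the QuestionWithHistory model added in this commit.
payload = {"input": {"question": "What does the accelerator do?", "chat_history": ""}}

response = requests.post(f"{BACKEND_URL}/invoke", json=payload, timeout=60)
response.raise_for_status()

# Langserve wraps the chain result in an "output" field.
print(response.json()["output"])
```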
24 changes: 24 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,24 @@
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Run as non-root user for security
RUN useradd -m user
RUN chown -R user:user /app
USER user


ENV PORT=8000
ENV ADMIN_MODE=0
ENV PYTHONPATH=.
ENV DATABASE_URL=sqlite:///db/rag.sqlite3


EXPOSE $PORT

COPY . ./backend

CMD python -m uvicorn backend.main:app --host 0.0.0.0 --port $PORT
3 changes: 3 additions & 0 deletions backend/api_plugins/__init__.py
@@ -0,0 +1,3 @@
from backend.api_plugins.insecure_authentication.insecure_authentication import insecure_authentication_routes
from backend.api_plugins.secure_authentication.secure_authentication import authentication_routes
from backend.api_plugins.sessions.sessions import session_routes
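
These plugins are plain functions that register extra routes on the FastAPI app. Below is a minimal wiring sketch mirroring the lines this commit removes from `backend/main.py`; whether `session_routes` still takes the `authentication` dependency is taken from that removed code and may differ in later versions:

```python
from fastapi import FastAPI

from backend.api_plugins import authentication_routes, session_routes

app = FastAPI(title="RAG Accelerator", description="A RAG-based question answering API")

# Register the optional plugins: secure authentication first, then sessions tied to it.
auth = authentication_routes(app)
session_routes(app, authentication=auth)
```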
2 changes: 0 additions & 2 deletions backend/config.py
@@ -22,8 +22,6 @@ class VectorStoreConfig:
source: VectorStore | str
source_config: dict

retriever_search_type: str
retriever_config: dict
insertion_mode: str # "None", "full", "incremental"

@dataclass
5 changes: 1 addition & 4 deletions backend/config.yaml
@@ -3,6 +3,7 @@ LLMConfig: &LLMConfig
source_config:
model: tinyllama
temperature: 0
# base_url: http://host.docker.internal:11434 # Uncomment this line if you are running the RAG through Docker Compose

VectorStoreConfig: &VectorStoreConfig
source: Chroma
@@ -11,10 +12,6 @@ VectorStoreConfig: &VectorStoreConfig
collection_metadata:
hnsw:space: cosine

retriever_search_type: similarity_score_threshold
retriever_config:
k: 20
score_threshold: 0.5
insertion_mode: null

EmbeddingModelConfig: &EmbeddingModelConfig
4 changes: 1 addition & 3 deletions backend/main.py
@@ -9,7 +9,7 @@
# Initialize a RAG as described in the config.yaml file
# https://artefactory.github.io/skaff-rag-accelerator/backend/rag_ragconfig/
rag = RAG(config=Path(__file__).parent / "config.yaml")
chain = rag.get_chain(memory=True)
chain = rag.get_chain()


# Create a minimal RAG server based on langserve
@@ -19,6 +19,4 @@
title="RAG Accelerator",
description="A RAG-based question answering API",
)
auth = authentication_routes(app)
session_routes(app, authentication=auth)
add_routes(app, chain)
54 changes: 0 additions & 54 deletions backend/rag_components/chain.py

This file was deleted.

@@ -0,0 +1,27 @@
"""This chain answers the provided question based on documents it retreives and the conversation history"""
from langchain_core.retrievers import BaseRetriever
from pydantic import BaseModel
from backend.rag_components.chain_links.rag_basic import rag_basic
from backend.rag_components.chain_links.condense_question import condense_question

from backend.rag_components.chain_links.documented_runnable import DocumentedRunnable
from backend.rag_components.chain_links.retrieve_and_format_docs import fetch_docs_chain


class QuestionWithHistory(BaseModel):
question: str
chat_history: str


class Response(BaseModel):
response: str


def answer_question_from_docs_and_history_chain(llm, retriever: BaseRetriever) -> DocumentedRunnable:
reformulate_question = condense_question(llm)
answer_question = rag_basic(llm, retriever)

chain = reformulate_question | answer_question
typed_chain = chain.with_types(input_type=QuestionWithHistory, output_type=Response)

return DocumentedRunnable(typed_chain, chain_name="Answer question from docs and history", user_doc=__doc__)
38 changes: 38 additions & 0 deletions backend/rag_components/chain_links/condense_question.py
@@ -0,0 +1,38 @@
"""This chain condenses the chat history and the question into one standalone question."""
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel

from backend.rag_components.chain_links.documented_runnable import DocumentedRunnable


class QuestionWithChatHistory(BaseModel):
question: str
chat_history: str


class StandaloneQuestion(BaseModel):
standalone_question: str


prompt = """
<s>[INST] <<SYS>>
Given the conversation history and the following question, can you rephrase the user's question in its original language so that it is self-sufficient. You are presented with a conversation that may contain some spelling mistakes and grammatical errors, but your goal is to understand the underlying question. Make sure to avoid the use of unclear pronouns.
If the question is already self-sufficient, return the original question. If it seems the user is authorizing the chatbot to answer without specific context, make sure to reflect that in the rephrased question.
<</SYS>>
Chat history: {chat_history}
Question: {question}
[/INST]
""" # noqa: E501


def condense_question(llm) -> DocumentedRunnable:
condense_question_prompt = PromptTemplate.from_template(prompt) # chat_history, question

standalone_question = condense_question_prompt | llm | StrOutputParser()

typed_chain = standalone_question.with_types(input_type=QuestionWithChatHistory, output_type=StandaloneQuestion)
return DocumentedRunnable(typed_chain, chain_name="Condense question and history", prompt=prompt, user_doc=__doc__)