This is a demo of the Vision RAG (V-RAG) architecture.
The V-RAG architecture utilizes a vision language model (VLM) to embed pages of PDF files (or any other document) as vectors directly, without the tedious chunking process.
Check out the background blog post: https://softlandia.fi/en/blog/building-a-rag-tired-of-chunking-maybe-vision-is-all-you-need
- The pages of a PDF file are converted to images
  - In theory these images could be anything, but the current demo uses PDF files since the underlying model has been trained on PDF files
  - `pypdfium` is used to convert the PDF pages to images
- The images are passed through a VLM to get the embeddings
  - ColPali is used as the VLM in this demo
- The embeddings are stored in a vector database
  - Qdrant is used as the vector database in this demo
- The user passes a query to the V-RAG system
- The query is passed through the VLM to get the query embedding
- The query embedding is used to search the vector database for similar embeddings
- The user query and the images of the best matches from the search are then passed to a model that can understand images
  - GPT-4o or GPT-4o-mini is used in this demo
- The model generates a response based on the query and the images (a minimal code sketch of the indexing and retrieval steps follows this list)
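As a rough illustration of the pipeline above, here is a minimal sketch in Python. It assumes the `pypdfium2`, `colpali-engine`, `torch`, and `qdrant-client` packages, the `vidore/colpali-v1.2` checkpoint, and a Qdrant multivector collection scored with MaxSim; the actual `main.py` may differ in model version, collection settings, and batching.

```python
# Minimal sketch of the V-RAG indexing and retrieval flow (not the actual main.py).
# Assumed pieces: the pypdfium2 package, colpali-engine's ColPali/ColPaliProcessor
# interface, the vidore/colpali-v1.2 checkpoint, and Qdrant's multivector MaxSim support.
import pypdfium2 as pdfium
import torch
from colpali_engine.models import ColPali, ColPaliProcessor
from qdrant_client import QdrantClient, models

# 1. Convert the PDF pages to images
pdf = pdfium.PdfDocument("example.pdf")
images = [pdf[i].render(scale=2.0).to_pil() for i in range(len(pdf))]

# 2. Embed the page images with ColPali (one multi-vector embedding per page)
model_name = "vidore/colpali-v1.2"  # assumed checkpoint, not necessarily the demo's
model = ColPali.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="cuda")
processor = ColPaliProcessor.from_pretrained(model_name)

with torch.no_grad():
    page_batch = processor.process_images(images).to(model.device)
    page_embeddings = model(**page_batch)  # shape: (num_pages, num_tokens, dim)

# 3. Store the multi-vector embeddings in an in-memory Qdrant collection
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="pages",
    vectors_config=models.VectorParams(
        size=page_embeddings.shape[-1],
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)
client.upsert(
    collection_name="pages",
    points=[
        models.PointStruct(
            id=page_number,
            vector=embedding.float().cpu().numpy().tolist(),
            payload={"page": page_number},
        )
        for page_number, embedding in enumerate(page_embeddings)
    ],
)

# 4. Embed the user query the same way and retrieve the best-matching pages
with torch.no_grad():
    query_batch = processor.process_queries(["What is the revenue forecast?"]).to(model.device)
    query_embedding = model(**query_batch)[0]

hits = client.query_points(
    collection_name="pages",
    query=query_embedding.float().cpu().numpy().tolist(),
    limit=3,
)
best_pages = [images[hit.payload["page"]] for hit in hits.points]
```

In the demo this logic sits behind the HTTP endpoints served on Modal (`POST /collections` for indexing, `POST /search` for retrieval and answering).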
Make sure you have a Hugging Face account and that you are logged into Hugging Face using `transformers-cli login`.
For the OpenAI API, you need an API key. You can get one here: https://platform.openai.com/account/api-keys

You can place the keys in the dotenv (`.env`) file:
OPENAI_API_KEY=
HF_TOKEN=
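How the demo actually reads these keys is up to `main.py`; as a hedged illustration, this is what loading them with the `python-dotenv` package would look like (Modal can also pick up a `.env` file through its secrets mechanism):

```python
# Sketch: load the keys from .env into the environment with python-dotenv.
# The demo itself may load them differently (e.g. via Modal secrets).
import os

from dotenv import load_dotenv

load_dotenv()  # reads OPENAI_API_KEY and HF_TOKEN from the .env file

assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
assert os.getenv("HF_TOKEN"), "HF_TOKEN is not set"
```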
Then, you can run the demo by following these steps:
- Install Python 3.11 or higher
- Install Modal: `pip install modal`
- Set up Modal: `modal setup`
- Serve the app: `modal serve .\main.py`
- Open your browser, go to the URL provided by Modal, and append `/docs` to the URL
- Click on the `POST /collections` endpoint
- Click on the `Try it out` button
- Upload a PDF file
- Click on the `Execute` button
This will index the PDF file into the in-memory vector database. This will take some time depending on the size of the PDF file and the GPU you are using in Modal. The current demo uses an A10G GPU.
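Under the hood, `modal serve .\main.py` serves an app roughly shaped like the sketch below. The app name, image packages, and handler bodies are placeholders, not the actual `main.py`:

```python
# Rough sketch of a Modal app exposing the two endpoints on an A10G GPU.
# App name, image packages, and handler bodies are placeholders.
import modal

image = modal.Image.debian_slim().pip_install(
    "fastapi", "pypdfium2", "colpali-engine", "qdrant-client", "openai"
)
app = modal.App("vision-rag-demo")  # hypothetical name


@app.function(image=image, gpu="A10G", secrets=[modal.Secret.from_dotenv()])
@modal.asgi_app()
def web():
    from fastapi import FastAPI, UploadFile

    api = FastAPI()

    @api.post("/collections")
    async def create_collection(file: UploadFile):
        # Placeholder: render the pages, embed them with ColPali,
        # and upsert the multivectors into Qdrant.
        return {"status": "indexed", "filename": file.filename}

    @api.post("/search")
    async def search(query: str):
        # Placeholder: embed the query, retrieve the top pages,
        # and ask GPT-4o(-mini) for an answer.
        return {"answer": "...", "query": query}

    return api
```

With `modal serve`, code changes are live-reloaded and Modal prints the temporary URL used in the steps above.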
You can now search for similar pages using the `POST /search` endpoint. The endpoint sends the page images and the query to the OpenAI API and returns the response.
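The call behind that endpoint might look roughly like the following, a sketch assuming the `openai` Python SDK, base64-encoded PNG images of the retrieved pages, and the `gpt-4o-mini` model; the demo's actual prompt and parameters may differ.

```python
# Sketch: send the user query plus the best-matching page images to GPT-4o-mini
# and return its answer. Model name, prompt, and image handling are assumptions.
import base64
import io

from openai import OpenAI


def answer_with_pages(query: str, page_images) -> str:
    """page_images: PIL images of the top retrieved pages."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    content = [{"type": "text", "text": query}]
    for img in page_images:
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```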
You can also use the frontend to interact with the API. To set up the frontend for local development, follow these steps:
- Install Node.js
- `cd frontend`
- Modify your `.env.development` file and add your `VITE_BACKEND_URL` (see the example after these steps)
- Run `npm install`
- Run `npm run dev`
This will start the frontend on http://localhost:5173
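For reference, the `.env.development` file only needs the backend URL that `modal serve` prints; the value below is a placeholder, not a real URL:

```
# frontend/.env.development -- placeholder, use the URL Modal prints for you
VITE_BACKEND_URL=<your Modal backend URL>
```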
You can deploy the demo to Modal using the following steps:
- Modify your `.env.production` file in the `frontend` dir and add your `VITE_BACKEND_URL` for the production environment
- Build the frontend with `npm run build`
  - this will create a `dist` folder with the frontend bundle
- Deploy the backend with `modal deploy .\main.py`