diff --git a/README.md b/README.md
index ca95336..736bd94 100644
--- a/README.md
+++ b/README.md
@@ -26,6 +26,20 @@ including text, images, and tables. It utilizes a retriever to store and manage
 
 ![alt text](https://blog.langchain.dev/content/images/size/w1600/2023/10/image-22.png)
 
+- **Option 1**: This option involves retrieving the raw image directly from the dataset and combining it with the raw table and text data. The combined raw data is then processed by a Multimodal LLM to generate an answer. This approach uses the complete, unprocessed image data in conjunction with textual information.
+  - Ingestion : Multimodal embeddings
+  - RAG chain : Multimodal LLM
+
+- **Option 2**: In this option, instead of using the raw image, an image summary is retrieved. This summary, along with the raw table and text data, is fed into a Text LLM to generate an answer.
+  - Ingestion : Multimodal LLM (for summarization) + Text embeddings
+  - RAG chain : Text LLM
+
+- **Option 3**: This option also retrieves an image summary, but unlike Option 2, it passes the raw image to a Multimodal LLM for synthesis along with the raw table and text data.
+  - Ingestion : Multimodal LLM (for summarization) + Text embeddings
+  - RAG chain : Multimodal LLM
+
+For all options, we can choose to treat tables as text or images.
+
 ### RAG Option 1
 
 Folder: [backend/rag_1](backend/rag_1)
@@ -56,14 +70,14 @@ Folder: [backend/rag_2](backend/rag_2)
 Method:
 
 - Use a multimodal LLM (such as GPT-4V, LLaVA, or FUYU-8b) to produce text summaries from images.
-- Embed and retrieve text.
-- Pass text chunks to a text LLM for answer synthesis.
+- Embed and retrieve image summaries and text chunks.
+- Pass image summaries and text chunks to a text LLM for answer synthesis.
 Backend:
 
 - Use the [multi-vector retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector) with [Chroma](https://www.trychroma.com/) to store raw text (or tables) and images along with their summaries for retrieval.
-- Use GPT-4V for image summarization (for retrieval)
+- Use GPT-4V for image summarization.
 - Use GPT-4 for final answer synthesis from join review of image summaries and texts (or tables).
 
 Parameters: