Traditional LLMs have a limited scope: their outputs depend only on the data they were originally trained on. If a user's query depends on a specific piece of information that lies outside that general training data, the output is unreliable and can lead to wrong decisions. Retrieval-Augmented Generation (RAG) is a great way to overcome this issue, and it is considerably more efficient than alternatives like fine-tuning. With the arrival of multimodal LLMs, RAG is no longer limited to text documents and can incorporate other media such as images. In this project we aim to develop a topic-specific chatbot with additional context using multimodal RAG. It will accept text and image inputs and give reliable output grounded in the extra data it is fed.
- Mohammed Bhadsorawala
- Kshitij Shah
- Tvisha Vedant
- Ninad Shegokar
- Vaishvi Khandelwal
- Tanish Bhamare
- Python
- PyTorch
- Streamlit
- LangChain
- FAISS
- Document Processing: The input document (text, PDFs, or other formats) is first divided into smaller, manageable chunks using a chunker.
- Embedding Creation: These chunks are then converted into numerical representations called embeddings and stored in a vector database for easy retrieval (see the ingestion sketch after this list).
- Query Matching: When a user submits a query, the system searches the vector database for the closest matching chunk based on the stored embeddings.
- Context Retrieval: The relevant chunk is passed to a prompt composer, which adds additional context to form a detailed prompt.
- LLM Response: The composed prompt is then sent to the LLM (Large Language Model), which uses the context to generate a tailored and informative response (see the query-time sketch after this list).
This flow ensures that the system retrieves specific, relevant information, making the responses more accurate and context-aware.
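The ingestion stage (chunking, embedding, and storage) can be sketched with the stack listed above: LangChain's text splitter, a sentence-transformers embedding model, and a FAISS index. This is only a minimal illustration, not the project's exact code; the model name, chunk sizes, index directory, and function name are assumptions, and the imports follow the pre-0.2 LangChain layout, so they may need adjusting for newer releases.

```python
# Ingestion sketch: chunk a document, embed the chunks, and store them in FAISS.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

def build_vector_store(raw_text: str, index_dir: str = "faiss_index") -> FAISS:
    # 1. Chunking: split the document into overlapping pieces small enough to embed.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(raw_text)

    # 2. Embedding: convert each chunk into a dense vector (model name is an assumption).
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # 3. Storage: index the vectors in FAISS and persist them for later retrieval.
    store = FAISS.from_texts(chunks, embeddings)
    store.save_local(index_dir)
    return store
```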
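Query time mirrors the remaining steps: the question is matched against the stored embeddings, the closest chunks are pulled from FAISS, and a prompt composer wraps them around the question before the LLM is called. In this sketch, `call_llm` is a placeholder for whichever chat model the app actually uses, and the prompt wording and `k=3` are assumptions.

```python
# Query-time sketch: retrieve the closest chunks, compose a prompt, and ask the LLM.
from langchain_community.vectorstores import FAISS

PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in the actual chat-model call used by the app.
    raise NotImplementedError

def answer_query(store: FAISS, question: str, k: int = 3) -> str:
    # Query matching: find the chunks whose embeddings are closest to the query.
    docs = store.similarity_search(question, k=k)

    # Context retrieval / prompt composition: stitch the chunks into the prompt.
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)

    # LLM response: the composed prompt goes to the language model.
    return call_llm(prompt)
```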
• For PDFs, text and images are extracted, chunked, tokenized, and saved separately. The retrieval process from the database is the same as for text inputs (see the PDF sketch below).
• Images are processed through multiple models: YOLO for object detection, Places365 for background analysis, and OpenCV for dominant color extraction. The LLM uses this context to generate a response (see the image sketch below).
• For web scraping, text and images are extracted, chunked, and tokenized using BeautifulSoup. They are stored separately, and the same retrieval process as with text is applied to generate responses (see the scraping sketch below).
• For videos, frames are extracted at a rate of 1 frame per second (adjustable, but limited due to computational constraints) and stored in a directory. Each video is processed in two parts (see the video sketch below):
  - Visual (Images): Frames are processed in the same way as standalone images.
  - Audio: The audio track is transcribed using the Whisper model. The description of each frame is generated from both the current and the previous frame, ensuring continuity and context.
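A PDF ingestion pass like the one described above could look as follows. It uses PyMuPDF (`fitz`), which is a common choice but an assumption here, as are the function name and output paths; the extracted text then goes through the same chunk → embed → index path as plain text, and the saved images go to the image pipeline.

```python
# PDF sketch: pull text and embedded images out of a PDF so they can be
# chunked/indexed (text) or described by the image pipeline (images).
import os
import fitz  # PyMuPDF

def extract_pdf(path: str, image_dir: str = "pdf_images") -> str:
    os.makedirs(image_dir, exist_ok=True)
    doc = fitz.open(path)
    pages_text = []

    for page_number, page in enumerate(doc):
        # Plain text of the page, ready for the text chunker.
        pages_text.append(page.get_text())

        # Embedded images are saved separately for the image pipeline.
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            info = doc.extract_image(xref)
            out_path = os.path.join(image_dir, f"page{page_number}_{img_index}.{info['ext']}")
            with open(out_path, "wb") as f:
                f.write(info["image"])

    return "\n".join(pages_text)
```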
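The image context can be approximated with the Ultralytics YOLO API and an OpenCV k-means pass for the dominant color; the Places365 scene classifier is omitted here for brevity. The weights file name, number of color clusters, and output format are assumptions.

```python
# Image sketch: detect objects with YOLO and estimate the dominant color with
# OpenCV k-means; the resulting string becomes extra context for the LLM.
import cv2
import numpy as np
from ultralytics import YOLO

def describe_image(path: str) -> str:
    # Object detection (the weights file name is an assumption).
    model = YOLO("yolov8n.pt")
    result = model(path)[0]
    objects = sorted({model.names[int(c)] for c in result.boxes.cls})

    # Dominant color via k-means clustering over the pixels.
    image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    pixels = image.reshape(-1, 3).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
    _, labels, centers = cv2.kmeans(pixels, 3, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)
    dominant = centers[np.bincount(labels.flatten()).argmax()].astype(int)

    return f"objects: {', '.join(objects)}; dominant RGB color: {tuple(dominant)}"
```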
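Static pages can be scraped roughly as below with requests and BeautifulSoup; dynamic, JavaScript-rendered sites (one of the challenges noted later) would need a headless browser instead. The timeout value and tag filtering are assumptions.

```python
# Scraping sketch: pull visible text and image URLs from a static web page.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_page(url: str):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop script/style tags so only readable text reaches the chunker.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())

    # Collect absolute image URLs for the image pipeline.
    image_urls = [urljoin(url, img["src"]) for img in soup.find_all("img") if img.get("src")]
    return text, image_urls
```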
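A frame-plus-audio pass in the spirit of the video description above might look like this: OpenCV samples roughly one frame per second, and Whisper transcribes the audio track. The frame naming scheme, Whisper model size, and FPS fallback are assumptions; describing each frame with reference to the previous one is handled afterwards by the image pipeline.

```python
# Video sketch: sample one frame per second with OpenCV and transcribe the
# audio track with Whisper; frames are then described like standalone images.
import os
import cv2
import whisper

def process_video(path: str, frame_dir: str = "frames", frame_rate: int = 1):
    os.makedirs(frame_dir, exist_ok=True)
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS metadata is missing
    step = max(1, int(round(fps / frame_rate)))

    saved, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:  # keep roughly `frame_rate` frames per second
            out_path = os.path.join(frame_dir, f"frame_{index:06d}.jpg")
            cv2.imwrite(out_path, frame)
            saved.append(out_path)
        index += 1
    cap.release()

    # Whisper extracts and transcribes the audio directly from the video file.
    transcript = whisper.load_model("base").transcribe(path)["text"]
    return saved, transcript
```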
demo_vid.mp4
edit-demo-3.1.mp4
- Extracting Information from Images: Accurately retrieving meaningful data from images required using multiple models for object detection, background analysis, and color identification.
- Combining Audio and Visual Contexts: Synchronizing audio and visual inputs in videos was complex, as it required maintaining context and continuity between both sources.
- Scraping Dynamic Websites: Extracting data from dynamic websites posed challenges due to constantly changing content, structure, and technical barriers like asynchronous loading.
- Enhanced Image Retrieval: Improve accuracy in retrieving relevant images through better object detection and context extraction.
- Audio Input: Allow users to interact with the model using audio inputs, expanding beyond text and images.
- Text-to-Speech Output: Enable the model to read responses aloud, creating a more conversational experience.
- Voice Interaction: Facilitate full voice-based conversations by combining audio input and text-to-speech capabilities.
- Image Generation: Implement image generation based on user input, adding dynamic and interactive visual responses.
- We are extremely grateful to our mentors, Kshitij, Mohammed, and Tvisha.
- Project X under Community of Coders (COC) at VJTI, Mumbai.