This project implements an Agentic Retrieval-Augmented Generation (RAG) system that allows users to retrieve answers from uploaded PDFs, specified website URLs, or a combination of both. The system uses an intelligent agent to decide whether a query can be answered based on the provided sources or needs to fall back on online searches.
- PDF Retrieval: Upload PDF files and extract information for question answering.
- Website Retrieval: Provide URLs to extract and use content for answering queries.
- Combined Query Handling: Simultaneously process PDFs and URLs to retrieve answers.
- Agent Logic:
  - First checks whether the answer exists in the uploaded PDFs.
  - If not found, checks the website content.
  - If unavailable in both, declares the question outside the RAG database and does not answer it from the indexed sources.
- Fallback Search: If no relevant information is found in the provided data and online search is enabled, a DuckDuckGo search is used to retrieve relevant context.
- Streamlit: User interface.
- PyPDF2: Extract text from PDF files.
- BeautifulSoup: Parse and clean website content.
- OpenAI API: Generate embeddings and answer questions.
- Qdrant: Vector database for semantic search.
- DuckDuckGo Search: Online search fallback for out-of-database queries.
- Clone the Repository:
      git clone https://github.com/rajveersinghcse/Agentic_RAG
      cd Agentic_RAG
- Install Dependencies:
      pip install -r requirements.txt
- Set Up Qdrant:
  - Download and install Qdrant.
  - Start Qdrant on http://localhost:6333.
- Run the Application:
      streamlit run app.py
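Before launching the app, it can help to confirm that Qdrant is actually listening. The following is a minimal sketch, not part of the repository: the helper name `qdrant_is_reachable` is hypothetical, and it assumes that the REST root of a default local Qdrant install answers a plain GET with HTTP 200. It uses only the standard library.

```python
import urllib.error
import urllib.request


def qdrant_is_reachable(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on base_url answers with status 200.

    Assumption: an unauthenticated local Qdrant answers its root URL with 200.
    """
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or malformed URL.
        return False
```

Calling `qdrant_is_reachable()` with no arguments checks the default address used in the steps above. A `False` result most likely means Qdrant has not been started yet.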
- Configure API Key:
  - Enter your OpenAI API Key in the designated input field in the app.
- Upload Data:
  - Upload PDF files or provide website URLs (comma-separated).
  - Optionally, enable crawling to extract content from all linked pages.
- Process Data:
  - Click "Process and Index Documents" to generate embeddings and store them in the Qdrant database.
- Ask Questions:
  - Enter your question in the input field.
  - The agent determines the source of the answer:
    - Retrieves from the PDFs if the answer is present.
    - Falls back to the website content if it is not in the PDFs.
    - If neither has it, performs an online search (optional) or states that the question is outside the RAG database.
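The upload and processing steps above can be pictured roughly as follows. This is an illustrative sketch under stated assumptions, not the app's actual code: `parse_urls` and `chunk_text` are hypothetical helpers, and the fixed-size character window with overlap is a simple stand-in for whatever splitter from `langchain_text_splitters` the project configures. The chunk size and overlap values are placeholders.

```python
from typing import List
from urllib.parse import urlparse


def parse_urls(raw: str) -> List[str]:
    """Split a comma-separated string into unique http(s) URLs, keeping input order."""
    urls: List[str] = []
    for part in raw.split(","):
        candidate = part.strip()
        parsed = urlparse(candidate)
        # Keep only well-formed http/https URLs and skip duplicates.
        if parsed.scheme in ("http", "https") and parsed.netloc and candidate not in urls:
            urls.append(candidate)
    return urls


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters sharing `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and stored in Qdrant. Overlapping windows reduce the chance that a sentence relevant to a question is cut in half at a chunk boundary.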
- PDF Search: If the answer is found in the uploaded PDFs, it is retrieved and displayed.
- Website Search: If the answer is not in PDFs, it searches through the provided website content.
- Fallback Search: If neither source contains the answer, the question is identified as outside the RAG database; if online search is enabled, the agent searches online for context instead.
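The priority order described above (PDFs, then websites, then fallback) could be sketched like this. It is a hedged illustration only: the `Hit` type, the `route` function, and the 0.75 similarity threshold are assumptions made for the example, and the actual agent may decide relevance differently, for instance by asking the language model.

```python
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    score: float  # similarity score; higher means more relevant
    text: str     # retrieved chunk


def best_hit(hits: List[Hit], threshold: float) -> Optional[Hit]:
    """Return the highest-scoring hit at or above threshold, or None."""
    relevant = [h for h in hits if h.score >= threshold]
    return max(relevant, key=lambda h: h.score) if relevant else None


def route(
    pdf_hits: List[Hit],
    web_hits: List[Hit],
    threshold: float = 0.75,  # hypothetical cutoff, not taken from the project
    allow_online_search: bool = False,
) -> Tuple[str, Optional[str]]:
    """Pick the answer source in priority order: PDF, website, then fallback."""
    hit = best_hit(pdf_hits, threshold)
    if hit is not None:
        return "pdf", hit.text
    hit = best_hit(web_hits, threshold)
    if hit is not None:
        return "website", hit.text
    if allow_online_search:
        return "online_search", None
    return "outside_rag_database", None
```

Note that a relevant PDF hit wins even when a website hit scores higher, which mirrors the stated preference for PDFs over website content.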
- OpenAI API Key: Required for embeddings and question-answering models.
- Qdrant: Must be running locally or configured to a remote host in the code.
- Python 3.8+ (tested with 3.12.7)
- Valid OpenAI API Key
- Running instance of Qdrant
streamlit
PyPDF2
beautifulsoup4
qdrant-client
litellm
duckduckgo_search
langchain_text_splitters
Install all dependencies with:
pip install -r requirements.txt
The agent processes both and prioritizes the PDFs. If the answer is not in PDFs, it checks the websites.
No. If the answer isn't in the PDFs or URLs, the agent either performs an online search (if enabled) or states that it can't answer.
Ensure Qdrant is properly installed and running on localhost:6333 before indexing documents.