Lamatic Crawler App is a Next.js application designed to crawl a sitemap and index its pages. It features a chatbot that allows users to ask questions based on the crawled data. It performs Retrieval-Augmented Generation (RAG) to answer user questions based on the indexed data. This application leverages OpenAI's API and Weaviate for data indexing and retrieval.
- Node.js and npm
- Python 3.12 or later
- Weaviate API Key
- OpenAI API Key
- Google GEMINI API (FREE)
- Sitemap URL crawling and indexing.
- Interactive chatbot for querying the indexed data.
- Integration with OpenAI API for natural language processing.
- Integration with Google Gemini for natural language processing.
- Easy setup and configuration with environment variables.
To get started with the Lamatic Crawler App, follow these steps:
-
Clone the repository:
git clone https://github.com/Swag19602/lamatic-crawler-app cd lamatic-crawler-app
-
Install dependencies:
npm install
-
Create and configure the environment file:
Create a
.env.local
file in the root directory and add the following environment variables:SUPABASE_KEY=your_supabase_key WEAVECLIENT_HOST=your_weaveclient_host WEAVECLIENT_KEY=your_weaveclient_key SUPABASE_URL=your_supabase_url OPENAI_API_KEY=your_openapi_key
-
Create a virtual environment and install dependencies:
cd gradio python3 -m venv venv source venv/bin/activate pip install -r requirements.txt
-
Create and configure the environment file:
Create a
.env.local
file in the root directory and add the following environment variables:WEAVECLIENT_HOST=your_weaveclient_host WEAVECLIENT_KEY=your_weaveclient_key OPENAI_API_KEY=your_openapi_key GEMINI_API_KEY=your_gemini_key
-
Run the Gradio app:
python app.py
-
Run the development server:
npm run dev
-
Open the application:
Visit
http://localhost:3000
in your web browser. -
Crawl a sitemap:
Enter the sitemap URL into the application to begin crawling and indexing the pages.
-
Use the chatbot:
You will be navigated to to chatbot interface once crawling and indexing is successfull, you can use the chatbot to ask questions based on the indexed data.
- Ensure the URLs in your crawledData.json are accessible and contain meaningful content for better query results.
- Adjust the Gradio app and indexing script as needed to fit your specific requirements and environment.
- Environment Variables:
GEMINI_API_KEY
: Your Google GEMINI API key for natural language processing. (FREE)OPENAI_API_KEY
: Your OpenAI API key for natural language processing.WEAVIATE_HOST
: Host address for the Weaviate instance used for data indexin- `WEAVECLIENT_KEY
: API key for accessing Weaviate.SUPABASE_KEY
: API key for accessing Supabase.SUPABASE_URL
: URL for the Supabase instance.
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository.
- Create a new branch (
git checkout -b feature-branch
). - Make your changes and commit them (
git commit -m 'Add new feature'
). - Push to the branch (
git push origin feature-branch
). - Open a Pull Request.
For questions or suggestions, please contact:
- Name: Swagatam Bhattacharjee
- GitHub: https://github.com/Swag19602/
- LinkedIn: https://www.linkedin.com/in/swagatam-bhattacharjee-5aa00a1b2/
- Portfolio: https://swag-portfolio.vercel.app/
- Email: swagatambhattacharjee02@gmail.com
A demo video of the Lamatic Crawler App can be found : https://www.loom.com/share/0056c702d9474f5c8fa577ed72c407ad?sid=c851e3fc-0945-4c4c-970f-626d2c6165bc