A modern, scalable, and extensible web crawler designed for efficient distributed crawling and data extraction. Built using asynchronous I/O, robust logging, plugin architecture, and distributed task processing with Celery.
- Asynchronous Crawling: Utilizes aiohttp and asyncio for non-blocking network I/O, allowing concurrent requests and improved performance (a minimal sketch follows this list).
- Distributed Task Processing: Integrated with Celery and Redis to distribute crawling tasks across multiple workers and machines.
- Database Persistence: Uses PostgreSQL (via SQLAlchemy) to store crawled pages efficiently with indexing and transactional support.
- Robust Logging: A dedicated logging framework supports console output, rotating file logs, and SQLite logging for persistent diagnostics.
- Plugin Architecture: Easily extend the crawler with custom plugins for processing, filtering, or transforming crawled data without modifying core code.
- URL Normalization: Enhanced URL normalization removes trailing slashes and sorts query parameters to avoid duplicate processing.
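To illustrate the non-blocking fetch pattern the crawler relies on, here is a minimal, self-contained sketch using aiohttp and asyncio. It is not the project's spider.py; the function names, timeout, and concurrency limit are illustrative assumptions.

```python
# Minimal sketch of concurrent fetching with aiohttp + asyncio.
# Not the project's spider.py -- names, timeout, and limits are illustrative.
import asyncio
from typing import List

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Issue a non-blocking GET and return the response body as text.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return await resp.text()


async def fetch_all(urls: List[str], concurrency: int = 8) -> List[str]:
    # A semaphore caps the number of requests in flight at once.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com"]))
    print(len(pages[0]))
```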
The project is organized into modular components, each handling a specific aspect of the crawler:
- domain.py: Extracts and processes domain information from URLs.
- link_finder.py: Parses HTML using BeautifulSoup with the lxml parser and extracts hyperlinks.
- utils.py: Provides URL normalization and logging initialization utilities (see the normalization sketch after this list).
- storage.py: Manages database connectivity and persistence using SQLAlchemy and PostgreSQL.
- spider.py: Implements the core asynchronous crawler that fetches pages, processes content, and enqueues discovered links.
- plugin.py: Contains the plugin interface and a manager for registering and running custom plugins.
- tasks.py: Defines Celery tasks to enable distributed crawling across workers.
- main.py: Serves as the entry point for local crawling execution.
- config.yaml / config.py: Handles configuration and environment variable overrides.
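As a rough illustration of the kind of normalization described above (trailing-slash removal and query-parameter sorting), the snippet below uses only the standard library; the function name and exact rules are assumptions rather than the project's utils.py implementation.

```python
# Illustrative URL normalization: strip trailing slashes and sort query params.
# The function name and exact rules are assumptions, not the project's utils.py.
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url(url: str) -> str:
    parts = urlparse(url)
    # Drop a trailing slash from the path ("/docs/" -> "/docs").
    path = parts.path.rstrip("/")
    # Sort query parameters so "?b=2&a=1" and "?a=1&b=2" map to the same URL.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse((parts.scheme, parts.netloc, path, parts.params, query, parts.fragment))


print(normalize_url("https://example.com/docs/?b=2&a=1"))
# -> https://example.com/docs?a=1&b=2
```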
Make sure you have the following installed:
- Python 3.8+
- PostgreSQL
- Redis
- Clone the repo:

  ```bash
  git clone https://github.com/roshanlam/spider.git
  cd spider
  ```

- Create a Virtual Environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure PostgreSQL and Redis: Ensure your PostgreSQL database and Redis server are running, then update config.yaml with your database URL and Redis broker settings.
The crawler is configured via the config.yaml file. Here is an example configuration:
```yaml
threads: 8
rate_limit: 1  # seconds between requests
user_agent: "MyCrawler/1.0"
timeout: 10
start_url: "https://example.com"

database:
  url: "postgresql://user:password@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
```
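For reference, here is a minimal sketch of how such a file could be loaded with environment-variable overrides, in the spirit of config.py; the environment variable names and the load_config helper are hypothetical and may not match the project's actual code.

```python
# Hypothetical config loader: values from config.yaml, overridable via
# environment variables. Names here are illustrative, not the actual config.py.
import os

import yaml  # PyYAML


def load_config(path: str = "config.yaml") -> dict:
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    # Environment variables take precedence over values in the file.
    cfg["database"]["url"] = os.getenv("DATABASE_URL", cfg["database"]["url"])
    cfg["celery"]["broker_url"] = os.getenv("CELERY_BROKER_URL", cfg["celery"]["broker_url"])
    return cfg


if __name__ == "__main__":
    config = load_config()
    print(config["start_url"])
```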
To start the crawler locally:
```bash
python -m spider.main
```
This will initialize the crawler, load the configured start URL, and begin asynchronous crawling.
- Start the Celery Worker:

  ```bash
  celery -A spider.tasks.celery_app worker --loglevel=info
  ```
- Dispatch a Crawl Task: In a separate terminal, from the root of the project, run run_crawler.py. Make sure to change the URL in the file to the one you want to queue with the Celery workers (a rough sketch of such a script is shown below).

  ```bash
  python run_crawler.py
  ```
This will distribute the crawl task across available Celery workers.
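For orientation, run_crawler.py might look roughly like the sketch below. The task name crawl and its signature are assumptions made for illustration; check spider/tasks.py for the actual task names.

```python
# Hypothetical run_crawler.py: enqueue a crawl for one URL on the Celery broker.
# The task name (crawl) and its signature are assumptions -- see spider/tasks.py.
from spider.tasks import crawl

if __name__ == "__main__":
    # .delay() serializes the call and pushes it onto the Redis broker; any
    # worker started with `celery -A spider.tasks.celery_app worker` picks it up.
    result = crawl.delay("https://example.com")
    print(f"Queued crawl task: {result.id}")
```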
The crawler supports custom plugins to extend functionality. Plugins can be used to process, filter, or extract additional data from crawled pages. Read the Plugin.md file for more info.
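As a rough illustration only (the real interface lives in plugin.py and is documented in Plugin.md; the class and method names below are assumptions), a plugin might look something like this:

```python
# Illustrative plugin shape -- class and method names are assumptions;
# see plugin.py and Plugin.md for the actual interface.
class TitleExtractorPlugin:
    """Example plugin that pulls the <title> out of each crawled page."""

    def process(self, url: str, html: str) -> dict:
        # A real plugin could filter, transform, or enrich crawled data here.
        start = html.find("<title>")
        end = html.find("</title>")
        title = html[start + len("<title>"):end] if start != -1 and end != -1 else ""
        return {"url": url, "title": title.strip()}
```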
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/my-feature`).
- Commit your changes with clear and descriptive commit messages.
- Push your branch (`git push origin feature/my-feature`).
- Open a pull request detailing your changes and improvements.