A modern, scalable, and extensible web crawler designed for efficient distributed crawling and data extraction. Built using asynchronous I/O, robust logging, plugin architecture, and distributed task processing with Celery.
- Asynchronous Crawling: Utilizes aiohttp and asyncio for non-blocking network I/O, allowing concurrent requests and improved performance (a minimal sketch follows this list).
- Distributed Task Processing: Integrated with Celery and Redis to distribute crawling tasks across multiple workers and machines.
- Database Persistence: Uses PostgreSQL (via SQLAlchemy) to store crawled pages efficiently with indexing and transactional support.
- Robust Logging: A dedicated logging framework supports console output, rotating file logs, and SQLite logging for persistent diagnostics.
- Plugin Architecture: Easily extend the crawler with custom plugins for processing, filtering, or transforming crawled data without modifying core code.
- URL Normalization: Enhanced URL normalization removes trailing slashes and sorts query parameters to avoid duplicate processing.
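To illustrate the non-blocking fetch pattern the crawler relies on, here is a minimal, self-contained sketch using aiohttp and asyncio. It is not the project's spider.py; the function names, timeout, and concurrency limit are illustrative assumptions.

```python
# Minimal sketch of concurrent fetching with aiohttp + asyncio.
# Not the project's spider.py -- names, timeout, and limits are illustrative.
import asyncio
from typing import List

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    # Issue a non-blocking GET and return the response body as text.
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
        return await resp.text()


async def fetch_all(urls: List[str], concurrency: int = 8) -> List[str]:
    # A semaphore caps the number of requests in flight at once.
    sem = asyncio.Semaphore(concurrency)

    async def bounded_fetch(session: aiohttp.ClientSession, url: str) -> str:
        async with sem:
            return await fetch(session, url)

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(bounded_fetch(session, u) for u in urls))


if __name__ == "__main__":
    pages = asyncio.run(fetch_all(["https://example.com"]))
    print(len(pages[0]))
```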
The project is organized into modular components, each handling a specific aspect of the crawler:
- domain.py: Extracts and processes domain information from URLs.
- link_finder.py: Parses HTML using BeautifulSoup with the lxml parser and extracts hyperlinks.
- utils.py: Provides URL normalization and logging initialization utilities (see the normalization sketch after this list).
- storage.py: Manages database connectivity and persistence using SQLAlchemy and PostgreSQL.
- spider.py: Implements the core asynchronous crawler that fetches pages, processes content, and enqueues discovered links.
- plugin.py: Contains the plugin interface and a manager for registering and running custom plugins.
- tasks.py: Defines Celery tasks to enable distributed crawling across workers.
- main.py: Serves as the entry point for local crawling execution.
- config.yaml / config.py: Handles configuration and environment variable overrides.
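As a rough illustration of the kind of normalization described above (trailing-slash removal and query-parameter sorting), the snippet below uses only the standard library; the function name and exact rules are assumptions rather than the project's utils.py implementation.

```python
# Illustrative URL normalization: strip trailing slashes and sort query params.
# The function name and exact rules are assumptions, not the project's utils.py.
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def normalize_url(url: str) -> str:
    parts = urlparse(url)
    # Drop a trailing slash from the path ("/docs/" -> "/docs").
    path = parts.path.rstrip("/")
    # Sort query parameters so "?b=2&a=1" and "?a=1&b=2" map to the same URL.
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse((parts.scheme, parts.netloc, path, parts.params, query, parts.fragment))


print(normalize_url("https://example.com/docs/?b=2&a=1"))
# -> https://example.com/docs?a=1&b=2
```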
Make sure you have the following installed:
- Python 3.8+
- PostgreSQL
- Redis
- Clone the repo:

  ```bash
  git clone https://github.com/roshanlam/spider.git
  cd spider
  ```

- Create a Virtual Environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Configure PostgreSQL and Redis: Ensure your PostgreSQL database and Redis server are running, then update config.yaml with your database URL and Redis broker settings.
The crawler is configured via the config.yaml file. Here is an example configuration:
```yaml
threads: 8
rate_limit: 1  # seconds between requests
user_agent: "MyCrawler/1.0"
timeout: 10
start_url: "https://example.com"

database:
  url: "postgresql://user:password@localhost/crawlerdb"

celery:
  broker_url: "redis://localhost:6379/0"
  result_backend: "redis://localhost:6379/0"
```
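For reference, here is a minimal sketch of how such a file could be loaded with environment-variable overrides, in the spirit of config.py; the environment variable names and the load_config helper are hypothetical and may not match the project's actual code.

```python
# Hypothetical config loader: values from config.yaml, overridable via
# environment variables. Names here are illustrative, not the actual config.py.
import os

import yaml  # PyYAML


def load_config(path: str = "config.yaml") -> dict:
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    # Environment variables take precedence over values in the file.
    cfg["database"]["url"] = os.getenv("DATABASE_URL", cfg["database"]["url"])
    cfg["celery"]["broker_url"] = os.getenv("CELERY_BROKER_URL", cfg["celery"]["broker_url"])
    return cfg


if __name__ == "__main__":
    config = load_config()
    print(config["start_url"])
```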
To start the crawler locally:
```bash
python -m spider.main
```
This will initialize the crawler, load the configured start URL, and begin asynchronous crawling.
- Start the Celery Worker:

  ```bash
  celery -A spider.tasks.celery_app worker --loglevel=info
  ```
- Dispatch a Crawl Task: In a separate terminal, from the root of the project, run run_crawler.py. Make sure to change the URL in the file to the one you want to queue with the Celery workers (a rough sketch of such a script is shown below).

  ```bash
  python run_crawler.py
  ```
This will distribute the crawl task across available Celery workers.
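For orientation, run_crawler.py might look roughly like the sketch below. The task name crawl and its signature are assumptions made for illustration; check spider/tasks.py for the actual task names.

```python
# Hypothetical run_crawler.py: enqueue a crawl for one URL on the Celery broker.
# The task name (crawl) and its signature are assumptions -- see spider/tasks.py.
from spider.tasks import crawl

if __name__ == "__main__":
    # .delay() serializes the call and pushes it onto the Redis broker; any
    # worker started with `celery -A spider.tasks.celery_app worker` picks it up.
    result = crawl.delay("https://example.com")
    print(f"Queued crawl task: {result.id}")
```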
The crawler supports custom plugins to extend functionality. Plugins can be used to process, filter, or extract additional data from crawled pages. Read the Plugin.md file for more info.
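As a rough illustration only (the real interface lives in plugin.py and is documented in Plugin.md; the class and method names below are assumptions), a plugin might look something like this:

```python
# Illustrative plugin shape -- class and method names are assumptions;
# see plugin.py and Plugin.md for the actual interface.
class TitleExtractorPlugin:
    """Example plugin that pulls the <title> out of each crawled page."""

    def process(self, url: str, html: str) -> dict:
        # A real plugin could filter, transform, or enrich crawled data here.
        start = html.find("<title>")
        end = html.find("</title>")
        title = html[start + len("<title>"):end] if start != -1 and end != -1 else ""
        return {"url": url, "title": title.strip()}
```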
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch (`git checkout -b feature/my-feature`).
- Commit your changes with clear and descriptive commit messages.
- Push your branch (`git push origin feature/my-feature`).
- Open a pull request detailing your changes and improvements.