Matrix Web Scraper Installation and Running Guide

Prerequisites

Debian-based system (e.g., Debian 10 or Ubuntu 20.04)
sudo access

Installation Steps

Update system and install dependencies:

sudo apt update && sudo apt upgrade -y
sudo apt install -y git python3 python3-pip python3-venv tor proxychains4 iptables-persistent dnscrypt-proxy redis-server libpq-dev build-essential libssl-dev libffi-dev python3-dev postgresql postgresql-contrib

Clone the repository:

git clone https://github.com/alexfrontendfr/scrapy-playwright.git
cd scrapy-playwright

Set up virtual environment:

python3 -m venv venv
source venv/bin/activate

Install project dependencies:

pip install --upgrade pip
pip install -r requirements.txt
playwright install

Configure TOR:

sudo nano /etc/tor/torrc

Add or uncomment the following lines:

SocksPort 9050
ControlPort 9051
HashedControlPassword 16:01234567890ABCDEF01234567890ABCDEF01234567890ABCDEF01234567

Generate a hashed password:

tor --hash-password "your_password_here"

Replace the HashedControlPassword line with the generated hash. Restart TOR:

sudo systemctl restart tor
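
After the restart, it can help to confirm that TOR is actually listening on the two ports configured above. The following is a minimal sketch using only the Python standard library. It checks that a TCP connection can be opened and does not authenticate against the control port. The helper name port_is_open is illustrative and not part of the project.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the torrc lines above (SocksPort and ControlPort).
    for name, port in (("SOCKS", 9050), ("control", 9051)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"TOR {name} port {port}: {state}")
```

If either port is reported as not reachable, checking sudo systemctl status tor is a reasonable next step.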

Configure ProxyChains:

sudo nano /etc/proxychains4.conf

Add the following line at the end of the file:

socks5 127.0.0.1 9050

Configure firewall:

Create configure_firewall.sh in the project root and add the content from step 5 in the previous message. Apply firewall rules:

sudo bash configure_firewall.sh

Configure DNSCrypt-proxy:

sudo nano /etc/dnscrypt-proxy/dnscrypt-proxy.toml

Update the settings as mentioned in step 6 of the previous message. Restart DNSCrypt-proxy:

sudo systemctl restart dnscrypt-proxy

Update resolv.conf:

echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
echo "options edns0" | sudo tee -a /etc/resolv.conf

Set up PostgreSQL:

sudo -u postgres psql -c "CREATE DATABASE matrix_scraper;"
sudo -u postgres psql -c "CREATE USER scraper_user WITH PASSWORD 'your_password';"
sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE matrix_scraper TO scraper_user;"

Configure environment variables:

Create a .env file in the project root:
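
The contents of the .env file are not reproduced in this guide. The database initialization step reads a DATABASE_URL variable, so the file needs at least that entry. As a rough sketch, the value could be composed as shown here. It assumes the database and user created in the PostgreSQL step, a local server on the default port 5432, and a standard postgresql:// URL as accepted by SQLAlchemy. The helper name build_database_url is hypothetical, and any other variables the project expects are not known from this guide.

```python
from urllib.parse import quote


def build_database_url(user, password, database, host="localhost", port=5432):
    """Compose a postgresql:// URL, percent-encoding the credentials."""
    # Encoding keeps characters such as '@' or '/' in a password from
    # being misread as URL delimiters.
    user_enc = quote(user, safe="")
    password_enc = quote(password, safe="")
    return f"postgresql://{user_enc}:{password_enc}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Placeholder credentials matching the PostgreSQL setup commands.
    url = build_database_url("scraper_user", "your_password", "matrix_scraper")
    print(f"DATABASE_URL={url}")
```

The printed line can be pasted into .env after substituting the real password.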

Initialize the database:

python

In the Python interpreter:

from scraper_bot.models import Base
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv

load_dotenv()

engine = create_engine(os.getenv('DATABASE_URL'))
Base.metadata.create_all(engine)

exit()

Running the Project

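Start Redis server:

sudo systemctl start redis-server

Start Celery worker (leave it running in its own terminal, with the virtual environment active):

celery -A scraper_bot.tasks worker --loglevel=info

Run the Flask application (in a second terminal, again with the virtual environment active):

python scraper_bot/app.py

Access the web interface:

Open a web browser and navigate to http://localhost:5000

Usage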

Enter your search query in the provided field.
Select the desired search engines (Google, Bing, DuckDuckGo, Onion).
Set the result limit.
Choose whether to use TOR for anonymous scraping.
Click "Start Search" and wait for the results.

Maintenance

To clear the cache, send a POST request to http://localhost:5000/clear_cache
To renew the TOR IP, send a POST request to http://localhost:5000/renew_tor_ip
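
Both maintenance requests can be sent from a short script. This is a minimal sketch using only the Python standard library. It assumes the endpoints accept an empty request body and need no authentication, neither of which this guide confirms. The helper names build_post and send are illustrative.

```python
import urllib.request

BASE_URL = "http://localhost:5000"


def build_post(path: str) -> urllib.request.Request:
    """Build an empty-body POST request for a maintenance endpoint."""
    return urllib.request.Request(f"{BASE_URL}{path}", data=b"", method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    for path in ("/clear_cache", "/renew_tor_ip"):
        try:
            print(path, send(build_post(path)))
        except OSError as exc:
            # URLError and HTTPError are OSError subclasses; this also
            # covers the case where the Flask application is not running.
            print(path, "failed:", exc)
```

The Flask application must already be running for either request to succeed.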

Troubleshooting

If you encounter any issues with TOR, try restarting the service:

sudo systemctl restart tor

If the database connection fails, ensure PostgreSQL is running:

sudo systemctl status postgresql

For any other issues, check the scraper.log file for error messages.
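
A long log file is easier to scan when filtered down to the most recent error lines. The sketch below assumes scraper.log is a plain-text file in the current directory and that error lines contain a marker such as ERROR, CRITICAL, or Traceback. The actual log format is not documented here, so the keywords may need adjusting. The helper name tail_matching is illustrative.

```python
from collections import deque
from pathlib import Path


def tail_matching(path, keywords=("ERROR", "CRITICAL", "Traceback"), limit=20):
    """Return the last `limit` lines of the file that contain any keyword."""
    recent = deque(maxlen=limit)  # keeps only the newest matches
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if any(keyword in line for keyword in keywords):
                recent.append(line.rstrip("\n"))
    return list(recent)


if __name__ == "__main__":
    log_path = Path("scraper.log")
    if log_path.exists():
        for line in tail_matching(log_path):
            print(line)
    else:
        print(f"{log_path} not found in the current directory")
```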
