- Python >= 3.7
- Scrapy >= 2.0 (!= 2.4.0)
- Playwright >= 1.15
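
The version constraints above can be checked with a few lines of Python. This is only a minimal sketch: the helper names `parse_version` and `satisfies` are made up for illustration, and the parser assumes plain numeric versions such as `2.4.0` (no pre-release suffixes).

```python
import sys

def parse_version(text, width=3):
    """Turn '2.4.0' into (2, 4, 0); short versions are padded with zeros."""
    parts = [int(p) for p in text.split(".")[:width]]
    return tuple(parts + [0] * (width - len(parts)))

def satisfies(installed, minimum, excluded=()):
    """Return True if `installed` is >= `minimum` and not an excluded version."""
    version = parse_version(installed)
    if version in {parse_version(e) for e in excluded}:
        return False
    return version >= parse_version(minimum)

# Constraints copied from the list above: (minimum, excluded versions).
REQUIREMENTS = {
    "scrapy": ("2.0", ("2.4.0",)),
    "playwright": ("1.15", ()),
}

# The interpreter itself must be Python 3.7 or newer.
python_ok = sys.version_info >= (3, 7)
print("Python version OK:", python_ok)
```

On Python 3.8 or newer, the installed version string can be read with `importlib.metadata.version("scrapy")` and passed to `satisfies` together with the entry from `REQUIREMENTS`.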
Create the virtual environment:
python -m venv env
Activate it:
source env/bin/activate
Install the dependencies:
pip3 install -r requirements.txt
Change into the project directory (in my case, ratescrapy).
Install the required browsers:
playwright install
It's also possible to install only a subset of the available browsers:
playwright install firefox chromium
Run a spider with the project's configuration:
scrapy crawl <spider_name>
To save the scraped items to a .json file, add the -o option, for example:
scrapy crawl <spider_name> -o output.json
You can also create your own custom settings.
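
If you launch crawls from a script, the commands above can be assembled in Python. This is just a sketch: `build_crawl_command` and the spider name `rates` are hypothetical, while `-o` and `-s NAME=VALUE` are standard Scrapy command-line options and `DOWNLOAD_DELAY` is a standard Scrapy setting.

```python
import shlex

def build_crawl_command(spider_name, output=None, settings=None):
    """Build the argument list for `scrapy crawl`, ready for subprocess.run()."""
    command = ["scrapy", "crawl", spider_name]
    if output:
        # -o writes the scraped items to a file; the format follows the extension.
        command += ["-o", output]
    for name, value in (settings or {}).items():
        # -s overrides a single setting for this run only.
        command += ["-s", f"{name}={value}"]
    return command

# "rates" is a placeholder spider name, not one taken from this repo.
cmd = build_crawl_command("rates", output="output.json",
                          settings={"DOWNLOAD_DELAY": 2})
print(" ".join(shlex.quote(part) for part in cmd))
# prints: scrapy crawl rates -o output.json -s DOWNLOAD_DELAY=2
```

Passing overrides with `-s` is handy for one-off runs; settings you want to keep are probably better placed in the project's settings.py.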
- This Scrapy project respects robots.txt (ROBOTSTXT_OBEY = True).
- The repo needs some updates. For example, for a leaner environment and better performance, you can remove Selenium and use only Playwright.
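
To get a feel for what ROBOTSTXT_OBEY = True means in practice, here is a small sketch using Python's standard `urllib.robotparser`. It is not how Scrapy implements the check internally; the robots.txt content, the user agent name and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real site serves this at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(ROBOTS_TXT, "ratescrapy", "https://example.com/rates"))
print(is_allowed(ROBOTS_TXT, "ratescrapy", "https://example.com/private/data"))
```

With the setting enabled, Scrapy skips requests that the target site's robots.txt disallows, so a spider that returns fewer pages than expected may simply be hitting a disallowed path.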