Simple scraper for Diario de Leon. It scrapes the sitemap and saves the results in a JSON file.
Set up the virtual environment and install the requirements with Poetry:
poetry install
Run the scraper with the following command:
scrapy crawl diariodeleon
Cache: The scraper uses HttpCacheMiddleware to cache requests, so stopping it mid-run is not a problem: the next time you run it, all previously scraped pages are replayed from the cache very quickly, avoiding new requests to diariodeleon.com.
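The exact values in this project's settings.py may differ; a typical Scrapy cache configuration looks like this (values shown are illustrative, not necessarily this project's):

```python
# settings.py -- enable Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"        # cached responses stored under .scrapy/httpcache
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```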
Policy: The scraper uses a very polite scraping policy, so you don't have to worry about being blocked.
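A polite policy in Scrapy usually combines a download delay, low concurrency, and autothrottling. The settings below are a sketch of such a configuration; the concrete numbers in this project's settings.py may be different:

```python
# settings.py -- polite crawling configuration (illustrative values)
ROBOTSTXT_OBEY = True              # respect the site's robots.txt
DOWNLOAD_DELAY = 2                 # wait between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1 # one request at a time per domain
AUTOTHROTTLE_ENABLED = True        # back off automatically under server load
```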
Proxies: If you want to use proxies, set them in the ROTATING_PROXY_LIST variable in the settings.py file.
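ROTATING_PROXY_LIST is the setting used by the scrapy-rotating-proxies middleware; assuming that library is the one in use here, the configuration would look roughly like this (the proxy addresses are placeholders):

```python
# settings.py -- proxy rotation via scrapy-rotating-proxies (addresses are placeholders)
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```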
The results are saved in a JSON file inside the /results directory.
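If the export is driven by Scrapy's feed exports (an assumption; the project may instead use a custom pipeline), the output location would be configured roughly like this:

```python
# settings.py -- feed export writing items to a JSON file in /results
FEEDS = {
    "results/items.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```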