Simple scraper for Diario de Leon. It scrapes the sitemap and saves the results in a JSON file.
Set up the virtual environment and install the requirements with Poetry:
poetry install
Run the scraper with the following command:
scrapy crawl diariodeleon
Cache: The scraper uses HttpCacheMiddleware to cache requests, so stopping it mid-run is not a problem: the next time you run it, all previously scraped pages are replayed from the cache very quickly, avoiding new requests to diariodeleon.com.
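The exact values in this project's settings.py may differ; a typical Scrapy cache configuration looks like this (values shown are illustrative, not necessarily this project's):

```python
# settings.py -- enable Scrapy's built-in HTTP cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"        # cached responses stored under .scrapy/httpcache
HTTPCACHE_EXPIRATION_SECS = 0      # 0 = cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```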
Policy: The scraper uses a very polite scraping policy, so you don't have to worry about being blocked.
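A polite policy in Scrapy usually combines a download delay, low concurrency, and autothrottling. The settings below are a sketch of such a configuration; the concrete numbers in this project's settings.py may be different:

```python
# settings.py -- polite crawling configuration (illustrative values)
ROBOTSTXT_OBEY = True              # respect the site's robots.txt
DOWNLOAD_DELAY = 2                 # wait between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1 # one request at a time per domain
AUTOTHROTTLE_ENABLED = True        # back off automatically under server load
```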
Proxies: If you want to use proxies, set them in the ROTATING_PROXY_LIST variable in the settings.py file.
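ROTATING_PROXY_LIST is the setting used by the scrapy-rotating-proxies middleware; assuming that library is the one in use here, the configuration would look roughly like this (the proxy addresses are placeholders):

```python
# settings.py -- proxy rotation via scrapy-rotating-proxies (addresses are placeholders)
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```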
The results are saved in a JSON file inside the /results directory.
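If the export is driven by Scrapy's feed exports (an assumption; the project may instead use a custom pipeline), the output location would be configured roughly like this:

```python
# settings.py -- feed export writing items to a JSON file in /results
FEEDS = {
    "results/items.json": {
        "format": "json",
        "encoding": "utf8",
    },
}
```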