Hectruendo/diariodeleon-scraper

Scraper for Diario de Leon

A simple scraper for Diario de Leon. It crawls the site's sitemap and saves the scraped results in a JSON file.
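To illustrate what reading a sitemap involves, here is a minimal sketch that pulls the page URLs out of a sitemap document using only the Python standard library. It is not the code used by this project, which runs on Scrapy; the function name `extract_urls` and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap or sitemap index."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample document; the URLs are placeholders.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/1</loc></url>
  <url><loc>https://example.com/news/2</loc></url>
</urlset>"""

print(extract_urls(SAMPLE))
```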

Installation

Set up the virtual environment and install the requirements with Poetry:

poetry install

Usage

Run the scraper with the following command:

scrapy crawl diariodeleon

Cache: The scraper uses Scrapy's HttpCacheMiddleware to cache requests, so stopping it midway is not a big deal. The next time you run it, previously scraped pages are served from the cache very quickly, without making new requests to diariodeleon.com.
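For reference, HTTP caching in Scrapy is controlled by a handful of settings. The fragment below is a sketch of what such a configuration can look like in settings.py; the values are illustrative assumptions and may differ from the ones this project actually uses.

```python
# settings.py (sketch; values are assumptions, not copied from this repository)
HTTPCACHE_ENABLED = True        # activates HttpCacheMiddleware
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = "httpcache"     # relative paths live under the project's .scrapy directory
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```

Deleting the cache directory forces the next run to fetch every page again.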

Policy: The scraper uses a very polite scraping policy, so you should not have to worry about being blocked.
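A polite policy in Scrapy usually comes down to a few settings that limit request rate and respect robots.txt. The fragment below is only a sketch of that idea; the setting names are standard Scrapy settings, but the values are assumptions and are not taken from this project's settings.py.

```python
# settings.py (sketch; values are assumptions, not copied from this repository)
ROBOTSTXT_OBEY = True                # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                 # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # never hit the domain with parallel requests
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 1.0       # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0        # upper bound for the delay when the server is slow
```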

Proxies: To use proxies, add them to the ROTATING_PROXY_LIST variable in settings.py.
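As a sketch, and assuming ROTATING_PROXY_LIST is read by the scrapy-rotating-proxies package (the setting name matches that package, but this README does not confirm it), the configuration could look like this. The proxy addresses are placeholders, and the middleware entries may already be present in this project's settings.py.

```python
# settings.py (sketch; proxy addresses are placeholders)
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

# Middlewares required by scrapy-rotating-proxies (assumed to be the package in use).
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```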

Results

The results are saved in a JSON file inside the /results directory.
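The output can be inspected with a few lines of Python. The sketch below loads every JSON file found in a results directory; the helper name `load_results` is hypothetical, and the exact file name and item structure are not documented here, so the code makes no assumption about them beyond the files being valid JSON.

```python
import json
from pathlib import Path


def load_results(results_dir: str = "results") -> dict:
    """Map each JSON file name in results_dir to its parsed content."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(results_dir).glob("*.json"))
    }


if __name__ == "__main__":
    # Run from the project root so that "results" resolves correctly.
    for name, items in load_results().items():
        print(f"{name}: {len(items)} top-level entries")
```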
