This is a Python-based website crawling script equipped with the following capabilities to evade bot detection and avoid getting blocked:
- Random time intervals
- User-Agent switching
- IP rotation through proxy server
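The script's own implementation is not shown here, but the three evasion techniques above can be sketched roughly as follows. The function names, the User-Agent pool, and the proxy addresses are illustrative assumptions, not taken from the script; substitute your own values.

```python
import random
import time
import urllib.request

# Example pools -- replace with real User-Agent strings and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://203.0.113.1:8080", "http://203.0.113.2:8080"]


def randomized_fetch_config(user_agents, proxies):
    """Pick a random User-Agent (UA switching) and proxy (IP rotation)."""
    return random.choice(user_agents), random.choice(proxies)


def polite_get(url, user_agents=USER_AGENTS, proxies=PROXIES,
               min_delay=1.0, max_delay=5.0):
    """Fetch url after a random pause, through a random proxy and UA."""
    # Random time interval between requests.
    time.sleep(random.uniform(min_delay, max_delay))
    ua, proxy = randomized_fetch_config(user_agents, proxies)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    opener.addheaders = [("User-Agent", ua)]
    return opener.open(url, timeout=10).read().decode("utf-8", "replace")
```

This sketch uses only the standard library; the actual script may rely on the packages in requirements.txt instead.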
The script has been tested on Python 3.6, but it should work with later versions of Python too.
Install the libraries listed in requirements.txt with the following command:
python -m pip install -r requirements.txt
Run the script with the following command:
python websitescrap.py https://www.wikipedia.org
To restart the crawl from scratch at the supplied address, use the following command:
python websitescrap.py https://www.wikipedia.org --start_afresh true
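One plausible way the command line above could be wired up is with argparse; the argument names below are inferred from the commands shown, and the parser itself is an assumption about the script's internals.

```python
import argparse


def parse_args(argv=None):
    """Parse the crawl URL and the optional --start_afresh flag."""
    parser = argparse.ArgumentParser(description="Crawl a website politely.")
    parser.add_argument("url", help="starting URL for the crawl")
    parser.add_argument("--start_afresh", default="false",
                        help="pass 'true' to discard saved state and restart")
    return parser.parse_args(argv)
```

For example, `parse_args(["https://www.wikipedia.org", "--start_afresh", "true"])` yields an object with `url` and `start_afresh` attributes the crawler can consult before resuming or clearing its saved state.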
The output is a set of JSON files (stored in the /data/ directory), where each file contains the raw HTML content of a webpage in the following format:
{"url": "raw html content of the webpage associated with URL"}