This is a Python-based website crawling script equipped with the following capabilities to evade bot detection and avoid getting blocked:
- Random time intervals
- User-Agent switching
- IP rotation through proxy server
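The script's own implementation is not shown here, but the three evasion techniques above can be sketched roughly as follows. The function names, the User-Agent pool, and the proxy addresses are illustrative assumptions, not taken from the script; substitute your own values.

```python
import random
import time
import urllib.request

# Example pools -- replace with real User-Agent strings and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]
PROXIES = ["http://203.0.113.1:8080", "http://203.0.113.2:8080"]


def randomized_fetch_config(user_agents, proxies):
    """Pick a random User-Agent (UA switching) and proxy (IP rotation)."""
    return random.choice(user_agents), random.choice(proxies)


def polite_get(url, user_agents=USER_AGENTS, proxies=PROXIES,
               min_delay=1.0, max_delay=5.0):
    """Fetch url after a random pause, through a random proxy and UA."""
    # Random time interval between requests.
    time.sleep(random.uniform(min_delay, max_delay))
    ua, proxy = randomized_fetch_config(user_agents, proxies)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    opener.addheaders = [("User-Agent", ua)]
    return opener.open(url, timeout=10).read().decode("utf-8", "replace")
```

This sketch uses only the standard library; the actual script may rely on the packages in requirements.txt instead.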
The script has been tested on Python 3.6, but it should work with later versions of Python too.
Install the libraries listed in requirements.txt with the following command:
python -m pip install -r requirements.txt
Run the script with the following command:
python websitescrap.py https://www.wikipedia.org
To restart the crawl from scratch at the supplied address, use the following command:
python websitescrap.py https://www.wikipedia.org --start_afresh true
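One plausible way the command line above could be wired up is with argparse; the argument names below are inferred from the commands shown, and the parser itself is an assumption about the script's internals.

```python
import argparse


def parse_args(argv=None):
    """Parse the crawl URL and the optional --start_afresh flag."""
    parser = argparse.ArgumentParser(description="Crawl a website politely.")
    parser.add_argument("url", help="starting URL for the crawl")
    parser.add_argument("--start_afresh", default="false",
                        help="pass 'true' to discard saved state and restart")
    return parser.parse_args(argv)
```

For example, `parse_args(["https://www.wikipedia.org", "--start_afresh", "true"])` yields an object with `url` and `start_afresh` attributes the crawler can consult before resuming or clearing its saved state.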
The output is a set of JSON files (stored in the /data/ directory), where each file contains the raw HTML content of a webpage in the following format:
{"url": "raw html content of the webpage associated with URL"}