MapReduce Web Crawler application for the course Big Data Techniques
The main components of the application are:
- Information Gathering - web crawler
  - Web Crawler (crawler.py) -> takes a queue of URLs, crawls each website, stores the HTML resources, and extracts all links found in the pages (a sketch follows this list)
  - Robot Parser (robot_parser.py) -> checks whether robots.txt allows crawling a given page
- MapReduce - parallel application using MPI
  - Master -> sends the links to the workers to be processed in two phases: map and reduce (an MPI sketch follows this list)
  - Worker -> processes the links and stores the results to the file system
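The crawl loop and the robots.txt check could fit together roughly as in the minimal sketch below. It assumes requests and BeautifulSoup are installed; urllib.robotparser comes from the standard library, and all function and variable names are hypothetical, not the actual crawler.py / robot_parser.py API.

```python
# Minimal sketch of the crawl loop; all names are illustrative,
# not the actual crawler.py / robot_parser.py API.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup


def robots_allows(url, agent="*"):
    """Check the site's robots.txt before crawling (role of robot_parser.py)."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return False
    return parser.can_fetch(agent, url)


def crawl(seed_urls, max_pages=50):
    """Pop URLs from a queue, keep the HTML, and collect outgoing links."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots_allows(url):
            continue
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.RequestException:
            continue
        pages[url] = response.text  # crawler.py stores this on disk instead
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:  # avoid re-queueing visited pages
                seen.add(link)
                queue.append(link)
    return pages
```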
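The master/worker exchange over MPI could follow the classic pattern below, sketched with mpi4py. The message tags, the link chunking, and the word-count payload are assumptions for illustration, not the actual master_worker.py / map_reduce.py protocol.

```python
# Illustrative master/worker pattern with mpi4py; tags, chunking, and the
# word-count reduce are assumptions, not the actual master_worker.py code.
# Run with at least two processes, e.g.: mpiexec -np 4 python sketch.py
from collections import Counter

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Master: split the crawled links among the workers (map phase) ...
    links = ["http://example.com/a", "http://example.com/b"]  # from the crawler
    chunks = [links[i::size - 1] for i in range(size - 1)]
    for worker, chunk in enumerate(chunks, start=1):
        comm.send(chunk, dest=worker, tag=1)
    # ... then gather and merge the partial results (reduce phase).
    totals = Counter()
    for worker in range(1, size):
        totals += comm.recv(source=worker, tag=2)
    print(totals)  # master_worker.py would write this to the file system
else:
    # Worker: process the received links and return a partial result.
    chunk = comm.recv(source=0, tag=1)
    partial = Counter(token for url in chunk for token in url.split("/") if token)
    comm.send(partial, dest=0, tag=2)
```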
The structure of the application is:
map-reduce-crawler
├── application
| ├── files
| ├── output
| ├── modules
| | ├── __init__.py
| | ├── crawler.py
| | ├── map_reduce.py
| | ├── master_worker.py
| | └── robot_parser.py
| ├── __init__.py
| └── __main__.py
├── README.md
├── requirements.txt
└── setup.py
The application is installed and run in the following steps:
- Cloning the repository:
git clone https://github.com/grigoras.alexandru/web-crawler.git
- Entering the application folder:
cd web-crawler/
- Creating a virtual environment:
virtualenv ENVIRONMENT_NAME
- Activating the virtual environment:
source ENVIRONMENT_NAME/bin/activate
- Installing:
python setup.py install
- Running:
- Crawler + MapReduce:
python -m application
- (Optional) MapReduce:
mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py
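For example, running

mpiexec -np 4 python application/modules/master_worker.py

starts four MPI processes; under the master/worker split described above, that would typically be one master and three workers.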
The application is licensed under the MIT License.