MapReduce Web Crawler application for the course Big Data Techniques
The main components of the application are:
- Information Gathering - web crawler
  - Web Crawler (crawler.py) -> takes a queue of URLs, crawls each website, stores the HTML resources, and extracts all links found in the pages (a sketch follows this list)
  - Robot Parser (robot_parser.py) -> checks whether robots.txt allows crawling a given page
- MapReduce - parallel application using MPI
  - Master -> sends the links to the workers to be processed in two phases: map and reduce (an MPI sketch follows this list)
  - Worker -> processes the links and stores the results to the file system
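The crawl loop and the robots.txt check could fit together roughly as in the minimal sketch below. It assumes requests and BeautifulSoup are installed; urllib.robotparser comes from the standard library, and all function and variable names are hypothetical, not the actual crawler.py / robot_parser.py API.

```python
# Minimal sketch of the crawl loop; all names are illustrative,
# not the actual crawler.py / robot_parser.py API.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup


def robots_allows(url, agent="*"):
    """Check the site's robots.txt before crawling (role of robot_parser.py)."""
    parts = urlparse(url)
    parser = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return False
    return parser.can_fetch(agent, url)


def crawl(seed_urls, max_pages=50):
    """Pop URLs from a queue, keep the HTML, and collect outgoing links."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if not robots_allows(url):
            continue
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
        except requests.RequestException:
            continue
        pages[url] = response.text  # crawler.py stores this on disk instead
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in seen:  # avoid re-queueing visited pages
                seen.add(link)
                queue.append(link)
    return pages
```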
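The master/worker exchange over MPI could follow the classic pattern below, sketched with mpi4py. The message tags, the link chunking, and the word-count payload are assumptions for illustration, not the actual master_worker.py / map_reduce.py protocol.

```python
# Illustrative master/worker pattern with mpi4py; tags, chunking, and the
# word-count reduce are assumptions, not the actual master_worker.py code.
# Run with at least two processes, e.g.: mpiexec -np 4 python sketch.py
from collections import Counter

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Master: split the crawled links among the workers (map phase) ...
    links = ["http://example.com/a", "http://example.com/b"]  # from the crawler
    chunks = [links[i::size - 1] for i in range(size - 1)]
    for worker, chunk in enumerate(chunks, start=1):
        comm.send(chunk, dest=worker, tag=1)
    # ... then gather and merge the partial results (reduce phase).
    totals = Counter()
    for worker in range(1, size):
        totals += comm.recv(source=worker, tag=2)
    print(totals)  # master_worker.py would write this to the file system
else:
    # Worker: process the received links and return a partial result.
    chunk = comm.recv(source=0, tag=1)
    partial = Counter(token for url in chunk for token in url.split("/") if token)
    comm.send(partial, dest=0, tag=2)
```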
The structure of the application is:
map-reduce-crawler
├── application
| ├── files
| ├── output
| ├── modules
| | ├── __init__.py
| | ├── crawler.py
| | ├── map_reduce.py
| | ├── master_worker.py
| | └── robot_parser.py
| ├── __init__.py
| └── __main__.py
├── README.md
├── requirements.txt
└── setup.py
The application is installed and run in the following steps:
- Cloning the repository:
git clone https://github.com/grigoras.alexandru/web-crawler.git
- Entering the application folder:
cd web-crawler/
- Creating a virtual environment:
virtualenv ENVIRONMENT_NAME
- Activating the virtual environment:
source ENVIRONMENT_NAME/bin/activate
- Installing:
python setup.py install
- Running:
- Crawler + MapReduce:
python -m application
- (Optional) MapReduce:
mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py
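For example, running

mpiexec -np 4 python application/modules/master_worker.py

starts four MPI processes; under the master/worker split described above, that would typically be one master and three workers.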
The application is licensed under the MIT License.