
Map-Reduce Web Crawler

Description

A MapReduce web crawler application developed for the Big Data Techniques course.

Architecture

The main components of the application are:

  1. Information Gathering - web crawler

    1. Web Crawler (crawler.py) -> takes a queue of URLs, crawls each website, stores the HTML resource, and parses all links found in the pages
    2. Robot Parser (robot_parser.py) -> checks whether robots.txt allows a page to be crawled (a combined sketch of both modules follows this list)
  2. MapReduce - parallel application using MPI

    1. Master -> sends the links to the workers to be processed in two phases: map and reduce
    2. Worker -> processes the links and stores the results to the file system (an MPI master/worker sketch also follows this list)
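
A minimal sketch of how the information-gathering phase could fit together, using only the Python standard library (urllib.request, urllib.robotparser, html.parser). The function names and the breadth-first queue are illustrative assumptions, not the exact contents of crawler.py or robot_parser.py:

```python
import urllib.request
import urllib.robotparser
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def allowed_by_robots(url, user_agent="*"):
    """Check whether robots.txt permits crawling the given URL."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        parser.read()
    except OSError:
        return True  # no robots.txt reachable: assume crawling is allowed
    return parser.can_fetch(user_agent, url)


def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl: store each page's HTML and enqueue its links."""
    queue = deque(seed_urls)
    visited, pages = set(), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in visited or not allowed_by_robots(url):
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        pages[url] = html  # the real crawler writes the resource to the file system
        link_parser = LinkParser()
        link_parser.feed(html)
        queue.extend(urljoin(url, link) for link in link_parser.links)
    return pages
```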

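The MPI master/worker exchange might look like the following sketch, assuming the mpi4py bindings and a simple word-count reduce; the message tags, the round-robin distribution, and the map function are illustrative, not necessarily what master_worker.py implements:

```python
from collections import Counter

from mpi4py import MPI  # assumed MPI binding; the repository may use a different one

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()  # needs at least 2 processes: 1 master + 1 or more workers


def map_phase(document):
    """Map step: emit word counts for a single crawled document."""
    return Counter(document.lower().split())


if rank == 0:
    # Master: hand one document to each worker in round-robin order.
    documents = ["big data techniques", "web crawler for big data"]
    for i, doc in enumerate(documents):
        comm.send(doc, dest=1 + i % (size - 1), tag=1)
    for dest in range(1, size):
        comm.send(None, dest=dest, tag=0)  # tag 0 tells a worker to stop

    # Reduce step: merge the partial counts returned by the workers.
    totals = Counter()
    for _ in documents:
        totals += comm.recv(source=MPI.ANY_SOURCE, tag=2)
    print(dict(totals))
else:
    # Worker: map incoming documents until the stop signal arrives.
    while True:
        status = MPI.Status()
        doc = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == 0:
            break
        comm.send(map_phase(doc), dest=0, tag=2)
```

A sketch like this would be launched the same way as the optional MapReduce run in the Execution section, e.g. mpiexec -np 4 python sketch.py.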
Application structure

map-reduce-crawler
├── application
|   ├── files
|   ├── output
|   ├── modules
|   |   ├── __init__.py
|   |   ├── crawler.py
|   |   ├── map_reduce.py
|   |   ├── master_worker.py
|   |   └── robot_parser.py
|   ├── __init__.py
|   └── __main__.py
├── README.md
├── requirements.txt
└── setup.py

Execution

The application is set up and run with the following steps:

  1. Cloning the repository: git clone https://github.com/alexgrigoras/web_crawler.git
  2. Entering the application folder: cd web_crawler/
  3. Creating virtual environment: virtualenv ENVIRONMENT_NAME
  4. Selecting virtual environment: source ENVIRONMENT_NAME/bin/activate
  5. Installing: python setup.py install
  6. Running:
    1. Crawler + MapReduce: python -m application
    2. (Optional) MapReduce: mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py

License

The application is licensed under the MIT License.