Wikimapia scraper

A service that scrapes points of interest in a given country.


Installation

First, install Docker and docker-compose, as well as Python (preferably version 3.8.3).
After cloning the repository, install the required Python packages (from the repository's main folder):

$ python -m pip install -r ./docker/scraper/requirements.txt

From the main folder, cd into ./docker/docker-compose and run docker-compose up:

$ cd ./docker/docker-compose
$ docker-compose up

Then, in a new terminal, cd back to the repository's main folder and run the main script, specifying a country, a scraping source (api or html), and an output JSON file:

$ python ./source_code/main.py -c france -s api -o france.json

Or, for more info about the CLI:

$ python ./source_code/main.py --help
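
Once a run finishes, a quick way to sanity-check the output file (a minimal sketch, assuming the file is a standard GeoJSON FeatureCollection, which this README does not guarantee):

# Count the points of interest in the generated GeoJSON file.
import json

with open("france.json", encoding="utf-8") as f:
    data = json.load(f)

print(f"{len(data.get('features', []))} features in france.json")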

Architecture

This tool consists of three main pieces:

  • A MongoDB container
  • A Tor container
  • The main script

To use the tool, first run docker-compose, which starts the two containers.
Then, when you run the script from the repository's main folder (as shown above), it checks the database to see whether data for the given country has already been collected.
If it hasn't, the script scrapes Wikimapia for that data and inserts it into the database.
The Tor container is needed to bypass the server's per-IP connection limit: Tor acts as a proxy through which the requests are routed.
In either case, a GeoJSON file is written to the path specified on the CLI.
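
A minimal sketch of this flow (not the actual source_code/main.py; the host names, ports, and the "wikimapia"/"places" database and collection names are assumptions):

# Sketch of the architecture described above, NOT the project's real code.
# Assumed: MongoDB on localhost:27017, Tor SOCKS proxy on localhost:9050,
# placeholder database/collection names ("wikimapia"/"places").
import requests                      # needs requests[socks] for the Tor proxy
from pymongo import MongoClient

COUNTRY = "france"

# 1. Check the DB: has this country already been collected?
client = MongoClient("mongodb://localhost:27017")
places = client["wikimapia"]["places"]

if places.count_documents({"country": COUNTRY}) == 0:
    # 2. Not collected yet: scrape Wikimapia through the Tor SOCKS proxy
    #    to work around the per-IP connection limit.
    proxies = {
        "http": "socks5h://localhost:9050",
        "https": "socks5h://localhost:9050",
    }
    response = requests.get("https://wikimapia.org/", proxies=proxies, timeout=30)
    # ... parse the points of interest and insert them into the DB ...
    places.insert_one({"country": COUNTRY, "status": response.status_code})

# 3. In either case, the collected data is exported as GeoJSON (omitted here).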


Configuration

The configuration files in this project are docker-compose.yml (which will usually work fine without modification) and .env (which should also work as-is). You may wish to adjust the TOR_SWITCH_IP_EVERY field to match the scraping source you've chosen (see the example after the list below).

  • API
    Scraping via the API may yield richer information but can be slower. It requires an IP switch every 3 requests to avoid being blocked.
  • HTML
    Scraping via HTML yields less information but requires fewer IP changes. The recommended setting is an IP switch every 20 requests.
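
For example, a hypothetical .env sketch (only TOR_SWITCH_IP_EVERY is named in this README; keep any other values that ship with the repository unchanged):

# Switch the Tor exit IP every N requests.
TOR_SWITCH_IP_EVERY=3     # recommended when scraping with the API source
# TOR_SWITCH_IP_EVERY=20  # recommended when scraping with the HTML source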