orireiter/wikimapia_scraper

Wikimapia scraper

A service that scrapes points of interest in a given country.


Installation

First, install Docker and docker-compose, as well as Python (preferably version 3.8.3).
After cloning the repository, install the required Python packages (from the repository's main folder):

$ python -m pip install -r ./docker/scraper/requirements.txt

From the main folder, cd into docker/docker-compose and run docker-compose up there:

$ cd ./docker/docker-compose
$ docker-compose up

Then, in a new terminal, cd back to the repository's main folder and run the main script with a country, a scraping source (api or html) and an output JSON file:

$ python ./source_code/main.py -c france -s api -o france.json

Or, for more info about the CLI:

$ python ./source_code/main.py --help
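
For orientation, here is a minimal sketch of how a CLI with these three flags could be declared with argparse. Only the short flags (-c, -s, -o) and the api/html choices come from the command above. The long option names, the help texts and the build_parser function are assumptions made for illustration and may not match what main.py actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown above."""
    parser = argparse.ArgumentParser(
        description="Scrape points of interest in a given country."
    )
    # Long option names below are assumed; only -c, -s and -o appear in the README.
    parser.add_argument("-c", "--country", required=True,
                        help="country to scrape, e.g. france")
    parser.add_argument("-s", "--source", required=True, choices=["api", "html"],
                        help="scraping source")
    parser.add_argument("-o", "--output", required=True,
                        help="path of the output JSON file")
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(["-c", "france", "-s", "api", "-o", "france.json"])
print(args.country, args.source, args.output)
```

Restricting the source to a fixed set of choices makes argparse reject a mistyped value before any scraping starts.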

Architecture

This tool consists of 3 main pieces:

  • MongoDB container
  • Tor container
  • Main Script

To use it, first run docker-compose, which starts the two containers.
Then, when the script is run from the repository's main folder (as shown above), it checks the DB to see whether the data for the given country has already been collected.
If it hasn't, the script scrapes wikimapia for that data and inserts it into the DB.
The Tor container acts as a proxy, which lets the scraper bypass the server's connection limit (which is set per IP).
In either case, a GeoJSON is written to the file specified in the CLI.
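
The flow described above (check the DB, scrape only if needed, always write a GeoJSON) can be sketched roughly as follows. This is not the project's code: a plain dict stands in for the MongoDB collection, fake_scrape stands in for the wikimapia scraper, and the record fields (name, lat, lon) as well as all function names are assumptions. The sample point is made up for the demo.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

Point = Dict[str, object]  # assumed record shape: {"name", "lat", "lon"}


def to_feature_collection(points: List[Point]) -> dict:
    """Wrap points in a GeoJSON FeatureCollection (coordinates are [lon, lat])."""
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [p["lon"], p["lat"]]},
                "properties": {"name": p["name"]},
            }
            for p in points
        ],
    }


def export_country(country: str, cache: Dict[str, List[Point]],
                   scrape: Callable[[str], List[Point]], output_path: str) -> dict:
    """Scrape only if the country is not cached yet, then write the GeoJSON."""
    key = country.lower()
    if key not in cache:            # stand-in for "is it already in the DB?"
        cache[key] = scrape(key)    # stand-in for scraping and inserting
    collection = to_feature_collection(cache[key])
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(collection, fh, ensure_ascii=False, indent=2)
    return collection


# Demo with a fake scraper that records how often it is called.
calls: List[str] = []


def fake_scrape(country: str) -> List[Point]:
    calls.append(country)
    return [{"name": "Eiffel Tower", "lat": 48.8584, "lon": 2.2945}]


cache: Dict[str, List[Point]] = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "france.json")
    export_country("France", cache, fake_scrape, path)
    export_country("france", cache, fake_scrape, path)  # second run hits the cache
    with open(path, encoding="utf-8") as fh:
        written = json.load(fh)

print(len(calls))
```

The second call does not scrape again, which mirrors the "in either case" behaviour: the output file is produced whether the data was freshly scraped or already stored.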


Configuration

The configuration files in this project are docker-compose.yml (which will usually work fine without modification) and .env (which should also work as is). You may wish to change the TOR_SWITCH_IP_EVERY field to match the scraping source you've chosen.

  • API
    Scraping with the API may give better info, but may be slower. It requires an IP switch every 3 requests to avoid getting blocked.
  • HTML
    Scraping the HTML gives less information, but requires fewer IP changes. The recommended setting is an IP switch every 20 requests.
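
To illustrate what a "switch every N requests" setting means in practice, here is a small sketch of a request counter that triggers an IP switch once N requests have been made. The IpRotator class and its method names are hypothetical, and the switch itself is a placeholder callback. With Tor, a new circuit is commonly requested by sending a NEWNYM signal to the control port, but how this project does it is not shown here.

```python
from typing import Callable, List


class IpRotator:
    """Hypothetical helper: calls switch_ip after every switch_every requests."""

    def __init__(self, switch_every: int, switch_ip: Callable[[], None]) -> None:
        if switch_every < 1:
            raise ValueError("switch_every must be at least 1")
        self.switch_every = switch_every
        self.switch_ip = switch_ip
        self.requests_since_switch = 0

    def before_request(self) -> None:
        """Call once before each outgoing request."""
        if self.requests_since_switch >= self.switch_every:
            self.switch_ip()                 # placeholder for the real IP switch
            self.requests_since_switch = 0
        self.requests_since_switch += 1


# Demo: with the API setting (3), seven requests cause two switches,
# one before request 4 and one before request 7.
switches: List[int] = []
rotator = IpRotator(switch_every=3, switch_ip=lambda: switches.append(1))
for _ in range(7):
    rotator.before_request()
print(len(switches))
```

In this sketch the value 3 corresponds to the API recommendation above; for HTML scraping the same helper would be created with 20.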
