GeoTextTagger-data

Data collection and formatting tools for the research paper by Brunsting et al.

The data that was used for tests in the paper is included in the Data directory.

The rest of this documentation describes the tools that were used to obtain and format this data. Final output from the formatting is a tab separated text file.

Note that many of the instuctions assume you are working in a Linux environment.

Obtaining Twitter Data

This was done using the StreamFromTwitter program. See that folder for more information. Output from this program can be consolidated with the MergeParquet program.

Formatting Twitter Data

To create a tab-separated file where each line represents a single tweet, use the CreateTweetText program (see that folder for more details).

To create a tab-separated file where each line represents the concatenation of all tweets from a single user, use the CreateTweetTextByUser program (see that folder for more details).

Obtaining Wikipedia Data

Dumps of the English Wikipedia are available at http://dumps.wikimedia.org/enwiki.

The files you need are:

enwiki-*-pages-articles.xml.bz2
enwiki-*-geo_tags.sql.gz

Where * can be replaced with "latest" or the date of the data dump.

On Linux, you can download the files used in our paper with:

wget http://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-articles.xml.bz2 (download 11 GB file with all articles)
wget http://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-geo_tags.sql.gz (download 27 MB file with all geo tags)
gunzip enwiki-20150602-geo_tags.sql.gz (unzip the geo tags file)

Formatting Wikipedia Data

This involves a number of steps, and assumes you have downloaded dumps of Wikipedia articles and geotags (see previous section).

Article Extraction

The first step is to extract the articles from your pages-articles.xml.bz2 file. We did this using software by Giuseppe Attardi. The version we used has been included in this repository (WikiExtractor.py), but the most recent version can be found here: https://github.com/attardi/wikiextractor

There are a number of command line arguments to Attardi's program. We ran the program with python WikiExtractor.py -b 10M -o extracted enwiki-20150602-pages-articles.xml

Note that due to the large number of articles in Wikipedia, this program may take a few days time to complete.

Reformatting Articles

To format the articles using Spark, we must have one article per line. The output of the Article Extraction step above does not meet this requirement, but we can fix that with the following commands:

perl -p -i -e 's/\n/ /g' extracted/*/* (replace all newlines with spaces)
perl -p -i -e 's/<\/doc> /<\/doc>\n/g' extracted/*/* (place a newline at the end of each article)
cat extracted/*/* > extracted/all.txt (place all articles into one file for convenience)

Now we are ready to join the articles with the geotags and create a tab-separated text file. This is done using the CreateWikiText program. See that folder for more information.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GeoTextTagger-data

Obtaining Twitter Data

Formatting Twitter Data

Obtaining Wikipedia Data

Formatting Wikipedia Data

Article Extraction

Reformatting Articles

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
CreateTweetText		CreateTweetText
CreateTweetTextByUser		CreateTweetTextByUser
CreateWikiText		CreateWikiText
Data		Data
MergeParquet		MergeParquet
StreamFromTwitter		StreamFromTwitter
spark-1.3.1		spark-1.3.1
.gitignore		.gitignore
README.md		README.md
WikiExtractor.py		WikiExtractor.py
build_spark.sh		build_spark.sh

sjbrunst/GeoTextTagger-data

Folders and files

Latest commit

History

Repository files navigation

GeoTextTagger-data

Obtaining Twitter Data

Formatting Twitter Data

Obtaining Wikipedia Data

Formatting Wikipedia Data

Article Extraction

Reformatting Articles

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages