Get Started

To submit RDFStats via spark-submit, first clone the repository and compile the project.
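
A minimal sketch of these two steps, assuming a Maven build; <repoURL> and the directory name are placeholders, not values taken from this project:

  # Clone the repository (replace <repoURL> with the actual repository URL).
  git clone <repoURL>
  cd RDFStats   # directory name may differ
  # Assuming a Maven build; use sbt instead if the project is built with sbt.
  mvn clean package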

Load the RDF datasets

Before computing statistics, download the datasets and upload them to HDFS. The following steps should be taken (a quick HDFS verification step is sketched after the list):

  • Download the DBpedia dumps and extract them into a single .nt file per language
    • DBpedia en
    wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/en/
    cat *.nt.bz2 >Dbpedia_en.nt.bz2
    bzip2 -d Dbpedia_en.nt.bz2
    hadoop fs -put Dbpedia_en.nt /<pathToHDFS>/
    • DBpedia de
    wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/de/
    cat *.nt.bz2 >Dbpedia_de.nt.bz2
    bzip2 -d Dbpedia_de.nt.bz2
    hadoop fs -put Dbpedia_de.nt /<pathToHDFS>/
    • DBpedia fr
    wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.9/fr/
    cat *.nt.bz2 >Dbpedia_fr.nt.bz2
    bzip2 -d Dbpedia_fr.nt.bz2
    hadoop fs -put Dbpedia_fr.nt /<pathToHDFS>/
  • Download LinkedGeoData and extract it into a single .nt file
    wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.linkedgeodata.org/releases/2015-11-02/
    cat *.nt.bz2 >LinkedGeoData.nt.bz2
    bzip2 -d LinkedGeoData.nt.bz2
    hadoop fs -put LinkedGeoData.nt /<pathToHDFS>/
  • Generate BSBM datasets
    wget http://downloads.sourceforge.net/project/bsbmtools/bsbmtools/bsbmtools-0.2/bsbmtools-v0.2.zip
    unzip bsbmtools-v0.2.zip
    cd bsbmtools-0.2/
    We generated datasets of the following sizes:
     ./generate -fc -s nt -fn BSBM_2GB -pc 23336
     ./generate -fc -s nt -fn BSBM_20GB -pc 233368
     ./generate -fc -s nt -fn BSBM_50GB -pc 583420
     ./generate -fc -s nt -fn BSBM_100GB -pc 1166840
     ./generate -fc -s nt -fn BSBM_200GB -pc 2333682
    hadoop fs -put BSBM_XGB.nt /<pathToHDFS>/
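
After uploading, a quick sanity check on HDFS can confirm that the files are in place (same <pathToHDFS> placeholder as above):

  # List the uploaded N-Triples files and show their sizes in human-readable form.
  hadoop fs -ls /<pathToHDFS>/
  hadoop fs -du -h /<pathToHDFS>/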

Experiments

  • Distributed Processing on Large-Scale Datasets: run DistLODStats against the datasets and collect the generated statistics. Run one of the following commands (a sketch of the underlying spark-submit call is given after this list):
    • For cluster mode
      ./run_stats.sh Dbpedia_en Iter1
    • For local mode
      ./run_stats-local.sh Dbpedia_en Iter1
  • Scalability
    • Size-up scalability: to measure the size-up scalability of our approach, we run experiments on three different dataset sizes.
    • Node scalability: to measure node scalability, we vary the number of workers in our cluster.
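
The run_stats.sh scripts wrap a spark-submit call. As a rough sketch only, such a submission could look like the following; the master URL, main class, jar name, namenode, and output directory are placeholders and assumptions, not values taken from the scripts:

  # Hypothetical spark-submit invocation; adapt the class, jar, and paths to your build output and cluster.
  spark-submit \
    --master spark://<master-host>:7077 \
    --class <RDFStatsMainClass> \
    target/<rdfstats-assembly>.jar \
    hdfs://<namenode>/<pathToHDFS>/Dbpedia_en.nt \
    <outputDir>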