Any suggestions about build index using Apache-spark? #9

ChenZhongPu · 2015-03-12T01:33:37Z

As the title says, I want to build an R-tree index for spatial data using Apache Spark, with both the input and the output on HDFS.

qadahtm · 2015-03-15T01:42:00Z

Sorry for the delay, I just noticed this.

I am actually working on building multiple types of indexes from HDFS data using Spark. Currently, I have a simple implementation of a Grid index. I am looking to build an R-tree as well, but I have not figured it all out yet.
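A minimal sketch of the grid-index idea, written in plain Python rather than Spark so it runs standalone. The function names (`cell_of`, `build_grid_index`, `range_query`) and the uniform `cell_size` are assumptions for illustration, not code from the repository. In a Spark job, the grouping step could be a map to `(cell, point)` pairs followed by a group-by-key.

```python
from collections import defaultdict


def cell_of(x, y, min_x, min_y, cell_size):
    """Map a point to the (column, row) of the uniform grid cell containing it."""
    return (int((x - min_x) // cell_size), int((y - min_y) // cell_size))


def build_grid_index(points, min_x, min_y, cell_size):
    """Group points by the grid cell they fall into."""
    index = defaultdict(list)
    for x, y in points:
        index[cell_of(x, y, min_x, min_y, cell_size)].append((x, y))
    return index


def range_query(index, min_x, min_y, cell_size, qx1, qy1, qx2, qy2):
    """Return points inside the query rectangle, scanning only overlapping cells."""
    c1, r1 = cell_of(qx1, qy1, min_x, min_y, cell_size)
    c2, r2 = cell_of(qx2, qy2, min_x, min_y, cell_size)
    hits = []
    for c in range(c1, c2 + 1):
        for r in range(r1, r2 + 1):
            # .get avoids creating empty cells in the defaultdict
            for x, y in index.get((c, r), []):
                if qx1 <= x <= qx2 and qy1 <= y <= qy2:
                    hits.append((x, y))
    return hits


# Hypothetical sample data, only to exercise the functions above.
points = [(0.5, 0.5), (1.5, 0.2), (3.2, 3.9), (2.0, 2.0)]
grid = build_grid_index(points, 0.0, 0.0, 1.0)
found = range_query(grid, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 2.0)
```

The grid is a flat lookup structure: a query touches only the cells its rectangle overlaps, then filters the points inside those cells.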

What dataset are you working with?

merlintang · 2015-03-15T02:12:58Z

I have sampled data from HDFS and built a quadtree from the sample.
I think it would not be very difficult to adapt this code to the MR
model for building the quadtree. Contact me if you are interested
in this.
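A rough sketch of the sample-then-partition idea described here, in plain Python so it runs without a cluster. The `QuadNode` class, the `capacity` and `max_depth` parameters, and the random test data are assumptions for illustration, not the code referred to in this comment. The tree is built from a small sample, and its leaves then serve as partitions for the full dataset.

```python
import random


class QuadNode:
    """Point quadtree node; a node with children=None is a leaf."""

    def __init__(self, x1, y1, x2, y2):
        self.box = (x1, y1, x2, y2)
        self.points = []
        self.children = None

    def _child_for(self, p):
        x1, y1, x2, y2 = self.box
        mx, my = (x1 + x2) / 2, (y1 + y2) / 2
        # Child order: lower-left, lower-right, upper-left, upper-right.
        return self.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]

    def insert(self, p, capacity, max_depth=16, depth=0):
        if self.children is not None:
            self._child_for(p).insert(p, capacity, max_depth, depth + 1)
            return
        self.points.append(p)
        # Split an overfull leaf and push its points down one level.
        if len(self.points) > capacity and depth < max_depth:
            x1, y1, x2, y2 = self.box
            mx, my = (x1 + x2) / 2, (y1 + y2) / 2
            self.children = [QuadNode(x1, y1, mx, my), QuadNode(mx, y1, x2, my),
                             QuadNode(x1, my, mx, y2), QuadNode(mx, my, x2, y2)]
            pending, self.points = self.points, []
            for q in pending:
                self._child_for(q).insert(q, capacity, max_depth, depth + 1)

    def leaf_for(self, p):
        node = self
        while node.children is not None:
            node = node._child_for(p)
        return node

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]
sample = random.sample(data, 100)

# Build the tree from the sample only.
root = QuadNode(0.0, 0.0, 1.0, 1.0)
for p in sample:
    root.insert(p, capacity=10)

# Route every record of the full dataset to the leaf that covers it.
partitions = {}
for p in data:
    partitions.setdefault(root.leaf_for(p).box, []).append(p)
```

In a MapReduce or Spark setting, the routing loop at the end would be the map step (emit the leaf box as the key), and each group could then be indexed independently.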

Meanwhile, what is the relationship between this work and building
SP-GIST via MapReduce? Some other students are working on that now.

qadahtm · 2015-03-15T03:52:00Z

Hi Mingjie,

SP-GIST is a framework for building space-partitioning trees, and I believe that the other students are trying to realize the framework and make it extensible. To my understanding, the main difference is that they are focusing on the framework and how it can be used to realize multiple space-partitioning search trees, while I focus on standalone Spark applications that will build spatial indexes for my datasets. I need to build an index for 60 GB to ~500 GB of data, and maybe more.

For example, you cannot use SP-GIST to realize an R-Tree index or its variants because they are not space partitioning search trees. The limitation also applies to the Grid index because the grid is not actually a search tree.

I will definitely seek your advice when I get to the quad-tree implementation if I need to.

ChenZhongPu · 2015-03-16T00:27:23Z

I know that there is a project for spatial data based on Hadoop MapReduce, http://spatialhadoop.cs.umn.edu/, and I would like to port it to Apache Spark. Suppose the input file in HDFS is a point or rectangle dataset; I want to build an R-tree index as SpatialHadoop does.

qadahtm · 2015-03-18T05:05:29Z

@ChenZhongPu

Have you tried calling the existing map/reduce functions in SpatialHadoop from Spark? It depends on how they are written (I have not looked at the code myself), but you may need to modify the code a little.

If you just need to build an R-tree index, this is one way to go. Otherwise, you will need to come up with your own approach (SparkApp) of building the R-tree index using Spark.

I am also using datasets from the SpatialHadoop project but I will probably have my own approach.

I will let you know if I get something working.

BTW, @ChenZhongPu, what is your affiliation?

Regards,

ChenZhongPu · 2015-03-18T06:37:10Z

@qadahtm

I have found a nice Java library, http://www.vividsolutions.com/jts/javadoc/index.html, for building an R-tree index using STR (Sort-Tile-Recursive) packing.
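For context, a small sketch of how STR bulk loading works, in plain Python. This is not the JTS implementation; `str_pack`, `build_str_tree`, `search`, and the node layout `(mbr, children)` are names and choices assumed here for illustration. The idea is to sort rectangles by center x, cut them into vertical slices, sort each slice by center y, pack runs of `capacity` rectangles into nodes, and repeat on the node bounding boxes until one root remains.

```python
import math


def mbr(rects):
    """Minimum bounding rectangle of a list of (x1, y1, x2, y2) rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def intersects(a, b):
    return a[0] <= b[2] and a[2] >= b[0] and a[1] <= b[3] and a[3] >= b[1]


def str_pack(entries, capacity):
    """Pack (rect, children) entries into parent nodes of at most `capacity` entries."""
    n_nodes = math.ceil(len(entries) / capacity)
    n_slices = math.ceil(math.sqrt(n_nodes))
    slice_size = n_slices * capacity
    by_x = sorted(entries, key=lambda e: (e[0][0] + e[0][2]) / 2)
    nodes = []
    for i in range(0, len(by_x), slice_size):
        # Within one vertical slice, order by center y and tile into nodes.
        by_y = sorted(by_x[i:i + slice_size], key=lambda e: (e[0][1] + e[0][3]) / 2)
        for j in range(0, len(by_y), capacity):
            group = by_y[j:j + capacity]
            nodes.append((mbr([e[0] for e in group]), group))
    return nodes


def build_str_tree(rects, capacity=4):
    """Bulk-load bottom-up; returns the root node as (mbr, children)."""
    if not rects or capacity < 2:
        raise ValueError("need at least one rectangle and capacity >= 2")
    level = [(r, None) for r in rects]  # data entries carry children=None
    while True:
        level = str_pack(level, capacity)
        if len(level) == 1:
            return level[0]


def search(node, query):
    """Return all data rectangles that intersect the query rectangle."""
    box, children = node
    if not intersects(box, query):
        return []
    if children is None:
        return [box]
    found = []
    for child in children:
        found.extend(search(child, query))
    return found


# Hypothetical data: sixteen small squares on a 4 x 4 lattice.
rects = [(x, y, x + 0.5, y + 0.5) for x in range(4) for y in range(4)]
tree = build_str_tree(rects, capacity=4)
result = search(tree, (0, 0, 1.2, 1.2))
```

Because the packing needs globally sorted input, a distributed version would likely have to partition the data first and pack each partition separately, which is a design question rather than something this sketch settles.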

For a very big dataset, is it recommended to build the index in parallel using Apache Spark? I also want to save the index to a file in HDFS for later use. Since I am not very familiar with RDD operations, I would appreciate any further hints, tips, or demo code.

See more at http://stackoverflow.com/questions/29113702/strtree-in-jts-topology-suite-bulk-load-data-and-build-index.

Lastly, I am just a CS college student in China, @qadahtm.

qadahtm · 2015-03-18T23:17:20Z

@ChenZhongPu
Thanks for the links.

You may want to check my other repository on indexing using Spark https://github.com/qadahtm/SpatialSparkIndexer/blob/master/src/main/scala/qadahtm/OSMGridIndex.scala for some hints. I only have the code for the Grid-based index for now. This is still a work in progress but I hope it can help you.

Regards,
Thamir

ChenZhongPu · 2015-03-19T06:41:25Z

@qadahtm,
Please have a look at my question, http://stackoverflow.com/questions/29138425/apache-spark-method-in-foreach-doesnt-work. Thanks in advance.
