Any suggestions about build index using Apache-spark? #9

Open
ChenZhongPu opened this issue Mar 12, 2015 · 8 comments

@ChenZhongPu

As the title says, I want to build an R-tree index for spatial data using Apache Spark, with both the input and the output on HDFS.

@qadahtm
Owner

qadahtm commented Mar 15, 2015

Sorry for the delay, I just noticed this.

I am actually working on building multiple types of indexes from HDFS data using Spark. Currently, I have a simple implementation of a Grid index. I am looking to build an R-tree as well, but I have not figured it all out yet.
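To make the Grid index idea concrete, here is a minimal single-machine sketch in plain Python. It is not the code from this repository: the names `cell_of`, `build_grid_index`, and `range_query` and the uniform `cell_size` are assumptions for illustration. In a Spark job, the per-point cell assignment would presumably be a map to `(cell, point)` pairs followed by a grouping step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]


def cell_of(p: Point, min_x: float, min_y: float, cell_size: float) -> Cell:
    """Return the (column, row) of the uniform grid cell containing p."""
    x, y = p
    return (int((x - min_x) // cell_size), int((y - min_y) // cell_size))


def build_grid_index(points: Iterable[Point], min_x: float, min_y: float,
                     cell_size: float) -> Dict[Cell, List[Point]]:
    """Group points by grid cell (hypothetical helper, not a Spark API)."""
    grid: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        grid[cell_of(p, min_x, min_y, cell_size)].append(p)
    return dict(grid)


def range_query(grid: Dict[Cell, List[Point]], min_x: float, min_y: float,
                cell_size: float, qx1: float, qy1: float,
                qx2: float, qy2: float) -> List[Point]:
    """Return points inside the query rectangle, visiting only overlapping cells."""
    c1, r1 = cell_of((qx1, qy1), min_x, min_y, cell_size)
    c2, r2 = cell_of((qx2, qy2), min_x, min_y, cell_size)
    hits: List[Point] = []
    for c in range(c1, c2 + 1):
        for r in range(r1, r2 + 1):
            for x, y in grid.get((c, r), []):
                if qx1 <= x <= qx2 and qy1 <= y <= qy2:
                    hits.append((x, y))
    return hits
```

The grid is a flat map from cell to points rather than a tree, which keeps both the build step and the lookup simple.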

What dataset are you working with?

@merlintang
Collaborator

I have sampled data from HDFS and built a quadtree from the sample. I think it would not be very difficult to adapt this code to build the quadtree under the MapReduce model. Contact me if you are interested in this.
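As a rough illustration of the sample-then-build approach, the sketch below builds a point quadtree from a random sample and then uses its leaves as partitions for the full dataset. This is not the code referred to above; `QuadNode`, `capacity`, `max_depth`, and `build_from_sample` are assumed names, and everything runs on a single machine.

```python
import random
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class QuadNode:
    x1: float
    y1: float
    x2: float
    y2: float
    points: List[Point] = field(default_factory=list)
    children: List["QuadNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def child_for(node: QuadNode, p: Point) -> QuadNode:
    """Pick the quadrant of an internal node that contains p."""
    mx, my = (node.x1 + node.x2) / 2, (node.y1 + node.y2) / 2
    return node.children[(1 if p[0] >= mx else 0) + (2 if p[1] >= my else 0)]


def split(node: QuadNode) -> None:
    """Create four equal quadrants under node."""
    mx, my = (node.x1 + node.x2) / 2, (node.y1 + node.y2) / 2
    node.children = [
        QuadNode(node.x1, node.y1, mx, my), QuadNode(mx, node.y1, node.x2, my),
        QuadNode(node.x1, my, mx, node.y2), QuadNode(mx, my, node.x2, node.y2),
    ]


def insert(node: QuadNode, p: Point, capacity: int, max_depth: int,
           depth: int = 0) -> None:
    """Insert p, splitting a leaf once it holds more than `capacity` points."""
    if not node.is_leaf():
        insert(child_for(node, p), p, capacity, max_depth, depth + 1)
        return
    node.points.append(p)
    if len(node.points) > capacity and depth < max_depth:
        split(node)
        pending, node.points = node.points, []
        for q in pending:
            insert(child_for(node, q), q, capacity, max_depth, depth + 1)


def leaf_for(node: QuadNode, p: Point) -> QuadNode:
    """Descend to the leaf (partition) that a full-dataset point belongs to."""
    while not node.is_leaf():
        node = child_for(node, p)
    return node


def leaves(node: QuadNode) -> List[QuadNode]:
    if node.is_leaf():
        return [node]
    return [leaf for c in node.children for leaf in leaves(c)]


def build_from_sample(data: Sequence[Point], sample_size: int,
                      bounds: Tuple[float, float, float, float],
                      capacity: int, max_depth: int = 8,
                      seed: int = 0) -> QuadNode:
    """Build the tree from a random sample only; bounds must cover all data."""
    sample = random.Random(seed).sample(data, min(sample_size, len(data)))
    root = QuadNode(*bounds)
    for p in sample:
        insert(root, p, capacity, max_depth)
    return root
```

Because the tree is built from the sample only, it stays small enough to ship to every worker, and each worker can call `leaf_for` to decide which partition a record belongs to.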

Meanwhile, what is the relationship between this work and building SP-GIST via MapReduce? Some other students are working on that now.


@qadahtm
Owner

qadahtm commented Mar 15, 2015

Hi Mingjie,

SP-GIST is a framework for building space-partitioning trees, and I believe the other students are trying to implement the framework and make it extensible. To my understanding, the main difference is that they are focusing on the framework and how it can be used to realize multiple space-partitioning search trees, while I focus on standalone Spark applications that build spatial indexes for my datasets. I need to build an index for 60GB to ~500GB of data, and maybe more.

For example, you cannot use SP-GIST to realize an R-tree index or its variants because they are not space-partitioning search trees. The same limitation applies to the Grid index because the grid is not actually a search tree.

I will definitely seek your advice when I get to the quad-tree implementation if I need to.

@ChenZhongPu
Author

I know that there is a project for spatial data based on Hadoop MapReduce, SpatialHadoop (http://spatialhadoop.cs.umn.edu/), and I would like to port it to Apache Spark. Suppose the input file in HDFS is a point or rectangle dataset; I want to build an R-tree index the way SpatialHadoop does.

@qadahtm
Owner

qadahtm commented Mar 18, 2015

@ChenZhongPu

Have you tried calling SpatialHadoop's existing map/reduce functions from Spark? It depends on how they are written (I have not looked at the code myself), but you may need to modify the code a little.

If you just need to build an R-tree index, this is one way to go. Otherwise, you will need to come up with your own approach (SparkApp) of building the R-tree index using Spark.

I am also using datasets from the SpatialHadoop project but I will probably have my own approach.

I will let you know if I get something working.

BTW, @ChenZhongPu, what is your affiliation?

Regards,

@ChenZhongPu
Author

@qadahtm

I have found a nice Java library, JTS (http://www.vividsolutions.com/jts/javadoc/index.html), for building an R-tree index using STR.

For a very big dataset, is it recommended to build the index in parallel using Apache Spark? I also want to save the index to a file in HDFS for later use. Since I am not very familiar with RDD operations, I would appreciate any hints, tips, or demo code.
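For readers unfamiliar with STR (Sort-Tile-Recursive) bulk loading, the idea can be sketched in a few lines of plain Python. This is the general packing scheme, not the JTS implementation: `str_pack`, `build_levels`, and the `capacity` parameter are names assumed for illustration, and the sketch only computes the bounding rectangles of each level rather than a queryable tree.

```python
import math
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr(rects: List[Rect]) -> Rect:
    """Minimum bounding rectangle of a non-empty group of rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))


def str_pack(rects: List[Rect], capacity: int) -> List[List[Rect]]:
    """Group rectangles into nodes of at most `capacity` entries using STR."""
    if capacity < 2:
        raise ValueError("capacity must be at least 2")
    if not rects:
        return []
    node_count = math.ceil(len(rects) / capacity)
    slice_count = math.ceil(math.sqrt(node_count))
    slice_size = slice_count * capacity
    # Sort by x-center and cut into vertical slices.
    by_x = sorted(rects, key=lambda r: (r[0] + r[2]) / 2)
    nodes: List[List[Rect]] = []
    for i in range(0, len(by_x), slice_size):
        # Within each slice, sort by y-center and pack runs of `capacity`.
        by_y = sorted(by_x[i:i + slice_size], key=lambda r: (r[1] + r[3]) / 2)
        for j in range(0, len(by_y), capacity):
            nodes.append(by_y[j:j + capacity])
    return nodes


def build_levels(rects: List[Rect], capacity: int) -> List[List[Rect]]:
    """Return node MBRs per level, from the leaf level up to a single root."""
    levels: List[List[Rect]] = []
    current = rects
    while True:
        current = [mbr(node) for node in str_pack(current, capacity)]
        levels.append(current)
        if len(current) <= 1:
            return levels
```

Since each level is produced by sorting and slicing, one plausible way to parallelize it is to partition the data spatially first and pack each partition independently.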

See more at http://stackoverflow.com/questions/29113702/strtree-in-jts-topology-suite-bulk-load-data-and-build-index.

Lastly, I am just a CS college student in China, @qadahtm.

@qadahtm
Owner

qadahtm commented Mar 18, 2015

@ChenZhongPu
Thanks for the links.

You may want to check my other repository on indexing using Spark, https://github.com/qadahtm/SpatialSparkIndexer/blob/master/src/main/scala/qadahtm/OSMGridIndex.scala, for some hints. I only have the code for the Grid-based index for now. It is still a work in progress, but I hope it can help you.

Regards,
Thamir

@ChenZhongPu
Author

@qadahtm,
Please have a look at my question: http://stackoverflow.com/questions/29138425/apache-spark-method-in-foreach-doesnt-work. Thanks in advance.
