Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Availability of Spatial Indexes #17

Open
rexdouglass opened this issue Jun 8, 2020 · 3 comments
Open

Availability of Spatial Indexes #17

rexdouglass opened this issue Jun 8, 2020 · 3 comments

Comments

@rexdouglass
Copy link

rexdouglass commented Jun 8, 2020

I'm experimenting with geospark and find the spatial joins slower than expected.

I've set geospark.join.gridtype to "kdbtree" in my configuration below.

Is there something else I need to do to enable or use spatial indexes when creating, saving, or joining on parquets with a geom column?

library(tidyverse)
library(sparklyr)
library(geospark)
conf <- spark_config()

conf$`sparklyr.cores.local` <- 48
conf$spark.serializer <- "org.apache.spark.serializer.KryoSerializer"
conf$spark.kryo.registrator <- "org.datasyslab.geospark.serde.GeoSparkKryoRegistrator"
conf$spark.kryoserializer.buffer.max <- "2047MB" #Caused by: java.lang.IllegalArgumentException: spark.kryoserializer.buffer.max must be less than 2048 mb, got: + 10240 mb.
conf$geospark.join.gridtype <- "kdbtree"

conf$spark.driver.maxResultSize <- "30G"
conf$spark.memory.fraction <- 0.9
conf$spark.executor.heartbeatInterval <-"6000000s"# "10000000s"
conf$spark.network.timeout <- "6000001s"
conf$spark.local.dir <- "/media/skynet2/905884f0-7546-4273-9061-12a790830beb/spark_temp/"
conf$spark.worker.cleanup.enabled <- "true"
conf$"sparklyr.shell.driver-memory"= "300G"
conf$'spark.driver.maxResultSize' <- 0 #0 is ulimmited
sc <- spark_connect(master = "local", config = conf,
                    version = "2.3.3" #for geospark
) 
sc <- register_gis(sc)
@rexdouglass
Copy link
Author

rexdouglass commented Jun 9, 2020

Another theory I have is the spatial join is keeping copies of both the A and B geoms. When the B geom is large, it's making many redundant unnecessary copies of it. Setting keep=F doesn't alter the behavior

E.g.
d3 <- d1 %>% left_join(d2 %>% rename(geom_b=geom), sql_on = sql("ST_Within(geom_a, geom_b)"), keep=F) %>% dplyr::select(-geom_b,-geom_a)

New experiment, rewriting the sql to remove the geoms in the first select as opposed to after the fact does speed up things but not more than 5%. Most of the overhead is still somewhere else.

sql <- "SELECT `LHS`.`osm_id` AS `osm_id`, `RHS`.`geoboundaries_adm0_shapeID_noversion` 
FROM `d1` AS `LHS`
LEFT JOIN (SELECT `geoboundaries_adm0_shapeName`, `geoboundaries_adm0_shapeID_noversion`, `geoboundaries_adm0_shapeISO`, `geom` AS `geom_b`
FROM `d2`) `RHS`
ON (ST_Within(geom_a, geom_b))"

d3 <- sql %>%
  dbplyr::sql() %>%
  tbl(sc, .)

@rexdouglass
Copy link
Author

rexdouglass commented Jun 9, 2020

My last experiment improved performance into ranges I was expecting, the right hand side polygons are very large and detailed (national boundaries).

Hitting them with mutate(geom=ST_SimplifyPreserveTopology(geom,.001)) increased the speed of the join into the sublinear range in a way that makes me think that it's using a spatial index though I might still not have that configured correctly.

Setting the simplification threshold to 0.1 or 0.01 looked visually ok but threw "com.vividsolutions.jts.geom.TopologyException: side location conflict" errors when used in the join.

I hope this is useful to others, and that there are other performance tuning steps I missed that people can recommend.

Even with the simplified polygons, my left join of 480k points within 198 possible country polygons looks like it's going to take 6 hours on a 48 thread machine.

@harryprince
Copy link
Owner

harryprince commented Jul 26, 2020

@rexdouglass it seems you do not turn on the spatial index function.

notice: the bigger table should be in the first place, and the small one follows.

this function works well on our production environment like this:

  SELECT fence_id,top_poi_id,fence_df.geoshape as fence_geoshape,roadid,roadname,roaduid,wkt,'1' as is_isolation 
        FROM road_df,fence_df  
        WHERE  ST_Intersects(road_df.geoshape,fence_df.geoshape) 

here is another tip: ST_Intersects is better than ST_Within and ST_Contains when you do not clearly know the order between two geometries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants