Preserve Spatial Partitioning From RDD to Dataframe #1268
Comments
@jwass Is there a reason why you want to use the Sedona RDD-based spatial partitioning? This is considered a low-level API and is only used for spatial joins. Most importantly, given polygon data, the spatially partitioned RDD will contain duplicates, because some polygons cross the boundaries of multiple partitions and we copy those to every overlapping partition. Our spatial join algorithm automatically de-dups after getting the join result.
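A minimal sketch of what that duplication looks like in practice (assuming an existing `sedona` session and a polygon dataset at a hypothetical path `polygons.parquet`):

```python
from sedona.spark import Adapter, GridType

df = sedona.read.format("geoparquet").load("polygons.parquet")  # hypothetical input
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE)

# Polygons straddling a grid boundary are copied to every overlapping
# partition, so the partitioned count can exceed the raw count.
print(rdd.rawSpatialRDD.count())
print(rdd.spatialPartitionedRDD.count())
```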
@jiayuasu What I really want to do is write out a large GeoParquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve the performance of remote spatial queries by bounding box. We have some solutions now that split by geohash/quadkey, but a partitioning scheme backed by a KDB-tree / R-tree / etc. would be better. The fact that polygons' extents will cause the spatial partitions to overlap is fine, but we do need to assign each row to only one partition. I was hoping there was a way to use …
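For context, the geohash workaround mentioned above looks roughly like this (a sketch, not the eventual solution; the input path, precision `5`, and column name are illustrative):

```python
from pyspark.sql import functions as F

df = sedona.read.format("geoparquet").load("some_dataset.parquet")  # hypothetical input

# Tag each row with a geohash of its geometry, then cluster rows by it so
# that nearby features land in the same parquet files.
(df.withColumn("geohash", F.expr("ST_GeoHash(geometry, 5)"))
   .repartitionByRange("geohash")
   .sortWithinPartitions("geohash")
   .drop("geohash")
   .write.format("geoparquet")
   .save("dataset_by_geohash"))
```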
Just working on this now! For myself as I work on this, a self-contained example.

Get a geoparquet:

```bash
curl -L https://github.com/MrPowers/sedona-examples/raw/refs/heads/main/data/ne_cities.parquet \
  -o cities.parquet
```

Set up the session:

```python
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
```

Partition an RDD and try to write it:

```python
from sedona.spark import Adapter, GridType

df = sedona.read.format("geoparquet").load("cities.parquet")
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)
df2 = Adapter.toDf(rdd, sedona)
df2.write.format("geoparquet").save("cities_maybe_partitioned")
```
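A quick check of where the partitioning is (or isn't) surviving, comparing the RDD's partition count to the DataFrame's (a sketch appended to the example above):

```python
# The spatially partitioned RDD should report roughly 6 partitions; if the
# converted DataFrame reports a different number, the partitioning was not
# carried over by Adapter.toDf().
print(rdd.spatialPartitionedRDD.getNumPartitions())
print(df2.rdd.getNumPartitions())
```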
I added an official version of this in #1751, but a version you can play with today is:

```python
import geopandas
from sedona.core.geom.envelope import Envelope
from py4j.java_gateway import get_method

jvm_p = rdd.getPartitioner().jvm_partitioner
jvm_grids = get_method(jvm_p, "getGrids")()
number_of_grids = jvm_grids.size()
envelopes = [
    Envelope.from_jvm_instance(jvm_grids[index])
    for index in range(number_of_grids)
]
geopandas.GeoSeries(envelopes).plot(edgecolor="black", facecolor="none")
```
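Building on that, one way to get the "each row in exactly one partition" behavior @jwass asked for is to tag rows with a grid-cell id derived from those envelopes and partition the write on it. This is only a sketch, not Sedona API; the centroid rule, output path, and column names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Broadcast the grid envelopes and assign each row to the first cell that
# contains its geometry's centroid, so every row gets exactly one cell id.
cells = [(e.minx, e.maxx, e.miny, e.maxy) for e in envelopes]
b_cells = sedona.sparkContext.broadcast(cells)

@F.udf(IntegerType())
def cell_id(x, y):
    for i, (minx, maxx, miny, maxy) in enumerate(b_cells.value):
        if minx <= x <= maxx and miny <= y <= maxy:
            return i
    return len(b_cells.value)  # overflow cell for geometries outside all grids

tagged = (df
    .withColumn("cx", F.expr("ST_X(ST_Centroid(geometry))"))
    .withColumn("cy", F.expr("ST_Y(ST_Centroid(geometry))"))
    .withColumn("cell", cell_id("cx", "cy"))
    .drop("cx", "cy"))

tagged.write.partitionBy("cell").format("geoparquet").save("cities_by_cell")
```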
Is there a way to spatially partition a DataFrame and write it out using that partitioning scheme (presumably by converting to/from a spatial RDD)? This is my guess as to how to accomplish this, but I'm not sure if I'm misunderstanding things... I'm also relatively new to working with Spark and Sedona.
Expected behavior
Loading a DataFrame, converting it to an RDD, spatially partitioning it, converting back to a DataFrame, and saving the result: I'd expect the final DataFrame's partitioning to be preserved from the RDD.
Actual behavior
Adapter.toDf() does not preserve partitioning - or I'm doing something else wrong.
Steps to reproduce the problem
I tried the self-contained example above (partition the RDD with a KDB-tree into 6 partitions, convert back with Adapter.toDf(), and write), but it looked like that doesn't work: the number of partitions written for df2 was far greater than 6.
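To make "far greater than 6" concrete, a sanity check of the written output (a sketch for a local filesystem; paths would differ on Databricks):

```python
import glob

# With 6 spatial partitions we would expect roughly 6 part files.
print(len(glob.glob("cities_maybe_partitioned/part-*")))
```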
Settings
Sedona version = 1.5.1
Apache Spark version = ?
API type = Python
Python version = ?
Environment = Databricks