Preserve Spatial Partitioning From RDD to Dataframe #1268
Comments
@jwass Is there a reason why you want to use the Sedona RDD-based spatial partitioning? This is considered a low-level API and is only used for spatial joins. Most importantly, given polygon data, the spatially partitioned RDD will contain duplicates, because some polygons cross the boundaries of multiple partitions and we copy those to every overlapping partition. Our spatial join algorithm automatically de-dups after getting the join result.
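A minimal sketch of what that duplication looks like in practice (assuming an existing `sedona` session and a polygon dataset at a hypothetical path `polygons.parquet`):

```python
from sedona.spark import Adapter, GridType

df = sedona.read.format("geoparquet").load("polygons.parquet")  # hypothetical input
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE)

# Polygons straddling a grid boundary are copied to every overlapping
# partition, so the partitioned count can exceed the raw count.
print(rdd.rawSpatialRDD.count())
print(rdd.spatialPartitionedRDD.count())
```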
@jiayuasu What I really want to do is write out a large GeoParquet dataset where the individual parquet files are spatially partitioned intelligently. This will improve the performance of remote spatial queries by bounding box. We have some solutions now that split by geohash/quadkey, but a partitioning scheme backed by a KDB-tree / R-tree / etc. would be better. The fact that polygons' extents will cause the spatial partitions to overlap is fine, but we do need to assign each row to only one partition. I was hoping there was a way to use …
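For context, the geohash workaround mentioned above looks roughly like this (a sketch, not the eventual solution; the input path, precision `5`, and column name are illustrative):

```python
from pyspark.sql import functions as F

df = sedona.read.format("geoparquet").load("some_dataset.parquet")  # hypothetical input

# Tag each row with a geohash of its geometry, then cluster rows by it so
# that nearby features land in the same parquet files.
(df.withColumn("geohash", F.expr("ST_GeoHash(geometry, 5)"))
   .repartitionByRange("geohash")
   .sortWithinPartitions("geohash")
   .drop("geohash")
   .write.format("geoparquet")
   .save("dataset_by_geohash"))
```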
Just working on this now! For myself as I work on this, a self-contained example.

Get a geoparquet:

```bash
curl -L https://github.com/MrPowers/sedona-examples/raw/refs/heads/main/data/ne_cities.parquet \
  -o cities.parquet
```

Set up the session:

```python
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)
```

Partition an RDD and try to write it:

```python
from sedona.spark import Adapter, GridType

df = sedona.read.format("geoparquet").load("cities.parquet")
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)
df2 = Adapter.toDf(rdd, sedona)
df2.write.format("geoparquet").save("cities_maybe_partitioned")
```
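A quick check of where the partitioning is (or isn't) surviving, comparing the RDD's partition count to the DataFrame's (a sketch appended to the example above):

```python
# The spatially partitioned RDD should report roughly 6 partitions; if the
# converted DataFrame reports a different number, the partitioning was not
# carried over by Adapter.toDf().
print(rdd.spatialPartitionedRDD.getNumPartitions())
print(df2.rdd.getNumPartitions())
```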
I added an official version of this in #1751, but a version you can play with today is:

```python
import geopandas
from sedona.core.geom.envelope import Envelope
from py4j.java_gateway import get_method

jvm_p = rdd.getPartitioner().jvm_partitioner
jvm_grids = get_method(jvm_p, "getGrids")()
number_of_grids = jvm_grids.size()
envelopes = [
    Envelope.from_jvm_instance(jvm_grids[index])
    for index in range(number_of_grids)
]
geopandas.GeoSeries(envelopes).plot(edgecolor="black", facecolor="none")
```
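Building on that, one way to get the "each row in exactly one partition" behavior @jwass asked for is to tag rows with a grid-cell id derived from those envelopes and partition the write on it. This is only a sketch, not Sedona API; the centroid rule, output path, and column names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Broadcast the grid envelopes and assign each row to the first cell that
# contains its geometry's centroid, so every row gets exactly one cell id.
cells = [(e.minx, e.maxx, e.miny, e.maxy) for e in envelopes]
b_cells = sedona.sparkContext.broadcast(cells)

@F.udf(IntegerType())
def cell_id(x, y):
    for i, (minx, maxx, miny, maxy) in enumerate(b_cells.value):
        if minx <= x <= maxx and miny <= y <= maxy:
            return i
    return len(b_cells.value)  # overflow cell for geometries outside all grids

tagged = (df
    .withColumn("cx", F.expr("ST_X(ST_Centroid(geometry))"))
    .withColumn("cy", F.expr("ST_Y(ST_Centroid(geometry))"))
    .withColumn("cell", cell_id("cx", "cy"))
    .drop("cx", "cy"))

tagged.write.partitionBy("cell").format("geoparquet").save("cities_by_cell")
```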
Is there a way to spatially partition a DataFrame and write it out using that partitioning scheme (presumably by converting to/from a spatial RDD)? This is my guess as to how to accomplish this, but I'm not sure if I'm misunderstanding things... I'm also relatively new to working with Spark and Sedona.
Expected behavior
Loading a DataFrame, converting it to an RDD, spatially partitioning it, converting back to a DataFrame, and saving the result: I'd expect the final DataFrame's partitioning to be preserved from the RDD.
Actual behavior
Adapter.toDf() does not preserve partitioning - or I'm doing something else wrong.
Steps to reproduce the problem
I tried the self-contained example above (partition the RDD with a KDB-tree into 6 partitions, convert back with Adapter.toDf(), and write), but it looked like that doesn't work: the number of partitions written for df2 was far greater than 6.
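To make "far greater than 6" concrete, a sanity check of the written output (a sketch for a local filesystem; paths would differ on Databricks):

```python
import glob

# With 6 spatial partitions we would expect roughly 6 part files.
print(len(glob.glob("cities_maybe_partitioned/part-*")))
```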
Settings
Sedona version = 1.5.1
Apache Spark version = ?
API type = Python
Python version = ?
Environment = Databricks