
[WIP] Expose spatial partitioning from SpatialRDD #1751

Closed

Conversation

paleolimbot
Member

@paleolimbot paleolimbot commented Jan 10, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

Closes #1268.

What changes were proposed in this PR?

This PR exposes spatial partitioning information from the SpatialRDD API. Sedona is exceptionally good at this and the spatial community would love to have access to this information!

There are two pieces of information that would be helpful:

  • The actual boundaries
  • A partitioned RDD that remembers the partition identifier (i.e., partitioned results).
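To make those two pieces concrete, here is a small self-contained Python sketch (the `Envelope` class and `assign_partitions` helper are hypothetical illustrations, not the Sedona API): the boundaries are a list of envelopes, and the partition identifier for a row is the index of the envelope that covers it. A geometry sitting on a shared edge matches more than one envelope, which is where duplicates come from.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    """Axis-aligned bounding box of one spatial partition (hypothetical)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def contains(self, x: float, y: float) -> bool:
        return self.xmin <= x <= self.xmax and self.ymin <= y <= self.ymax

def assign_partitions(x, y, boundaries):
    """Ids of every partition whose envelope covers the point (x, y)."""
    return [i for i, env in enumerate(boundaries) if env.contains(x, y)]

# Two partitions sharing the edge x = 5:
grids = [Envelope(0, 0, 5, 10), Envelope(5, 0, 10, 10)]
print(assign_partitions(2, 3, grids))  # [0]
print(assign_partitions(5, 3, grids))  # [0, 1] -> a point on the edge is duplicated
```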

There are a few ideas in this PR. The boundaries seem straightforward, but I'm a little too new to the RDD API to know what the options are for returning the partitioned results.

How was this patch tested?

Working on it!

Did this PR include necessary documentation updates?

  • Yes, I am adding a new API. I am using the current SNAPSHOT version number in vX.Y.Z format.
  • Yes, I have updated the documentation. (Or will when the API is settled)

@paleolimbot (Member Author) left a comment

This would also benefit from a SpatialPartitioner that removes duplicates (perhaps by wrapping an existing SpatialPartitioner, consuming the result of placeObject, and deterministically choosing one of the results), since duplicates are usually not desired when partitioning.
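A minimal sketch of that wrapping idea, assuming a hypothetical partitioner whose `place()` returns every matching partition id (mirroring placeObject): the wrapper keeps only the smallest id, so each geometry lands in exactly one partition and the choice is deterministic.

```python
class DedupPartitioner:
    """Wrap a partitioner whose place(geom) may return several partition ids
    (for geometries crossing partition boundaries) and deterministically keep
    one, so every input geometry is emitted exactly once. Hypothetical sketch,
    not the Sedona SpatialPartitioner API."""

    def __init__(self, inner):
        self.inner = inner

    def place(self, geom):
        candidates = self.inner.place(geom)
        # min() is an arbitrary but deterministic tie-break
        return min(candidates)

class FakeGridPartitioner:
    """Stand-in for a real partitioner: two x-ranges sharing the edge x = 5,
    so a point on that edge matches both partitions."""
    def place(self, geom):
        x, y = geom
        return [i for i, (lo, hi) in enumerate([(0, 5), (5, 10)]) if lo <= x <= hi]

part = DedupPartitioner(FakeGridPartitioner())
print(part.place((5, 3)))  # inner returns [0, 1] for the edge point; wrapper keeps 0
print(part.place((7, 1)))  # unambiguous point: 1
```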

@paleolimbot
Member Author

Ok! This seems to work:

import os
import pyspark
from sedona.spark import SedonaContext
if "SPARK_HOME" in os.environ:
    del os.environ["SPARK_HOME"]
pyspark_version = pyspark.__version__[:pyspark.__version__.rfind(".")]

config = (
    SedonaContext.builder()
    .config(
        "spark.jars",
        "spark-shaded/target/sedona-spark-shaded-3.3_2.12-1.7.1-SNAPSHOT.jar",
    )
    .config(
        "spark.jars.packages",
        "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    )
    .config(
        "spark.jars.repositories",
        "https://artifacts.unidata.ucar.edu/repository/unidata-all",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
from sedona.spark import Adapter, GridType

!rm -rf cities_maybe_partitioned

df = sedona.read.format("geoparquet").load("cities.parquet")
rdd = Adapter.toSpatialRdd(df, "geometry")
rdd.analyze()
rdd.spatialPartitioning(GridType.KDBTREE, num_partitions=6)

df2 = Adapter.toDfPartitioned(rdd, sedona)
df2.write.format("geoparquet").save("cities_maybe_partitioned")

!ls cities_maybe_partitioned/*.parquet
#> cities_maybe_partitioned/part-00000-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00001-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00002-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00003-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00004-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00005-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00006-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
#> cities_maybe_partitioned/part-00007-809113f8-6f63-4763-bbd0-3ba609efcdfd-c000.snappy.parquet
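Note that the requested num_partitions is a target, not a guarantee: the listing above shows eight output files for six requested partitions. A much-simplified sketch of the median-split idea behind a KDB-tree-style partitioner (not Sedona's implementation) shows why the leaf count depends on the data rather than matching the request exactly:

```python
def kdb_split(points, max_per_part, depth=0):
    """Simplified KDB-tree-style split: recursively divide the points at the
    median, alternating between the x and y axes, until every leaf holds at
    most max_per_part points. Illustration only, not Sedona's algorithm."""
    if len(points) <= max_per_part:
        return [points]
    axis = depth % 2  # alternate x (0) and y (1)
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (kdb_split(pts[:mid], max_per_part, depth + 1)
            + kdb_split(pts[mid:], max_per_part, depth + 1))

points = [(x, x % 7) for x in range(100)]
parts = kdb_split(points, max_per_part=20)
print(len(parts))  # 8 leaves, even though a user might have "asked" for 6
```

The leaf count falls out of repeated halving, so the actual number of partitions is a power-of-two-ish value near the request rather than the request itself.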

@github-actions github-actions bot added the root label Jan 22, 2025
@james-willis
Contributor

Would it be better to add a preservePartitions argument to the toDf method?

@github-actions github-actions bot removed the root label Jan 24, 2025
@github-actions github-actions bot added the docs label Jan 27, 2025
@paleolimbot paleolimbot marked this pull request as ready for review January 27, 2025 19:32
@paleolimbot paleolimbot requested a review from jiayuasu as a code owner January 27, 2025 19:32
@paleolimbot (Member Author) left a comment

For this to be universally useful (and to remove the warning in code and in the docs), we'll need partitioners that don't introduce duplicates. I'm happy to do that here or in another PR (with my preference being another PR since the changes are largely orthogonal).

Would it be better to add a preservePartitions argument to the toDf method?

@james-willis I didn't forget about this! I'm happy to defer to anything here. I added it as a separate function because there were already a lot of toDf overloads, and I wasn't sure whether toDfPartitioned would need more options of its own (it doesn't yet, so maybe a separate function wasn't needed!).

Comment on lines +261 to +263
@transient lazy val log = LoggerFactory.getLogger(getClass.getName)
log.warn(
"toDfPartitioned() may introduce duplicates when used with non-specialized partitioning")
@paleolimbot (Member Author) commented:

Is this the correct way to go about this? Other classes use with Logging to get an instance-specific logger, but I didn't know how to rig that up here, since Adapter is an object and not a class.

@jiayuasu
Member

jiayuasu commented Feb 4, 2025

Closed in favor of #1780.

@jiayuasu jiayuasu closed this Feb 4, 2025
Successfully merging this pull request may close these issues: Preserve Spatial Partitioning From RDD to Dataframe