[SEDONA-705] Add unique partitioner wrapper to enable partitioned writes with Sedona #1778

paleolimbot · 2025-01-29T21:05:53Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a JIRA ticket?

Yes, the URL of the associated JIRA ticket is https://issues.apache.org/jira/browse/SEDONA-705. The PR name follows the format [SEDONA-705] my subject.

What changes were proposed in this PR?

After SEDONA-695 (#1751), we will have the ability to do partitioned writes! To be useful in most contexts, we also need to be able to create those partitions so that each feature is assigned to exactly one partition (i.e., without introducing duplicates).

Added a set of spatialPartitioningWithoutDuplicates() functions to match spatialPartitioning().
Added a GenericUniquePartitioner that wraps an existing partitioner and produces a single (deterministic) result from placeObject().
Wired up the functions to Python.

How was this patch tested?

Tests were added in Java and Python.

Did this PR include necessary documentation updates?

New API, will add once the approach is validated in principle!

...ommon/src/main/java/org/apache/sedona/core/spatialPartitioning/GenericUniquePartitioner.java

paleolimbot · 2025-01-30T18:55:11Z

python/tests/test_base.py

I'm happy to move this to another PR...it's basically what it takes to get VSCode's Python testing integration to enable run/debug/discover.

paleolimbot · 2025-01-30T18:57:34Z

spark/common/src/main/java/org/apache/sedona/core/spatialPartitioning/SpatialPartitioner.java

+  protected SpatialPartitioner() {
+    gridType = null;
+    grids = null;
+  }


A cleaner way to do this would probably be to remove the gridType and grids fields and make getGridType() and getGrids() abstract? I could also remove this constructor and just pass the grids/grid type of the GenericUniquePartitioner's parent through here.
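For illustration, here is a minimal sketch of the abstract-getter alternative mentioned above. All class names are hypothetical stand-ins rather than Sedona code, and grids are represented as plain strings instead of Envelope objects. The point is only that the base class no longer needs fields that a wrapper would have to set to null.

```java
import java.util.List;

// Hypothetical sketch; these nested classes only mirror the shape of the
// discussion and are not part of Sedona.
class AbstractGetterSketch {
  // Base class without gridType/grids fields, so no constructor has to
  // initialize them to null.
  abstract static class Partitioner {
    abstract String getGridType();

    abstract List<String> getGrids();
  }

  // A concrete partitioner stores its own grid type and grids.
  static final class GridPartitioner extends Partitioner {
    private final String gridType;
    private final List<String> grids;

    GridPartitioner(String gridType, List<String> grids) {
      this.gridType = gridType;
      this.grids = grids;
    }

    @Override
    String getGridType() {
      return gridType;
    }

    @Override
    List<String> getGrids() {
      return grids;
    }
  }

  // A wrapper stores only its parent and forwards both getters to it.
  static final class UniqueWrapper extends Partitioner {
    private final Partitioner parent;

    UniqueWrapper(Partitioner parent) {
      this.parent = parent;
    }

    @Override
    String getGridType() {
      return parent.getGridType();
    }

    @Override
    List<String> getGrids() {
      return parent.getGrids();
    }
  }

  public static void main(String[] args) {
    Partitioner wrapped =
        new UniqueWrapper(new GridPartitioner("KDBTREE", List.of("cell-0", "cell-1")));
    System.out.println(wrapped.getGridType() + " " + wrapped.getGrids().size());
  }
}
```

The trade-off is that every existing subclass would need to implement the two getters, which is a larger change than adding a protected constructor.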

paleolimbot · 2025-01-30T18:58:45Z

...ommon/src/main/java/org/apache/sedona/core/spatialPartitioning/GenericUniquePartitioner.java

+    Iterator<Tuple2<Integer, Geometry>> it = parent.placeObject(spatialObject);
+    int minParitionId = Integer.MAX_VALUE;
+    Geometry minGeometry = null;
+    while (it.hasNext()) {
+      Tuple2<Integer, Geometry> value = it.next();
+      if (value._1() < minParitionId) {
+        minParitionId = value._1();
+        minGeometry = value._2();
+      }
+    }
+
+    HashSet<Tuple2<Integer, Geometry>> out = new HashSet<Tuple2<Integer, Geometry>>();
+    if (minGeometry != null) {
+      out.add(new Tuple2<Integer, Geometry>(minParitionId, minGeometry));
+    }
+
+    return out.iterator();


I could also just take it.next() here (i.e., always return the first one) and modify the placeObject implementations to not return an iterator off of a HashSet (i.e., write our own implementation of Iterator, which might be better anyway).

The motivation here is to ensure that the output is deterministic (i.e., if you set grids and ask Sedona to partition, you'll get the same result if you run your pipeline today or tomorrow).

I'm OK with your current approach. We should add a comment to the code that selects the partitioning result with the minimum partition ID, noting that this is done to produce a consistent result.
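To make the determinism argument concrete, here is a minimal, self-contained sketch of the selection step. It is not the Sedona implementation: `MinPartitionSketch` and `selectMin` are hypothetical names, and `Map.Entry<Integer, String>` stands in for `Tuple2<Integer, Geometry>`. Because the minimum partition ID does not depend on iteration order, the result is the same even when the candidates come from a collection with unspecified ordering, such as a HashSet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the wrapper's selection step; not Sedona code.
class MinPartitionSketch {
  // Pick the candidate with the smallest partition ID, regardless of the
  // order in which the parent partitioner yields its candidates.
  static <T> Optional<Map.Entry<Integer, T>> selectMin(
      Iterable<Map.Entry<Integer, T>> candidates) {
    Map.Entry<Integer, T> best = null;
    for (Map.Entry<Integer, T> candidate : candidates) {
      if (best == null || candidate.getKey() < best.getKey()) {
        best = candidate;
      }
    }
    // Empty when the parent produced no placement at all.
    return Optional.ofNullable(best);
  }

  public static void main(String[] args) {
    List<Map.Entry<Integer, String>> candidates = new ArrayList<>();
    candidates.add(Map.entry(7, "POINT (1 1)"));
    candidates.add(Map.entry(2, "POINT (1 1)"));
    candidates.add(Map.entry(5, "POINT (1 1)"));

    // Shuffle to mimic an unspecified iteration order.
    Collections.shuffle(candidates);
    System.out.println(selectMin(candidates).get().getKey()); // prints 2
  }
}
```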

paleolimbot · 2025-01-30T19:00:20Z

spark/common/src/test/resources/.gitignore

@@ -1,2 +1,3 @@
 *.DS_Store
 real-*
+wkb/testSaveAs*


Again, I can move this to another PR (running the Python tests creates quite a lot of files here!)

paleolimbot · 2025-01-30T20:32:16Z

spark/common/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java

@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

  /** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
  public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
-    this.partitioner = new FlatGridPartitioner(otherGrids);
+    this.partitioner = new IndexedGridPartitioner(otherGrids);


I think this is probably a better default (logarithmic rather than linear complexity for placement). It's important to support this one for the partitioning use case (it enables composability between the "generate my grids" and "partition my data" steps).

If we think this is important, should we un-deprecate it? I have a similar pattern in an internal feature I'm building at work, but I create the IndexedGridPartitioner myself because this API is deprecated.

Is it right to include the overflow partition here? Perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.

If we think this is important, should we un-deprecate it?

That's a great point! I added it here because it's invoked by Python for the non-unique case, and my test uses it. I think it may be more useful to add a docstring explaining when it is appropriate to use this mechanism (partitioning, testing, and your use case are all good ones, I think; load balancing a partitioning scheme for a join is not, which I think is what Jia was afraid would be the case). If performance was the concern, I think the indexed version of the flat grid removes that issue.

Is it right to include the overflow partition here?

I think so...if it's not included, will rows be removed from the output? I don't think there's a performance issue with including it.

If it's not included, will rows be removed from the output?

Correct. Best to leave it in.
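As a rough illustration of why the overflow partition matters for partitioned writes, here is a small sketch under stated assumptions. `Rect`, `place`, and the convention that the overflow partition gets the ID `grids.size()` are all hypothetical choices for this example, not a description of Sedona's internals, and the sketch uses a linear scan for brevity rather than an index.

```java
import java.util.List;

// Hypothetical illustration; Rect and place() are not part of Sedona.
class OverflowPlacementSketch {
  // Axis-aligned rectangle standing in for a grid envelope.
  static final class Rect {
    final double minX, minY, maxX, maxY;

    Rect(double minX, double minY, double maxX, double maxY) {
      this.minX = minX;
      this.minY = minY;
      this.maxX = maxX;
      this.maxY = maxY;
    }

    boolean contains(double x, double y) {
      return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }
  }

  // Returns the index of the first grid cell containing the point. When no
  // cell matches, returns grids.size() (the assumed overflow partition ID)
  // if overflow is enabled, or -1 to signal that the row would be dropped.
  static int place(List<Rect> grids, double x, double y, boolean useOverflow) {
    for (int i = 0; i < grids.size(); i++) {
      if (grids.get(i).contains(x, y)) {
        return i;
      }
    }
    return useOverflow ? grids.size() : -1;
  }

  public static void main(String[] args) {
    List<Rect> grids = List.of(new Rect(0, 0, 10, 10), new Rect(10, 0, 20, 10));
    System.out.println(place(grids, 5, 5, true)); // inside the first cell
    System.out.println(place(grids, 50, 50, true)); // outside every cell, kept in overflow
    System.out.println(place(grids, 50, 50, false)); // outside every cell, dropped
  }
}
```

With user-supplied grids there is no guarantee that the cells cover the whole dataset, so keeping the overflow partition means every input row still lands somewhere in the output.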

paleolimbot · 2025-01-30T20:32:41Z

.../common/src/main/java/org/apache/sedona/core/spatialPartitioning/IndexedGridPartitioner.java

@@ -48,7 +48,7 @@ public IndexedGridPartitioner(
  }

  public IndexedGridPartitioner(GridType gridType, List<Envelope> grids) {
-    this(gridType, grids, false);
+    this(gridType, grids, true);


This aligns the default between the IndexedGridPartitioner and the FlatGridPartitioner.

james-willis · 2025-01-31T19:28:55Z

spark/common/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java

@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

  /** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
  public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
-    this.partitioner = new FlatGridPartitioner(otherGrids);
+    this.partitioner = new IndexedGridPartitioner(otherGrids);


If we think this is important, should we un-deprecate it? I have a similar pattern in an internal feature I'm building at work, but I create the IndexedGridPartitioner myself because this API is deprecated.

james-willis · 2025-01-31T19:30:24Z

spark/common/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java

@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

  /** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
  public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
-    this.partitioner = new FlatGridPartitioner(otherGrids);
+    this.partitioner = new IndexedGridPartitioner(otherGrids);


Is it right to include the overflow partition here? Perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.

james-willis · 2025-01-31T19:30:54Z

spark/common/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java

+  }
+
+  /** @deprecated Use spatialPartitioningWithoutDuplicates(SpatialPartitioner partitioner) */
+  public boolean spatialPartitioningWithoutDuplicates(final List<Envelope> otherGrids)


Why write a new deprecated API?

Your idea of un-deprecating this is a good one...I had to add this because it's called from Python if you pass a list of envelopes (and that's how I test this). I'll turn this into a docstring with some useful content on the appropriate uses of this.

james-willis · 2025-01-31T19:32:08Z

spark/common/src/main/java/org/apache/sedona/core/spatialRDD/SpatialRDD.java

@@ -159,6 +159,32 @@ public boolean spatialPartitioning(GridType gridType) throws Exception {
    return true;
  }

+  public boolean spatialParitioningWithoutDuplicates(GridType gridType) throws Exception {


Is withoutDuplicates better as its own set of methods or as a bool flag? No strong opinion, but something to consider.

I do tend to have a preference towards methods (but I'm also happy to change this to the prevailing opinion!). I know most Java IDEs let you toggle inlay hints that make this moot, but my theory was that spatialParitioningWithoutDuplicates(GridType.KDBTREE) is slightly more informative to read than spatialPartitioning(GridType.KDBTREE, false).

The Python API has introduce_duplicates added as an optional parameter. I prefer making the Python API consistent with the Java API by adding a new spatialParitioningWithoutDuplicates method.

paleolimbot · 2025-02-05T21:53:46Z

@jiayuasu @Kontinuation @james-willis I've rebased and added some bits that weren't covered in the StructuredAdapter PR. Are there any more comments here that I missed?

docs/tutorial/sql.md

github-actions bot added sedona-python sedona-spark labels Jan 29, 2025

jiayuasu reviewed Jan 30, 2025

...ommon/src/main/java/org/apache/sedona/core/spatialPartitioning/GenericUniquePartitioner.java (outdated, resolved)

paleolimbot commented Jan 30, 2025

paleolimbot marked this pull request as ready for review January 30, 2025 20:32

james-willis suggested changes Jan 31, 2025

paleolimbot added 15 commits February 4, 2025 21:11

add generic unique partitioner

f5c8437

first pass

67afcb8

format, python

79570ab

remove parent class contructor constraint

a6e7a09

add java test

b588be8

make it so I can run pytest

cc5324d

ignore files generated during Python test

1bb9351

add python tests

c3a256d

check something

bb7cbff

align default for the indexed grid partitioner

dc4c3ce

maybe fix adapter error

d4eb46c

add comment

d19e81f

update python API

fa4691d

add docs

6140547

spotless

ac137e9

paleolimbot force-pushed the unique-partitioner branch from 56c3a63 to ac137e9 Compare February 4, 2025 21:17

paleolimbot added 3 commits February 4, 2025 21:25

port documentation

9c0a184

port test over

d709d06

port scala test

f0e70d3

github-actions bot added the docs label Feb 4, 2025

paleolimbot added 2 commits February 4, 2025 21:54

actually use without dups

aaa6603

maybe only warn sometimes

827a0e7

jiayuasu requested changes Feb 6, 2025

docs/tutorial/sql.md (outdated, resolved)

docs/tutorial/sql.md (outdated, resolved)

docs/tutorial/sql.md (outdated, resolved)

docs/tutorial/sql.md (resolved)

jiayuasu and others added 5 commits February 5, 2025 19:36

Update docs/tutorial/sql.md

2202653

Update docs/tutorial/sql.md

8189c82

Update docs/tutorial/sql.md

5acabb5

Update docs/tutorial/sql.md

f225522

lint

e67e454

jiayuasu added this to the sedona-1.7.1 milestone Feb 6, 2025

jiayuasu added improvement affect public APIs labels Feb 6, 2025

jiayuasu linked an issue Feb 6, 2025 that may be closed by this pull request

Preserve Spatial Partitioning From RDD to Dataframe #1268

Closed

jiayuasu approved these changes Feb 6, 2025

jiayuasu merged commit 1260245 into apache:master Feb 6, 2025
39 checks passed
