Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SEDONA-705] Add unique partitioner wrapper to enable partitioned writes with Sedona #1778

Merged
merged 25 commits into from
Feb 6, 2025

Conversation

paleolimbot
Copy link
Member

@paleolimbot paleolimbot commented Jan 29, 2025

Did you read the Contributor Guide?

Is this PR related to a JIRA ticket?

What changes were proposed in this PR?

After SEDONA-695 (#1751), we will have the ability to do partitioned writes! To be useful in most contexts, we also need the ability to create those partitions assigning a unique partition to each feature (i.e., don't introduce duplicates).

  • Added a set of spatialPartitioningWithoutDuplicates() functions to match spatialPartitioning()
  • Added a GenericUniqueSpatialPartitioner that wraps an existing partitioner producing a (deterministic) single result on placeObject().
  • Wired up the functions to Python

How was this patch tested?

Tests were added in Java and Python.

Did this PR include necessary documentation updates?

  • New API, will add once the approach is validated in principle!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to move this to another PR...it's basically what it takes to get VSCode's Python testing integration to enable run/debug/discover.

Comment on lines +38 to +41
protected SpatialPartitioner() {
gridType = null;
grids = null;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A cleaner way to do this would probably be to remove the gridType and grids field and make getGridType() and getGrids() abstract? I could also remove this constructror and just pass the GenericUniquePartitioner's parent grids/grid type through here.

Comment on lines +48 to +67
Iterator<Tuple2<Integer, Geometry>> it = parent.placeObject(spatialObject);
int minParitionId = Integer.MAX_VALUE;
Geometry minGeometry = null;
while (it.hasNext()) {
Tuple2<Integer, Geometry> value = it.next();
if (value._1() < minParitionId) {
minParitionId = value._1();
minGeometry = value._2();
}
}

HashSet<Tuple2<Integer, Geometry>> out = new HashSet<Tuple2<Integer, Geometry>>();
if (minGeometry != null) {
out.add(new Tuple2<Integer, Geometry>(minParitionId, minGeometry));
}

return out.iterator();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also just take it.Next() here (i.e., always return the first one) and modify the placeObject implementations to not return an iterator off of a HashSet (i.e., write our own implementation of Iterator, which might be better anyway).

The motivation here is to ensure that the output is deterministic (i.e., if you set grids and ask Sedona to partition, you'll get the same result if you run your pipeline today or tomorrow).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with your current approach. We'd better add comment to the code selecting the partitioning result with the minimum partition id to signify that this is for producing consistent result.

@@ -1,2 +1,3 @@
*.DS_Store
real-*
wkb/testSaveAs*
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I can move this to another PR (running the Python tests creates quite a lot of files here!)

@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
this.partitioner = new FlatGridPartitioner(otherGrids);
this.partitioner = new IndexedGridPartitioner(otherGrids);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this probably is a better default (log complexity vs. linear complexity for placement). It's important to support this one for the partitioning use case (enables composability between the "generate my grids" and "partition my data" steps).

Copy link
Contributor

@james-willis james-willis Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we think this is important should we un-deprecate it? I have a similar pattern in an internal feature im building at work, but I create the IndexedGridParitioner myself because this API is deprecated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right to include the overflow partition here? perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.

Copy link
Member Author

@paleolimbot paleolimbot Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we think this is important should we un-deprecate it?

That's a great point! I added it here because it's invoked by Python for the non-unique case, and my test uses it. I think it may be more useful to add a docstring explaining when it is appropriate to use this mechanism (partitioning, testing, and your use case are I think all good ones; load balancing a partitioning scheme for a join is not, which is I think what Jia was afraid would be the case). The indexed version of the flat grid I think removes the performance issue with it if that was a concern.

Is it right to include the overflow partition here?

I think so...if it's not included will rows be removed from the output? I don't think there's a performance issue with including it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's not included will rows be removed from the output?

Correct. best to leave it in

@@ -48,7 +48,7 @@ public IndexedGridPartitioner(
}

public IndexedGridPartitioner(GridType gridType, List<Envelope> grids) {
this(gridType, grids, false);
this(gridType, grids, true);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This aligns the default between the IndexedGridPartitioner and the FlatGridPartitioner

@paleolimbot paleolimbot marked this pull request as ready for review January 30, 2025 20:32
@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
this.partitioner = new FlatGridPartitioner(otherGrids);
this.partitioner = new IndexedGridPartitioner(otherGrids);
Copy link
Contributor

@james-willis james-willis Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we think this is important should we un-deprecate it? I have a similar pattern in an internal feature im building at work, but I create the IndexedGridParitioner myself because this API is deprecated.

@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) {

/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception {
this.partitioner = new FlatGridPartitioner(otherGrids);
this.partitioner = new IndexedGridPartitioner(otherGrids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it right to include the overflow partition here? perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.

}

/** @deprecated Use spatialPartitioningWithoutDuplicates(SpatialPartitioner partitioner) */
public boolean spatialPartitioningWithoutDuplicates(final List<Envelope> otherGrids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why write a new deprecated API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your idea of un-deprecating this is a good one...I had to add this because it's called from Python if you pass a list of envelopes (and that's how I test this). I'll turn this into a docstring with some useful content on the appropriate uses of this.

@@ -159,6 +159,32 @@ public boolean spatialPartitioning(GridType gridType) throws Exception {
return true;
}

public boolean spatialParitioningWithoutDuplicates(GridType gridType) throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is withoutDuplicates better as its own set of methods or as a bool flag? No strong opinion but something to consider.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do tend to have a preference towards methods (but also happy to change this to the prevailing opinion!). I know most Java IDEs let you toggle inlays that make this mute, but my theory was that spatialParitioningWithoutDuplicates(GridType.KDBTREE) is slightly more informative to read than spatialPartitioning(GridType.KDBTREE, false).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python API has introduce_duplicates added as an optional parameter. I prefer making python API consistent with the Java API by adding a new spatialParitioningWithoutDuplicates method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@github-actions github-actions bot added the docs label Feb 4, 2025
@paleolimbot
Copy link
Member Author

@jiayuasu @Kontinuation @james-willis I've rebased and added some bits that weren't covered in the StructuredAdapter PR. Are there any more comments here that I missed?

docs/tutorial/sql.md Outdated Show resolved Hide resolved
docs/tutorial/sql.md Outdated Show resolved Hide resolved
docs/tutorial/sql.md Outdated Show resolved Hide resolved
docs/tutorial/sql.md Show resolved Hide resolved
@jiayuasu jiayuasu added this to the sedona-1.7.1 milestone Feb 6, 2025
@jiayuasu jiayuasu linked an issue Feb 6, 2025 that may be closed by this pull request
@jiayuasu jiayuasu merged commit 1260245 into apache:master Feb 6, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Preserve Spatial Partitioning From RDD to Dataframe
4 participants