-
Notifications
You must be signed in to change notification settings - Fork 692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SEDONA-705] Add unique partitioner wrapper to enable partitioned writes with Sedona #1778
Conversation
...ommon/src/main/java/org/apache/sedona/core/spatialPartitioning/GenericUniquePartitioner.java
Outdated
Show resolved
Hide resolved
python/tests/test_base.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to move this to another PR...it's basically what it takes to get VSCode's Python testing integration to enable run/debug/discover.
protected SpatialPartitioner() { | ||
gridType = null; | ||
grids = null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A cleaner way to do this would probably be to remove the gridType
and grids
field and make getGridType()
and getGrids()
abstract? I could also remove this constructror and just pass the GenericUniquePartitioner
's parent grids/grid type through here.
Iterator<Tuple2<Integer, Geometry>> it = parent.placeObject(spatialObject); | ||
int minParitionId = Integer.MAX_VALUE; | ||
Geometry minGeometry = null; | ||
while (it.hasNext()) { | ||
Tuple2<Integer, Geometry> value = it.next(); | ||
if (value._1() < minParitionId) { | ||
minParitionId = value._1(); | ||
minGeometry = value._2(); | ||
} | ||
} | ||
|
||
HashSet<Tuple2<Integer, Geometry>> out = new HashSet<Tuple2<Integer, Geometry>>(); | ||
if (minGeometry != null) { | ||
out.add(new Tuple2<Integer, Geometry>(minParitionId, minGeometry)); | ||
} | ||
|
||
return out.iterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could also just take it.Next()
here (i.e., always return the first one) and modify the placeObject
implementations to not return an iterator off of a HashSet
(i.e., write our own implementation of Iterator
, which might be better anyway).
The motivation here is to ensure that the output is deterministic (i.e., if you set grids and ask Sedona to partition, you'll get the same result if you run your pipeline today or tomorrow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with your current approach. We'd better add comment to the code selecting the partitioning result with the minimum partition id to signify that this is for producing consistent result.
@@ -1,2 +1,3 @@ | |||
*.DS_Store | |||
real-* | |||
wkb/testSaveAs* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I can move this to another PR (running the Python tests creates quite a lot of files here!)
@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) { | |||
|
|||
/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */ | |||
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception { | |||
this.partitioner = new FlatGridPartitioner(otherGrids); | |||
this.partitioner = new IndexedGridPartitioner(otherGrids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this probably is a better default (log complexity vs. linear complexity for placement). It's important to support this one for the partitioning use case (enables composability between the "generate my grids" and "partition my data" steps).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we think this is important should we un-deprecate it? I have a similar pattern in an internal feature im building at work, but I create the IndexedGridParitioner myself because this API is deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it right to include the overflow partition here? perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we think this is important should we un-deprecate it?
That's a great point! I added it here because it's invoked by Python for the non-unique case, and my test uses it. I think it may be more useful to add a docstring explaining when it is appropriate to use this mechanism (partitioning, testing, and your use case are I think all good ones; load balancing a partitioning scheme for a join is not, which is I think what Jia was afraid would be the case). The indexed version of the flat grid I think removes the performance issue with it if that was a concern.
Is it right to include the overflow partition here?
I think so...if it's not included will rows be removed from the output? I don't think there's a performance issue with including it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's not included will rows be removed from the output?
Correct. best to leave it in
@@ -48,7 +48,7 @@ public IndexedGridPartitioner( | |||
} | |||
|
|||
public IndexedGridPartitioner(GridType gridType, List<Envelope> grids) { | |||
this(gridType, grids, false); | |||
this(gridType, grids, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This aligns the default between the IndexedGridPartitioner
and the FlatGridPartitioner
@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) { | |||
|
|||
/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */ | |||
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception { | |||
this.partitioner = new FlatGridPartitioner(otherGrids); | |||
this.partitioner = new IndexedGridPartitioner(otherGrids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we think this is important should we un-deprecate it? I have a similar pattern in an internal feature im building at work, but I create the IndexedGridParitioner myself because this API is deprecated.
@@ -278,7 +304,7 @@ public void spatialPartitioning(SpatialPartitioner partitioner) { | |||
|
|||
/** @deprecated Use spatialPartitioning(SpatialPartitioner partitioner) */ | |||
public boolean spatialPartitioning(final List<Envelope> otherGrids) throws Exception { | |||
this.partitioner = new FlatGridPartitioner(otherGrids); | |||
this.partitioner = new IndexedGridPartitioner(otherGrids); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it right to include the overflow partition here? perhaps this is why the API is deprecated, so you can configure the spatial partitioner on its own.
} | ||
|
||
/** @deprecated Use spatialPartitioningWithoutDuplicates(SpatialPartitioner partitioner) */ | ||
public boolean spatialPartitioningWithoutDuplicates(final List<Envelope> otherGrids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why write a new deprecated API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your idea of un-deprecating this is a good one...I had to add this because it's called from Python if you pass a list of envelopes (and that's how I test this). I'll turn this into a docstring with some useful content on the appropriate uses of this.
@@ -159,6 +159,32 @@ public boolean spatialPartitioning(GridType gridType) throws Exception { | |||
return true; | |||
} | |||
|
|||
public boolean spatialParitioningWithoutDuplicates(GridType gridType) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is withoutDuplicates better as its own set of methods or as a bool flag? No strong opinion but something to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do tend to have a preference towards methods (but also happy to change this to the prevailing opinion!). I know most Java IDEs let you toggle inlays that make this mute, but my theory was that spatialParitioningWithoutDuplicates(GridType.KDBTREE)
is slightly more informative to read than spatialPartitioning(GridType.KDBTREE, false)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python API has introduce_duplicates
added as an optional parameter. I prefer making python API consistent with the Java API by adding a new spatialParitioningWithoutDuplicates
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
56c3a63
to
ac137e9
Compare
@jiayuasu @Kontinuation @james-willis I've rebased and added some bits that weren't covered in the StructuredAdapter PR. Are there any more comments here that I missed? |
Did you read the Contributor Guide?
Is this PR related to a JIRA ticket?
[SEDONA-705] my subject
.What changes were proposed in this PR?
After SEDONA-695 (#1751), we will have the ability to do partitioned writes! To be useful in most contexts, we also need the ability to create those partitions assigning a unique partition to each feature (i.e., don't introduce duplicates).
spatialPartitioningWithoutDuplicates()
functions to matchspatialPartitioning()
GenericUniqueSpatialPartitioner
that wraps an existing partitioner producing a (deterministic) single result onplaceObject()
.How was this patch tested?
Tests were added in Java and Python.
Did this PR include necessary documentation updates?