Better support for binary predicates with large inputs. #1166
Conversation
Why do we need such large tests to catch bugs? Is there a different code path taken for the larger tests that wasn't being exercised?
There are three reasons that the larger tests revealed some new bugs:
When I added
I guess the concern with larger tests is that they take more time to run in CI. Can you post the time it takes to run this test?
This sounds like it's a performance issue, not a bug?
I assume you mean
Does the large input make it easier for developers to locate the corner cases, or does it make it harder?
Small change request about reducing duplication and improving documentation. Also, can you file an issue describing the bug this PR is trying to fix?
result = contains + intersects >= rhs.sizes - polygon_size_reduction
return result

def _test_interior(self, lhs, rhs):
I wish this had a more intuitive name; what does this function do?
This function is addressed in #1186; wait until that is merged and review it there, where this function is removed. :)
The large tests take 21.77s on my machine. I'll modify them to use the cartesian product of the
It is a performance issue. As mentioned,
Yes, sorry,
I can, I think, reproduce the effect of the large input just by building the cartesian product of the features provided in each
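For concreteness, here is a minimal sketch (not the PR's actual test code) of what building inputs from the cartesian product of a few features could look like. The feature lists are invented, and `cuspatial.from_geopandas` plus the `GeoSeries.intersects` predicate are assumed to behave as in this repository.

```python
from itertools import product

import geopandas as gpd
from shapely.geometry import Point, Polygon

import cuspatial

# Hypothetical stand-ins for the features held in the dispatch_list fixtures.
POINT_FEATURES = [Point(0.5, 0.5), Point(1.5, 1.5), Point(10, 10)]
POLYGON_FEATURES = [
    Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    Polygon([(3, 3), (5, 3), (5, 5), (3, 5)]),
]

# Cartesian product: every point paired with every polygon.
pairs = list(product(POINT_FEATURES, POLYGON_FEATURES))
lhs = gpd.GeoSeries([p for p, _ in pairs])
rhs = gpd.GeoSeries([q for _, q in pairs])

# Compare the GPU result against GeoPandas on the enlarged inputs.
expected = lhs.intersects(rhs)
got = cuspatial.from_geopandas(lhs).intersects(cuspatial.from_geopandas(rhs))
assert (got.to_pandas().values == expected.values).all()
```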
If that's the case, you can keep them in a single PR. Just make the names of the tests intuitive enough to identify the corner cases.
To be fair, it's good to have fuzzy tests like this that unearth bugs that go unnoticed with smaller inputs. It's just that they could be too expensive to run. If the fuzzy tests can be equivalently replaced by a few corner cases, that would be awesome. If not, we need to think about the trade-off between having them run in CI versus the chance that they might catch additional regressions. Do you expect that future binpred development will benefit from having the fuzzy tests run in CI to prevent unexpected regressions?
We can also put the long-running "fuzzy" tests in nightly tests and skip them in PR tests.
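As a rough illustration of that suggestion (not existing CI configuration), one way to gate the long-running tests is a skip condition driven by an environment variable that only the nightly job sets; `RUN_NIGHTLY` here is a hypothetical name.

```python
import os

import pytest

# Skip the expensive fuzz-style test unless the nightly job opts in.
RUN_NIGHTLY = os.environ.get("RUN_NIGHTLY", "0") == "1"


@pytest.mark.skipif(not RUN_NIGHTLY, reason="long-running fuzz test; nightly CI only")
def test_binpred_large_examples():
    ...  # the expensive cartesian-product comparison would go here
```

Registering a custom `slow` marker and deselecting it in PR jobs with `-m "not slow"` is another common pytest pattern for the same split.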
The testing looks more thorough. Some minor suggestions on comments and one-liners, and one comment on the complexity of the boolean pip output reshaping...
On a positive note, this PR actually seems to reduce total pytest time on my machine.
The growth in warnings is concerning, and I think it's something we should work to reduce.
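One generic pytest pattern for keeping the warning count from growing (not something this PR adds) is to escalate a chosen warning category to an error, so new occurrences fail the test instead of quietly accumulating; the category below is only an example.

```python
import warnings

import pytest


@pytest.mark.filterwarnings("error::DeprecationWarning")
def test_no_new_deprecations():
    # Any DeprecationWarning raised in here now fails the test;
    # other categories, like this UserWarning, are unaffected.
    warnings.warn("illustration only", UserWarning)
```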
I've handled all your reviews thus far.
/merge
Description
Closes #1142
This PR adds a few bugfixes and optimizations that improve performance when large `GeoSeries` are used with binary predicates. It also corrects a few errors in the predicate logic that were revealed when the size of the feature space increased by combining all possible features in the `dispatch_list`.

Changes:
- `contains.py`: `pairwise_point_in_polygon` and steps to resemble `quadtree` results.
- `contains_geometry_processor.py`: `is True` and add a TODO for future optimization.
- `feature_contains.py`: `_compute_polygon_linestring_contains` to handle `GeoSeries` containing `LineStrings` of varying lengths.
- `feature_contains_properly.py`: `pairwise_point_in_polygon` as default mode, with documentation. `PointMultiPointContains`, which is needed by internal methods.
- `feature_crosses.py`: `intersection`
- `feature_disjoint.py`: `PointPointDisjoint`, and drop extraneous `intersections`.
- `feature_equals.py`
- `feature_intersects.py`: `intersection`
- `feature_touches.py`
- `geoseries.py`: `input_types`. `point_indices` for `Point` type. `reset_index`, which was doing a host->device copy.
- `binpred_test_dispatch.py`
- `test_binpred_large_examples.py`
- `test_equals_only_binpreds.py`
- `test_binpred_large_examples.py`: `test_dispatch` to create large `GeoSeries` and compare results with `GeoPandas`.
- `test_feature_groups.py`: `dispatch_list` feature sets combined into a single GeoSeries (a sketch of this kind of test follows the description).
- `binpred_utils.py`: (`_basic_contains_count`). `_points_and_lines_to_multipoints`
- `column_utils.py`: `contains_only` calls.
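For the `test_feature_groups.py` and `test_binpred_large_examples.py` items above, here is a minimal, hedged sketch of the kind of test they describe: a few features combined into single `GeoSeries` objects, with each predicate in a group checked against `GeoPandas`. The concrete features and the predicate list are invented, and GeoPandas-style predicate methods on `cuspatial.GeoSeries` (via `cuspatial.from_geopandas`) are assumed.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

import cuspatial

# Hypothetical feature group; the real tests draw these from the
# dispatch_list fixtures rather than hard-coding them.
points = gpd.GeoSeries([Point(1, 1), Point(2, 2), Point(10, 10)])
polygons = gpd.GeoSeries([Polygon([(0, 0), (3, 0), (3, 3), (0, 3)])] * 3)

lhs = cuspatial.from_geopandas(points)
rhs = cuspatial.from_geopandas(polygons)

# Run each predicate in the group against the same combined GeoSeries
# and check every result against GeoPandas.
for predicate in ("contains", "intersects", "disjoint"):
    expected = getattr(points, predicate)(polygons)
    got = getattr(lhs, predicate)(rhs)
    assert (got.to_pandas().values == expected.values).all(), predicate
```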