
Better support for binary predicates with large inputs. #1166

Merged

Conversation

thomcom
Contributor

@thomcom thomcom commented Jun 1, 2023

Description

Closes #1142

This PR adds a few bugfixes and optimizations that improve performance when large GeoSeries are used with binary predicates. It also corrects a few errors in the predicate logic that were revealed when the size of the feature space increased by combining all possible features in the dispatch_list.

Changes:
contains.py

  • Add pairwise_point_in_polygon and steps to make its results resemble quadtree results.

contains_geometry_processor.py

  • Drop is True and add a TODO for future optimization.

feature_contains.py

  • Refactor _compute_polygon_linestring_contains to handle GeoSeries containing LineStrings of varying lengths.

feature_contains_properly.py

  • Add pairwise_point_in_polygon as default mode with documentation.
  • Add PointMultiPointContains which is needed by internal methods.

feature_crosses.py

  • Drop extraneous intersection

feature_disjoint.py

  • Add PointPointDisjoint and drop extraneous intersections.

feature_equals.py

  • Fix LineStringLineStringEquals which wasn't properly handling LineStrings with varying lengths.

feature_intersects.py

  • Drop extraneous intersection

feature_touches.py

  • Fix LineStringLineStringTouches. It is slow and needs further optimization.
  • Fix PolygonPolygonTouches. It is also slow and needs further optimization.

geoseries.py

  • Drop index from input_types.
  • Fix point_indices for Point type.
  • Optimize reset_index which was doing a host->device copy.

binpred_test_dispatch.py

  • Add test case test_binpred_large_examples.py.
  • Test large sets of all the dispatched tests together.

test_equals_only_binpreds.py

  • Test corrections to input_types indexes.

test_binpred_large_examples.py

  • Use the features from test_dispatch to create large GeoSeries and compare results with GeoPandas.

test_feature_groups.py

  • Test each of the dispatch_list feature sets combined into a single GeoSeries.

binpred_utils.py

  • Don't count hits when point and polygon indexes don't match (a bug in _basic_contains_count).
  • Optimize mask generation in _points_and_lines_to_multipoints

column_utils.py

  • Optimize contains_only calls.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@thomcom thomcom added bug Something isn't working ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jun 1, 2023
@thomcom thomcom added this to the DE-9IM milestone Jun 1, 2023
@thomcom thomcom requested review from harrism, isVoid and jarmak-nv June 1, 2023 01:34
@thomcom thomcom self-assigned this Jun 1, 2023
@thomcom thomcom requested a review from a team as a code owner June 1, 2023 01:34
@github-actions github-actions bot added the Python Related to Python code label Jun 1, 2023
@harrism
Member

harrism commented Jun 1, 2023

Why do we need such large tests to catch bugs? Is there a different code path taken for the larger tests that wasn't being exercised?

@thomcom thomcom removed the request for review from harrism June 1, 2023 14:50
@thomcom thomcom changed the base branch from branch-23.06 to branch-23.08 June 5, 2023 15:51
@rapidsai rapidsai deleted a comment from harrism Jun 5, 2023
@thomcom
Contributor Author

thomcom commented Jun 5, 2023

There are three reasons that larger tests revealed some new bugs:

  1. I added pairwise_point_in_polygon for the benchmark notebook so that the .contains benchmarks could be run on pairwise datasets large enough to demonstrate performance. This implementation is still slow, but it accepts large inputs. brute_force only allows comparison of 31 points and 31 polygons, a scale at which GeoPandas is faster because it avoids any Python<->GPU transfer overhead. Similarly, quadtree_point_in_polygon only works on up to approximately 10,000 points and 10,000 polygons, which also isn't enough to demonstrate significant performance gains over direct CPU computation.

When I added pairwise_point_in_polygon I didn't build the results DataFrame correctly, which produced incorrect results for most of the .contains tests. These failures were not caught in unit testing because the test_dispatch tests only use 3 features on each side, which takes the brute_force path. The good news is that none of these regressions are in 23.06, which works as expected for all inputs since it uses only brute_force or quadtree.

  2. The large GeoSeries for the notebook and the large tests were generated by repeated subindexing: slicing a short GeoSeries with a long array of random, repeated values, all within the range of the short GeoSeries, to produce a large GeoSeries with many repeated geometries. This method revealed an error in the underlying GeoArrow index_types array, because the index of index_types was equal to the original repeating slice. That bug caused a few of the result-counting and feature-merging predicates to fail.
  3. Creating large GeoSeries of every possible lhs and rhs value effectively built the cartesian product of all the features in a particular dispatch_list, which revealed a couple more combinations that weren't correctly handled by the binary predicates.
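The repeated-subindexing trick described in (2) can be sketched with a plain-Python stand-in for a GeoSeries; the names `short_features` and `n_large` below are illustrative, not from the PR, and the real code slices a cuspatial GeoSeries the same way:

```python
import random

# A short "series" of features stands in for a short GeoSeries.
short_features = ["Point(0 0)", "Point(1 1)", "Point(2 2)"]

random.seed(0)
n_large = 10_000

# A long array of random, repeated indices, all within the range of the
# short series...
index = [random.randrange(len(short_features)) for _ in range(n_large)]

# ...produces a large series with many repeated geometries.
large_features = [short_features[i] for i in index]

assert len(large_features) == n_large
# Every element of the large series is drawn from the short one.
assert set(large_features) <= set(short_features)
```

A correct implementation must track the original positions in the short series (the point_indices / input_types bookkeeping), which is exactly where the repeated index exposed the bug.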

@thomcom thomcom removed the ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround label Jun 5, 2023
@thomcom thomcom requested a review from harrism June 5, 2023 20:28
@isVoid
Contributor

isVoid commented Jun 5, 2023

I guess the concern with the larger tests is that they take more time to run in CI. Can you post the time it takes to run this test?

I added pairwise_point_in_polygon for the benchmark notebook so that the .contains benchmarks could be run on pairwise datasets large enough to demonstrate performance. This implementation is still slow, but it accepts large inputs. brute_force only allows comparison of 31 points and 31 polygons, a scale at which GeoPandas is faster because it avoids any Python<->GPU transfer overhead. Similarly, quadtree_point_in_polygon only works on up to approximately 10,000 points and 10,000 polygons, which also isn't enough to demonstrate significant performance gains over direct CPU computation.

This sounds like it's a performance issue, not a bug?

The large GeoSeries for the notebook and the large tests were generated by repeated subindexing: slicing a short GeoSeries with a long array of random, repeated values, all within the range of the short GeoSeries, to produce a large GeoSeries with many repeated geometries. This method revealed an error in the underlying GeoArrow index_types array, because the index of index_types was equal to the original repeating slice. That bug caused a few of the result-counting and feature-merging predicates to fail.

I assume you mean index_types -> input_types. Regardless, if input_types is wrongly indexed because it has an additional level of reindexing, it sounds like you could easily create a small reproducer that doesn't need a large input?

Creating large GeoSeries of every possible lhs and rhs value effectively built the cartesian product of all the features in a particular dispatch_list, which revealed a couple more combinations that weren't correctly handled by the binary predicates.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

Contributor

@isVoid isVoid left a comment


Small change request regarding reducing duplication and better documentation. Also can you file an issue describing the bug that this PR is trying to fix?

python/cuspatial/cuspatial/core/binpreds/contains.py
result = contains + intersects >= rhs.sizes - polygon_size_reduction
return result

def _test_interior(self, lhs, rhs):
Contributor


Wish there's a more intuitive name, what does this function do?

Contributor Author


This function is addressed in #1186, where it is removed. Wait until that is merged and review it there. :)

@thomcom
Contributor Author

thomcom commented Jun 6, 2023

I guess the concern with the larger tests is that they take more time to run in CI. Can you post the time it takes to run this test?

The large tests take 21.77s on my machine. I'll modify them to use the cartesian product of the dispatch_list lhs and rhs, which will decrease their size.

This sounds like it's a performance issue, not a bug?

It is a performance issue. As mentioned, brute_force only supports pairwise operations up to 31x31, and quadtree only supports pairwise operations up to ~10,000x10,000; at neither scale is CPU performance slow enough to really benefit from GPU acceleration.

I assume you mean index_types -> input_types. Regardless, if input_types is wrongly indexed because it has an additional level of reindexing, it sounds like you could easily create a small reproducer that doesn't need a large input?

Yes, sorry, input_types as per GeoMeta. I'm splitting this into two separate PRs: one that handles the input_types slicing bug, and another that adds the binpred fixes related to testing the cartesian product of the dispatch_list. However, the latter will depend on the former, which doesn't necessarily make review easier.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

I can, I think, reproduce the effect of the large input just by building the cartesian product of the features provided in each dispatch_list, which will make it easier to locate corner cases.
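For illustration, pairing every lhs feature with every rhs feature is a one-liner with itertools; the feature names below are hypothetical placeholders for the entries in a dispatch_list (the real tests pair cuspatial geometries):

```python
from itertools import product

# Hypothetical stand-ins for the lhs and rhs feature sets of one
# dispatch_list entry.
lhs_features = ["Point", "MultiPoint", "LineString", "Polygon"]
rhs_features = ["Point", "MultiPoint", "LineString", "Polygon"]

# Every (lhs, rhs) combination, so each binary predicate is exercised
# against all feature pairings rather than a handful of examples.
pairs = list(product(lhs_features, rhs_features))

assert len(pairs) == len(lhs_features) * len(rhs_features)  # 16 pairs
assert ("Point", "Polygon") in pairs
```

This keeps the corner-case coverage of the large fuzzy tests while bounding the test size to the product of the two feature lists.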

@isVoid
Contributor

isVoid commented Jun 6, 2023

I'm splitting this into two separate PRs: one that handles the input_types slicing bug, and another that adds the binpred fixes related to testing the cartesian product of the dispatch_list. However, the latter will depend on the former, which doesn't necessarily make review easier.

If that's the case you can keep them in a single PR. Just make the name of the tests intuitive enough to identify the corner cases.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

I can, I think, reproduce the effect of the large input just by building the cartesian product of the features provided in each dispatch_list, which will make it easier to locate corner cases.

To be fair, it's good to have fuzzy tests like this that unearth bugs that go unnoticed with smaller inputs. It's just that they could be too expensive to run. If the fuzzy test can be equivalently replaced by a few corner cases, that would be awesome. If not, we need to think about the trade-off between having them run in CI versus the chance that they might catch additional regressions. Do you expect that future binpred development can benefit from having the fuzzy test run in CI to protect you from unexpected regressions?

@harrism
Member

harrism commented Jun 7, 2023

If not, we need to think about the trade-off between having them run in CI versus the chance that they might catch additional regressions.

We can also put the long-running "fuzzy" tests in nightly tests and skip them in PR tests.

Member

@harrism harrism left a comment


The testing looks more thorough. Some minor suggestions on comments and one-liners, and one comment on the complexity of the boolean pip output reshaping...

python/cuspatial/cuspatial/core/binpreds/contains.py (5 resolved review threads)
@harrism
Member

harrism commented Jun 7, 2023

On a positive note, this PR actually seems to reduce total pytest time on my machine compared to before this PR.

23.04
=========================== 525 passed, 39 skipped, 81 warnings in 45.43s ============================
23.06
================ 953 passed, 19 skipped, 3 xfailed, 830 warnings in 294.19s (0:04:54) ================
23.08 (pre PR 1166)
================ 953 passed, 19 skipped, 3 xfailed, 830 warnings in 302.47s (0:05:02) ================
23.08 (post PR 1166)
=============== 1071 passed, 19 skipped, 3 xfailed, 927 warnings in 132.47s (0:02:12) ================

The growth in warnings is concerning, and I think it's something we should work to reduce.

thomcom and others added 12 commits June 7, 2023 09:10
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
…erly.py

Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
@thomcom
Contributor Author

thomcom commented Jun 7, 2023

I've handled all your reviews thus far.

@thomcom
Contributor Author

thomcom commented Jun 8, 2023

/merge

@rapids-bot rapids-bot bot merged commit 1d2c174 into rapidsai:branch-23.08 Jun 8, 2023
Labels
2 - In Progress Currently a work in progress bug Something isn't working non-breaking Non-breaking change Python Related to Python code
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

[BUG]: Sliced GeoSeries have invalid ._meta.input_types
3 participants