
Better support for binary predicates with large inputs. #1166

Merged

Conversation

thomcom
Contributor

@thomcom thomcom commented Jun 1, 2023

Description

Closes #1142

This PR adds a few bugfixes and optimizations that improve performance when large GeoSeries are used with binary predicates. It also corrects a few errors in the predicate logic that were revealed when the size of the feature space increased by combining all possible features in the dispatch_list.

Changes:
contains.py

  • Add pairwise_point_in_polygon and steps to make its results resemble quadtree results.

contains_geometry_processor.py

  • Drop is True and add a TODO for future optimization.

feature_contains.py

  • Refactor _compute_polygon_linestring_contains to handle GeoSeries containing LineStrings of varying lengths.

feature_contains_properly.py

  • Add pairwise_point_in_polygon as default mode with documentation.
  • Add PointMultiPointContains which is needed by internal methods.

feature_crosses.py

  • Drop extraneous intersection

feature_disjoint.py

  • Add PointPointDisjoint and drop extraneous intersections.

feature_equals.py

  • Fix LineStringLineStringEquals which wasn't properly handling LineStrings with varying lengths.

feature_intersects.py

  • Drop extraneous intersection

feature_touches.py

  • Fix LineStringLineStringTouches. It is slow and needs further optimization.
  • Fix PolygonPolygonTouches. It is also slow and needs further optimization.

geoseries.py

  • Drop index from input_types.
  • Fix point_indices for Point type.
  • Optimize reset_index which was doing a host->device copy.

binpred_test_dispatch.py

  • Add test case test_binpred_large_examples.py.
  • Test large sets of all the dispatched tests together.

test_equals_only_binpreds.py

  • Test corrections to input_types indexes.

test_binpred_large_examples.py

  • Use the features from test_dispatch to create large GeoSeries and compare results with GeoPandas.

test_feature_groups.py

  • Test each of the dispatch_list feature sets combined into a single GeoSeries.

binpred_utils.py

  • Don't count hits when point and polygon indexes don't match (a bug in _basic_contains_count).
  • Optimize mask generation in _points_and_lines_to_multipoints

column_utils.py

  • Optimize contains_only calls.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@thomcom thomcom added bug Something isn't working ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jun 1, 2023
@thomcom thomcom added this to the DE-9IM milestone Jun 1, 2023
@thomcom thomcom requested review from harrism, isVoid and jarmak-nv June 1, 2023 01:34
@thomcom thomcom self-assigned this Jun 1, 2023
@thomcom thomcom requested a review from a team as a code owner June 1, 2023 01:34
@github-actions github-actions bot added the Python Related to Python code label Jun 1, 2023
@harrism
Member

harrism commented Jun 1, 2023

Why do we need such large tests to catch bugs? Is there a different code path taken for the larger tests that wasn't being exercised?

@thomcom thomcom removed the request for review from harrism June 1, 2023 14:50
@thomcom thomcom changed the base branch from branch-23.06 to branch-23.08 June 5, 2023 15:51
@rapidsai rapidsai deleted a comment from harrism Jun 5, 2023
@thomcom
Contributor Author

thomcom commented Jun 5, 2023

There are three reasons that larger tests revealed some new bugs:

  1. I added pairwise_point_in_polygon for the benchmark notebook so that the .contains benchmarks could be run on pairwise datasets large enough to demonstrate performance. This implementation is still slow, but it accepts large inputs. brute_force only allows comparison of 31 points and 31 polygons, a scale at which GeoPandas is faster because it avoids any Python<->GPU transfer overhead. Similarly, quadtree_point_in_polygon only works on up to approximately 10,000 points and 10,000 polygons, which also isn't enough to demonstrate significant performance gains over direct CPU computation.

When I added pairwise_point_in_polygon I didn't build the results DataFrame correctly, which produced incorrect results for most of the .contains tests. These failures were not caught in unit testing because the test_dispatch tests only use 3 features on each side, which takes the brute_force path. The good news is that none of these regressions are in 23.06, which works as expected for all inputs since it uses only brute_force or quadtree.

  2. The large GeoSeries for the notebook and the large tests were generated by repeated subindexing: slicing a short GeoSeries with a long array of random, repeated values, all within the range of the short GeoSeries, to produce a large GeoSeries with many repeated geometries. This method revealed an error in the underlying GeoArrow index_types array, because the index of index_types was equal to the original repeating slice. That bug caused a few of the result-counting and feature-merging predicates to fail.
  3. Creating large GeoSeries of every possible lhs and rhs value effectively built the cartesian product of all the features in a particular dispatch_list, which revealed a couple more combinations that weren't correctly handled by the binary predicates.
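The repeated-subindexing trick described in (2) can be sketched with a plain-Python stand-in for a GeoSeries; the names `short_features` and `n_large` below are illustrative, not from the PR, and the real code slices a cuspatial GeoSeries the same way:

```python
import random

# A short "series" of features stands in for a short GeoSeries.
short_features = ["Point(0 0)", "Point(1 1)", "Point(2 2)"]

random.seed(0)
n_large = 10_000

# A long array of random, repeated indices, all within the range of the
# short series...
index = [random.randrange(len(short_features)) for _ in range(n_large)]

# ...produces a large series with many repeated geometries.
large_features = [short_features[i] for i in index]

assert len(large_features) == n_large
# Every element of the large series is drawn from the short one.
assert set(large_features) <= set(short_features)
```

A correct implementation must track the original positions in the short series (the point_indices / input_types bookkeeping), which is exactly where the repeated index exposed the bug.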

@thomcom thomcom removed the ! - Hotfix Hotfix is a bug that affects the majority of users for which there is no reasonable workaround label Jun 5, 2023
@thomcom thomcom requested a review from harrism June 5, 2023 20:28
@isVoid
Contributor

isVoid commented Jun 5, 2023

I guess the concern with the larger tests is that they take more time to run in CI. Can you post the time it takes to run this test?

I added pairwise_point_in_polygon for the benchmark notebook so that the .contains benchmarks could be run on pairwise datasets large enough to demonstrate performance. This implementation is still slow, but it accepts large inputs. brute_force only allows comparison of 31 points and 31 polygons, a scale at which GeoPandas is faster because it avoids any Python<->GPU transfer overhead. Similarly, quadtree_point_in_polygon only works on up to approximately 10,000 points and 10,000 polygons, which also isn't enough to demonstrate significant performance gains over direct CPU computation.

This sounds like it's a performance issue, not a bug?

The large GeoSeries for the notebook and the large tests were generated by repeated subindexing: slicing a short GeoSeries with a long array of random, repeated values, all within the range of the short GeoSeries, to produce a large GeoSeries with many repeated geometries. This method revealed an error in the underlying GeoArrow index_types array, because the index of index_types was equal to the original repeating slice. That bug caused a few of the result-counting and feature-merging predicates to fail.

I assume you mean index_types -> input_types. Regardless, if input_types is wrongly indexed because it has an additional level of reindexing, it sounds like you could easily create a small reproducer that doesn't need a large input?

Creating large GeoSeries of every possible lhs and rhs value effectively built the cartesian product of all the features in a particular dispatch_list, which revealed a couple more combinations that weren't correctly handled by the binary predicates.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

Contributor

@isVoid isVoid left a comment


Small change request regarding reducing duplication and better documentation. Also can you file an issue describing the bug that this PR is trying to fix?

python/cuspatial/cuspatial/core/binpreds/contains.py
result = contains + intersects >= rhs.sizes - polygon_size_reduction
return result

def _test_interior(self, lhs, rhs):
Contributor


Wish there's a more intuitive name, what does this function do?

Contributor Author


This function is addressed in #1186, where it is removed. Wait until that is merged and review it there. :)

@thomcom
Contributor Author

thomcom commented Jun 6, 2023

I guess the concern with the larger tests is that they take more time to run in CI. Can you post the time it takes to run this test?

The large tests take 21.77s on my machine. I'll modify them to use the cartesian product of the dispatch_list lhs and rhs, which will decrease their size.

This sounds like it's a performance issue, not a bug?

It is a performance issue. As mentioned, brute_force only supports pairwise operations up to 31x31, and quadtree only supports pairwise operations up to ~10,000x10,000; at neither scale is CPU performance slow enough to really benefit from GPU acceleration.

I assume you mean index_types -> input_types. Regardless, if input_types is wrongly indexed because it has an additional level of reindexing, it sounds like you could easily create a small reproducer that doesn't need a large input?

Yes, sorry, input_types as per GeoMeta. I'm splitting this into two separate PRs: one that handles the input_types slicing bug, and another that adds the binpred fixes related to testing the cartesian product of the dispatch_list. However, the latter will depend on the former, which doesn't necessarily make review easier.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

I can, I think, reproduce the effect of the large input just by building the cartesian product of the features provided in each dispatch_list, which will make it easier to locate corner cases.
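For illustration, pairing every lhs feature with every rhs feature is a one-liner with itertools; the feature names below are hypothetical placeholders for the entries in a dispatch_list (the real tests pair cuspatial geometries):

```python
from itertools import product

# Hypothetical stand-ins for the lhs and rhs feature sets of one
# dispatch_list entry.
lhs_features = ["Point", "MultiPoint", "LineString", "Polygon"]
rhs_features = ["Point", "MultiPoint", "LineString", "Polygon"]

# Every (lhs, rhs) combination, so each binary predicate is exercised
# against all feature pairings rather than a handful of examples.
pairs = list(product(lhs_features, rhs_features))

assert len(pairs) == len(lhs_features) * len(rhs_features)  # 16 pairs
assert ("Point", "Polygon") in pairs
```

This keeps the corner-case coverage of the large fuzzy tests while bounding the test size to the product of the two feature lists.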

@isVoid
Contributor

isVoid commented Jun 6, 2023

I'm splitting this into two separate PRs: one that handles the input_types slicing bug, and another that adds the binpred fixes related to testing the cartesian product of the dispatch_list. However, the latter will depend on the former, which doesn't necessarily make review easier.

If that's the case you can keep them in a single PR. Just make the name of the tests intuitive enough to identify the corner cases.

Does the large input make it easier for developers to locate the corner cases, or does it make it harder?

I can, I think, reproduce the effect of the large input just by building the cartesian product of the features provided in each dispatch_list, which will make it easier to locate corner cases.

To be fair, it's good to have fuzzy tests like this that unearth bugs that go unnoticed with smaller inputs. It's just that they could be too expensive to run. If the fuzzy test can be equivalently replaced by a few corner cases, that would be awesome. If not, we need to think about the trade-off between having them run in CI versus the chance that they might catch additional regressions. Do you expect that future binpred development can benefit from having the fuzzy test run in CI to protect you from unexpected regressions?

@harrism
Member

harrism commented Jun 7, 2023

If not, we need to think about the trade-off between having them run in CI versus the chance that they might catch additional regressions.

We can also put the long-running "fuzzy" tests in nightly tests and skip them in PR tests.

Member

@harrism harrism left a comment


The testing looks more thorough. Some minor suggestions on comments and one-liners, and one comment on the complexity of the boolean pip output reshaping...

python/cuspatial/cuspatial/core/binpreds/contains.py (5 resolved review threads)
@harrism
Member

harrism commented Jun 7, 2023

On a positive note, this PR actually seems to reduce total pytest time on my machine compared to before this PR.

23.04
=========================== 525 passed, 39 skipped, 81 warnings in 45.43s ============================
23.06
================ 953 passed, 19 skipped, 3 xfailed, 830 warnings in 294.19s (0:04:54) ================
23.08 (pre PR 1166)
================ 953 passed, 19 skipped, 3 xfailed, 830 warnings in 302.47s (0:05:02) ================
23.08 (post PR 1166)
=============== 1071 passed, 19 skipped, 3 xfailed, 927 warnings in 132.47s (0:02:12) ================

The growth in warnings is concerning, and I think it's something we should work to reduce.

thomcom and others added 12 commits June 7, 2023 09:10
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
…erly.py

Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
Co-authored-by: Mark Harris <783069+harrism@users.noreply.github.com>
@thomcom
Contributor Author

thomcom commented Jun 7, 2023

I've handled all your reviews thus far.

@thomcom
Contributor Author

thomcom commented Jun 8, 2023

/merge

@rapids-bot rapids-bot bot merged commit 1d2c174 into rapidsai:branch-23.08 Jun 8, 2023
Labels
2 - In Progress Currently a work in progress bug Something isn't working non-breaking Non-breaking change Python Related to Python code
Projects
Status: Review
Development

Successfully merging this pull request may close these issues.

[BUG]: Sliced GeoSeries have invalid ._meta.input_types
3 participants