-
Notifications
You must be signed in to change notification settings - Fork 156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace cudf _from_columns with a public API #1326
Replace cudf _from_columns with a public API #1326
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question.
@@ -91,7 +91,7 @@ def directed_hausdorff_distance(multipoints: GeoSeries): | |||
as_column(multipoints.multipoints.geometry_offset[:-1]), | |||
) | |||
|
|||
return DataFrame._from_columns(result, range(num_spaces)) | |||
return DataFrame(dict(enumerate(result))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this as performant as before? It's definitely less obvious what is happening compared to before, so perhaps a 1-liner doc? As a non-Python-native, is this considered Pythonic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a little more overhead due to validation checks in DataFrame
as opposed to DataFrame._from_columns
, but this is the most equivalent public API to the previous line. Here's the performance of this patch with 500k points
In [1]: from shapely.geometry import MultiPoint
...:
In [2]: import cuspatial
...:
In [3]: data = [[(0, 0)] * 500_000]
In [4]: s = cuspatial.GeoSeries([MultiPoint(coords) for coords in data])
# PR
In [5]: %timeit cuspatial.directed_hausdorff_distance(s)
555 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As a non-Python-native, is this considered Pythonic?
I would say so yes. I will add a 1 line comment about what this line is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dict(enumerate(...))
is a Pythonic way to construct a dictionary of the form {i: values[i]}
from a sequence values
. This constructor is less performant than the old approach. The more performant choice here would be to use DataFrame._from_data(dict(enumerate(...)))
. I assume Matt picked this alternative to avoid using an internal method. The reality though is that cuspatial hooks heavily into cudf internals at the Python layer. Until we get around to a more thorough refactoring of cudf to better expose the functionality cuspatial relies on, it's probably OK to continue using cudf internals for better perf. I'm OK with either though and will leave it up to Matt to decide what to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there's an understanding that cuspatial uses internal cudf APIs already and performant, public APIs don't exist for cuspatial yet, I'll switch this to the more performant _from_data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.
I don't know what the performance benefit is here, I think you only showed the performance of the dict() approach above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK actually it appears validation through the public cudf.DataFrame
doesn't scale well with wide frames. Not sure what is a representative scale for directed_hausdorff_distance
is, but for a 1000 x 1000 point comparison, this routine's microbenchmark is equivalent to
In [1]: import cudf
In [2]: columns = [cudf.core.column.as_column(range(1000), dtype="float64")] * 1000
In [3]: %timeit cudf.DataFrame._from_data(dict(enumerate(columns)))
371 µs ± 6.96 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit cudf.DataFrame(dict(enumerate(columns)))
11.3 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So let's keep _from_data
here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I think we should remove internal cuDF calls from cuSpatial except in cases where there is a significant performance benefit. In other cases the maintenance cost is not worthwhile.
I agree, but IIRC there are also cases where cuSpatial uses cuDF internals because there is no public-facing API for doing what cuSpatial needs. For example, cuDF is not really designed for extensible sets of dtypes, so cuSpatial uses internal methods to set up usage of the GeoColumn because there is no public access to column-level APIs otherwise. We hope to make that properly extensible at some point, but that's a long ways away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another main reason is that GeoColumn/GeoDataFrame extends Column/DataFrame so it depends on a lot of their internals. Removing the inheritance and using just the public APIs from cuDF may give us what we need minus the performance cost. (Use a Series everywhere a column op might come into play).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be possible but would have a significant performance cost. We can evaluate when we get around to the appropriate stage of cudf refactoring.
/merge |
Description
This private API will be removed in 24.02
Checklist