Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Faster dataframe to cupy conversion when dataframe is a single allocation #12928

Open
beckernick opened this issue Mar 13, 2023 · 4 comments
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.

Comments

@beckernick
Copy link
Member

beckernick commented Mar 13, 2023

When we convert a dataframe to a cupy array, we iterate over each column (as they’re independent allocations) and assign each one to a column in an empty matrix. This means it can be slow for thousands or millions of small columns.

In a select set of circumstances, all of the columns in a DataFrame may be part of a single, contiguous allocation of memory. One scenario in which this can occur is after a call to transpose. It would be nice if, in this scenario, we didn't need to iterate over every column when converting to a cupy array.

A real-world example of when this can matter is if a user is trying to run a dot product after a calling transpose. Because of the bottleneck, we're slower than pandas by quite a bit.

import cudf
import cupy as cp
import pandas as pd

nrows = 10000
ncols = 4

gdf = cudf.DataFrame(cp.random.randint(0, 1000, size=(nrows, ncols)))
pdf = gdf.to_pandas()

%time gdf.T.dot(gdf)
%time pdf.T.dot(pdf)
CPU times: user 1.52 s, sys: 3.96 ms, total: 1.53 s
Wall time: 1.53 s
CPU times: user 912 µs, sys: 41 µs, total: 953 µs
Wall time: 855 µs

If we were to do any special casing here, we'd want to closely evaluate any impact on performance for the more general case, as the dataframe to cupy codepath is used across the board.

@beckernick beckernick added feature request New feature or request Needs Triage Need team to review and classify labels Mar 13, 2023
@shwina
Copy link
Contributor

shwina commented Mar 13, 2023

The resulting CuPy array would need to be F-contiguous, IIUC. Would that be alright?

@wence-
Copy link
Contributor

wence- commented Mar 15, 2023

The resulting CuPy array would need to be F-contiguous, IIUC. Would that be alright?

Yes, since cupy dot handles that fine.

The right answer here (since the .T is already expensive) is to do:

tmp = gdf.values
result = cudf.DataFrame(tmp.T.dot(tmp))

But that is certainly not as seamless

@beckernick
Copy link
Member Author

Agreed, F contiguousness would be fine. With that said, if we decide that a fastpath for this sceniaro has a non-trivial cost (in performance or complexity) associated with it and isn't worth the cost, that's also reasonable.

For completeness, when the user hit this originally they switched to CuPy to do the operations (like Lawrence suggested).

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 6, 2023
@vyasr
Copy link
Contributor

vyasr commented May 17, 2024

Related: #11648

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request Python Affects Python cuDF API.
Projects
Status: Todo
Status: In Progress
Development

No branches or pull requests

5 participants