Make isinstance check pass for proxy ndarrays #16601

Matt711 · 2024-08-19T18:57:48Z

Description

Closes #14537.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

python/cudf/cudf/pandas/proxy_base.py

Matt711 · 2024-08-26T22:41:50Z

/ok to test

Matt711 · 2024-08-27T00:18:24Z

This PR fixes the two issues with #16286 (i.e., the previous version of this PR, which we reverted). The two issues were:

1. In __array_finalize__(self, obj), the _fsproxy_wrapped attribute was set equal to obj, which could be None. This PR adds a check, if obj is None: return, to prevent that. I ran the cudf-pandas-integration tests, and the check seems to fix the tests that were failing with the previous version of this PR. Once #16645 ([FEA] Add third-party library integration testing of cudf.pandas to cudf) is merged, we'll be able to see the third-party integration tests passing in cudf.
2. In __new__, there was an initial DtoH transfer when a cp.ndarray was passed to the proxy array class. This PR doesn't include the DtoH transfer. Instead, __new__ creates a "garbage" np.ndarray with the correct shape and dtype but with uninitialized (garbage) data.
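
A minimal sketch of the two changes described above, not the actual cudf implementation. The class name GarbageBufferProxy and the attribute _wrapped (standing in for _fsproxy_wrapped) are hypothetical, and a plain np.ndarray stands in for the cp.ndarray so the example runs without a GPU.

```python
import numpy as np


class GarbageBufferProxy(np.ndarray):
    """Hypothetical proxy: an ndarray subclass whose own buffer is never filled."""

    def __new__(cls, wrapped):
        # Allocate an uninitialized buffer with the wrapped array's shape and
        # dtype. No data is copied, so the buffer contents are arbitrary.
        obj = super().__new__(cls, shape=wrapped.shape, dtype=wrapped.dtype)
        obj._wrapped = wrapped
        return obj

    def __array_finalize__(self, obj):
        # NumPy passes obj=None when the array is built by an explicit
        # constructor call, as in __new__ above; there is nothing to inherit.
        if obj is None:
            return
        # For views and slices, carry the reference over from the parent.
        self._wrapped = getattr(obj, "_wrapped", obj)


source = np.arange(6).reshape(2, 3)
proxy = GarbageBufferProxy(source)
print(isinstance(proxy, np.ndarray))  # True
print(proxy.shape, proxy._wrapped is source)
```

Without the None guard, the explicit constructor call in __new__ would leave the wrapped attribute set to None until __new__ overwrites it, which is the failure mode the first fix is meant to rule out.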

The fix for issue 2 leads to a problem that, as far as I can tell, can only be solved by monkeypatching. Now that our proxy array passes the check isinstance(proxy, np.ndarray), many NumPy functions (which call into NumPy's C functions) assume they can use the array's buffer directly instead of going through a double-underscore method like __array__. This is a problem because our proxy array's buffer holds garbage data, so those functions produce garbage results.

Take np.asarray as an example. It eventually reaches code like the following. Our proxy array passes PyArray_Check but yields incorrect data from the PyArray_DATA call.

if (PyArray_Check(obj)) {
    void *buffer = PyArray_DATA((PyArrayObject *)obj);  // access the buffer directly
}
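
The same bypass can be observed from Python. The sketch below is illustrative only: LoggingProxy, _wrapped, and array_calls are hypothetical names, and the buffer is filled with zeros rather than left uninitialized so that the result is repeatable.

```python
import numpy as np


class LoggingProxy(np.ndarray):
    """Hypothetical proxy whose own buffer differs from the data it wraps."""

    array_calls = 0  # counts how often __array__ is invoked

    def __new__(cls, wrapped):
        # Zeros stand in for the "garbage" buffer so the output is repeatable.
        obj = np.zeros(wrapped.shape, dtype=wrapped.dtype).view(cls)
        obj._wrapped = wrapped
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self._wrapped = getattr(obj, "_wrapped", None)

    def __array__(self, dtype=None, copy=None):
        # Would return the real data, if NumPy ever asked for it.
        type(self).array_calls += 1
        return np.asarray(self._wrapped, dtype=dtype)


proxy = LoggingProxy(np.array([1, 2, 3]))
result = np.asarray(proxy)
print(result.tolist())           # contents of the proxy's own buffer
print(LoggingProxy.array_calls)  # number of __array__ calls made by np.asarray
```

Because the proxy is already an ndarray, np.asarray should hand back a view of its own buffer (the zeros) without ever calling __array__, so the counter should stay at 0.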

I think the only way to avoid this problem from Python is to monkeypatch every function like np.asarray. So far I've found several functions we'd need to patch: np.asarray, np.where, np.outer, np.inner, and several functions in the np.linalg module.

There is at least one function in a third-party library that we'd also need to patch (torch.from_numpy). It could be patched in the same way as the NumPy functions.
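
One way the patching idea could look, as a sketch under stated assumptions rather than what this PR implements. Proxy, _wrapped, and unwrap_proxy_args are hypothetical names. The wrapper is kept under a separate name here instead of being assigned back onto the numpy module, so the example does not alter global state.

```python
import functools

import numpy as np


class Proxy(np.ndarray):
    """Hypothetical proxy with a zero-filled buffer and a wrapped real array."""

    def __new__(cls, wrapped):
        obj = np.zeros(wrapped.shape, dtype=wrapped.dtype).view(cls)
        obj._wrapped = wrapped
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self._wrapped = getattr(obj, "_wrapped", None)


def unwrap_proxy_args(func):
    """Return a wrapper that swaps proxy arguments for their wrapped arrays."""

    def unwrap(value):
        return value._wrapped if isinstance(value, Proxy) else value

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        args = tuple(unwrap(a) for a in args)
        kwargs = {k: unwrap(v) for k, v in kwargs.items()}
        return func(*args, **kwargs)

    return wrapper


# A real monkeypatch would assign the wrapper back, e.g.
#     np.asarray = unwrap_proxy_args(np.asarray)
# Here it is kept under a separate name so NumPy itself stays untouched.
patched_asarray = unwrap_proxy_args(np.asarray)

proxy = Proxy(np.array([1, 2, 3]))
print(np.asarray(proxy).tolist())       # reads the proxy's own buffer
print(patched_asarray(proxy).tolist())  # reads the wrapped array
```

The cost of this approach is visible even in the sketch: every function that touches the buffer directly needs its own patch, and proxies nested inside lists or tuples would not be unwrapped by this simple version.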

Numba CPU-dispatched functions also access the proxy array's buffer. The good news is that Numba will eventually support compiling objects that implement __array__ (see numba/9584). The bad news is that this probably won't be ready before 24.10. In the meantime, we would have to let users know that using proxy arrays directly in CPU-dispatched functions returns incorrect results. If there's no way to fail loudly in this case, it should at least be documented. FYI, we currently fail when using proxy arrays in CPU-dispatched functions because the isinstance check fails. So if it's possible to keep failing, that's not the worst outcome until the next Numba release.

What do you all think of the monkey patching approach?

Another idea would be to pay the cost of the DtoH transfer up front (as in the previous PR). The DtoH transfer happens only on instance creation; after that, the fast-slow proxy mechanism is used. The pro is that the buffer is set correctly, so no monkeypatching is needed. The con is obviously the DtoH transfer.
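
A sketch of the eager-copy alternative, again with hypothetical names. FakeDeviceArray is a stand-in that mimics only the get() method of a cp.ndarray, so the example runs on a machine without a GPU; EagerCopyProxy and _wrapped are likewise illustrative.

```python
import numpy as np


class FakeDeviceArray:
    """Hypothetical stand-in for a device array exposing a CuPy-style get()."""

    def __init__(self, host_data):
        self._host_data = np.asarray(host_data)

    def get(self):
        # In CuPy, ndarray.get() copies device memory into a NumPy array.
        return self._host_data.copy()


class EagerCopyProxy(np.ndarray):
    """Hypothetical proxy that pays for the device-to-host copy at creation."""

    def __new__(cls, arr):
        if isinstance(arr, FakeDeviceArray):
            arr = arr.get()  # the one-time DtoH transfer
        # View the host data as the proxy type; the buffer holds real values.
        return np.asarray(arr).view(cls)

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self._wrapped = getattr(obj, "_wrapped", obj)


proxy = EagerCopyProxy(FakeDeviceArray([1.0, 2.0, 3.0]))
print(isinstance(proxy, np.ndarray))  # True
print(np.asarray(proxy).tolist())     # real data, no patching required
```

Since the buffer now contains the real values, code that reads it directly (such as the C path behind np.asarray) sees correct data, at the price of one transfer per proxy creation.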

copy-pr-bot · 2024-08-27T14:40:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Matt711 · 2024-08-27T14:41:04Z

/ok to test

vyasr · 2024-08-27T18:04:10Z

Summary of offline discussion: monkey-patching numpy to make this work is a bridge too far without much stronger motivations, and we will probably just go with the eager D2H copy instead.

Matt711 · 2024-08-31T03:09:58Z

Ping @vyasr for a review next week

bdice · 2024-09-03T15:51:08Z

@Matt711 The PR description says "do not merge" but there is no "DO NOT MERGE" label. Can you make this consistent?

Also, for team knowledge: the "Description" section of the PR body is used in the final commit message when the PR is merged. Temporary information like the PR state or benchmarks is better placed in comments than in the "Description" section.

vyasr

Couple of suggestions for improvement, but assuming you don't object to applying them I don't need to review again.

python/cudf/cudf/pandas/fast_slow_proxy.py

python/cudf/cudf/pandas/proxy_base.py

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

Matt711 · 2024-09-04T22:24:44Z

/ok to test

.github/workflows/pr.yaml

galipremsagar · 2024-09-04T22:25:40Z

/okay to test

galipremsagar · 2024-09-05T13:10:19Z

/okay to test

Matt711 · 2024-09-05T13:47:59Z

/merge

Closes rapidsai#14537.

Authors:
- Matthew Murray (https://github.com/Matt711)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#16601

The torch test should no longer fail after #16601.

Authors:
- Matthew Murray (https://github.com/Matt711)

Approvers:
- James Lamb (https://github.com/jameslamb)
- Matthew Roeschke (https://github.com/mroeschke)

URL: #16705

Proxy NumPy arrays are now instances of real NumPy arrays (#16601), so libraries (e.g., numba, torch) that use NumPy's C API should now be able to use proxy arrays. This PR updates the cudf.pandas documentation to reflect this.

Authors:
- Matthew Murray (https://github.com/Matt711)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #16697

Make isinstance check pass for proxy ndarrays

7136fb9

Matt711 added the feature request, Python, 5 - DO NOT MERGE, non-breaking, and cudf.pandas labels Aug 19, 2024

Matt711 self-assigned this Aug 19, 2024

Matt711 requested a review from a team as a code owner August 19, 2024 18:57

Matt711 requested review from wence- and charlesbluca August 19, 2024 18:57

Matt711 commented Aug 19, 2024

python/cudf/cudf/pandas/proxy_base.py

Matt711 requested review from vyasr, mroeschke and galipremsagar August 19, 2024 19:04

Matt711 added 3 commits August 20, 2024 09:58

make asarray use wrapped array

dcc806f

tackle ufuncs

e7b8948

refactor

a1bee53

Matt711 removed the 5 - DO NOT MERGE label Aug 22, 2024

Matt711 and others added 5 commits August 22, 2024 08:16

Merge branch 'feat/ndarray-instance-check' of github.com:Matt711/cudf…

9a05ab1

… into feat/ndarray-instance-check

Merge branch 'branch-24.10' into feat/ndarray-instance-check

59adfe7

Merge branch 'branch-24.10' into feat/ndarray-instance-check

f0c33f9

Merge branch 'feat/ndarray-instance-check' of github.com:Matt711/cudf…

9a33199

… into feat/ndarray-instance-check

monkeypatch np.dot

40f3e14

rapidsai deleted a comment from copy-pr-bot bot Aug 26, 2024

device is a kwarg

216aeb1

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into feat/nda…

4ef0abb

…rray-instance-check

Matt711 mentioned this pull request Aug 30, 2024

Remove xfail from torch-cudf.pandas integration test #16705

Merged

3 tasks

mroeschke approved these changes Aug 30, 2024

Matt711 added 2 commits September 3, 2024 13:38

cleanup

78bc30a

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into feat/nda…

a4892be

…rray-instance-check

vyasr approved these changes Sep 3, 2024

python/cudf/cudf/pandas/fast_slow_proxy.py

python/cudf/cudf/pandas/proxy_base.py

python/cudf/cudf_pandas_tests/test_cudf_pandas.py

Matt711 and others added 4 commits September 4, 2024 07:12

address review

8c8bc3e

Merge branch 'branch-24.10' into feat/ndarray-instance-check

232c9c3

test third-party integration tests

9e7e3de

Merge branch 'feat/ndarray-instance-check' of github.com:Matt711/cudf…

9eb8297

… into feat/ndarray-instance-check

Matt711 requested a review from a team as a code owner September 4, 2024 22:24

Matt711 requested a review from msarahan September 4, 2024 22:24

Matt711 commented Sep 4, 2024

.github/workflows/pr.yaml

Merge branch 'branch-24.10' into feat/ndarray-instance-check

3eee8d1

Matt711 and others added 2 commits September 4, 2024 18:20

remove pr job

e2047e0

Merge branch 'branch-24.10' into feat/ndarray-instance-check

2ec3ad1

rapids-bot bot merged commit e1ab1e7 into rapidsai:branch-24.10 Sep 5, 2024
85 checks passed

Matt711 mentioned this pull request Sep 5, 2024

[BUG] cudf.pandas wrapped numpy arrays not compatible with numba #15694

Closed

This was referenced Oct 2, 2024

[FEA] Improve support or failure modes for numpy and other libraries with C APIs in cudf.pandas #15397

Open

[DOC] Document limitation using cudf.pandas proxy arrays #16955

Merged
