-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove NumPy <2 pin #4615
Remove NumPy <2 pin #4615
Conversation
Updating branch to trigger CI. I think RAPIDS dependencies should now be resolved sufficiently to actually get NumPy 2 (not sure about all runs, as torch may get in the way depending on the version). EDIT: No, the |
Sorry, this comment was meant for the wholegraph PR. (Although it still means wholegraph doesn't matter for this one... |
Updating branch to pull in the latest upstream changes and restart CI now that cuDF is done: rapidsai/cudf#16300 |
Documenting for posterity Seeing some unrelated C++ failures on CI: The following tests FAILED:
50 - EDGE_TRIANGLE_COUNT_TEST (Failed)
51 - LOOKUP_SRC_DST_TEST (Failed)
52 - K_HOP_NBRS_TEST (Failed) Will restart the failed job after CI completes |
Just to note, the test failures seem all due to an out of memory.
|
So potentially related to general CI issues we have seen: rapidsai/cuml#6031 (comment) |
See rapidsaigh-4486 for adding torchdata/pydantic and rapidsaigh-4464 for introducing the torch pin due to NCCL difficulties. It is entirely possible that this still fails due to the NCCL issue!
dependencies.yaml
Outdated
depends_on_pytorch: | ||
common: | ||
- output_types: [conda] | ||
packages: | ||
- &pytorch_unsuffixed pytorch>=2.0,<2.2.0a0 | ||
- torchdata | ||
- pydantic | ||
|
||
specific: | ||
- output_types: [requirements] | ||
matrices: | ||
- matrix: {cuda: "12.*"} | ||
packages: | ||
- --extra-index-url=https://download.pytorch.org/whl/cu121 | ||
- matrix: {cuda: "11.*"} | ||
packages: | ||
- --extra-index-url=https://download.pytorch.org/whl/cu118 | ||
- {matrix: null, packages: null} | ||
- output_types: [requirements, pyproject] | ||
matrices: | ||
- matrix: {cuda: "12.*"} | ||
packages: | ||
- &pytorch_pip torch>=2.0,<2.2.0a0 | ||
- *tensordict | ||
- matrix: {cuda: "11.*"} | ||
packages: | ||
- *pytorch_pip | ||
- *tensordict | ||
- {matrix: null, packages: [*pytorch_pip, *tensordict]} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it is ok to keep this. We just want to relax the upper bound. At a minimum we need PyTorch 2.3.0 for NumPy 2 support ( pytorch/pytorch#107302 ). Though wonder if this upper bound can be dropped altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we have cupy>13.2 in our test environment we are ok to remove the upper bound
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also ok if we enforce pytorch >=2.3 for both test and runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the discussion about what to do with the pytorch
dependency in one thread, for ease of linking.
@hcho3 would >=2.3.0
, as @alexbarghi-nv suggested here, work? A floor would be preferable to a ceiling, I think...otherwise this will have to be changed again in the next release.
Note: we should match whatever's done here in rapidsai/wholegraph#218
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because 2.4.0 has issues with Conda-forge packaging. See conda-forge/pytorch-cpu-feedstock#254 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Sorry, I did see you'd posted that link but misunderstood what it contained. Ok great, I think a ceiling on 2.4.0 is ok here. Let's see what @alexbarghi-nv says.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was recently resolved. So have submitted PR ( #4703 ) to relax this in 24.12
dependencies.yaml
Outdated
@@ -662,10 +658,12 @@ dependencies: | |||
- output_types: [conda, pyproject] | |||
packages: | |||
- pandas | |||
- pydantic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious what pydantic
is being used for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DGL requires it. It tries to import it and raises an exception if it isn't available. It's used by GraphBolt.
Saw a couple test failures, which look like this on CI: FAILED tests/data_store/test_property_graph.py::bench_extract_subgraph_for_rmat[df_type=cudf.DataFrame] - NameError: name 'np' is not defined
FAILED tests/data_store/test_property_graph.py::bench_extract_subgraph_for_rmat[df_type=pandas.DataFrame] - NameError: name 'np' is not defined However the test file referenced has this definition
So not sure why that |
Have gone ahead and restarted CI |
Seems those failures did not clear up, although it isn't clear to me why (or if) they are related to these changes. |
Including the full traceback of one such error with cuDF on CI Note there is also a Pandas failure, which looks identical ___________ bench_extract_subgraph_for_rmat[df_type=cudf.DataFrame] ____________
gpubenchmark = <pytest_benchmark.fixture.BenchmarkFixture object at 0x7fe8f938f950>
rmat_PropertyGraph = (<cugraph.experimental.PropertyGraph object at 0x7fe90cf59410>, src dst weight
0 632844 ... 543777 0.816715
16777214 218373 734889 0.940356
16777215 762125 1012603 0.426682
[16777216 rows x 3 columns])
def bench_extract_subgraph_for_rmat(gpubenchmark, rmat_PropertyGraph):
from cugraph.experimental import PropertyGraph
(pG, generated_df) = rmat_PropertyGraph
scn = PropertyGraph.src_col_name
dcn = PropertyGraph.dst_col_name
verts = []
for i in range(0, 10000, 10):
verts.append(generated_df["src"].iloc[i])
> selected_edges = pG.select_edges(f"{scn}.isin({verts}) | {dcn}.isin({verts})")
tests/data_store/test_property_graph.py:2583:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/envs/test/lib/python3.11/site-packages/cugraph/structure/property_graph.py:1555: in select_edges
selected_col = eval(expr, globals, locals)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E NameError: name 'np' is not defined
<string>:1: NameError Seems to be pointing to this line, which uses
Best guess is the globals need to include some
Edit: To provide summary of the resolution The issue was NumPy 2 changed the scalar representation as part of NEP 51. As a result, some of the values passed to this To fix this Rick, added logic to cast these to Python Edit 2: To give an idea of how this occurred, this overly simplified example may be instructive In [1]: import numpy as np
In [2]: L = list(np.arange(3))
...: print(L)
[np.int64(0), np.int64(1), np.int64(2)]
In [3]: eval_globals = {}
...: eval_locals = {}
...:
...: eval(f"sum({L})", eval_globals, eval_locals)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[3], line 4
1 eval_globals = {}
2 eval_locals = {}
----> 4 eval(f"sum({L})", eval_globals, eval_locals)
File <string>:1
NameError: name 'np' is not defined Some resolutions to this would be
Rick went with option 3 below |
Thanks Philip! 🙏 Think this is looking pretty good Would be great to hear from James and Alex as well 🙂 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving based on what I see in the diff and my own search through the repo... I don't see any references that were missed, and I do believe the new floors on pytorch
+ NCCL will be enough to fix the issues we saw.
But let's please not merge until we see NumPy 2 in the logs of some test jobs (especially conda-python-tests
).
Two of the test jobs errored during source checkout. One of the other builds got hung. The rest of the jobs passed. Canceled to clear out the hung job. Have now restarted those 3 jobs |
That said, one of the Conda test jobs does show NumPy 2 is installed
|
This wheel test job was failing, so I restarted it:
It's been running for 3+ hours and I see a ton of error there (no logs with specific tracebacks reported yet, but it's 98% done so close). e.g. like this:
The other
We'll know more once the job completes and we can see tracebacks, but I think that needs some investigation. |
I'm seeing similar failures in some of the other PRs. Usually these pass and it's the PyG tests that fail. But the failure doesn't seem PyG specific. Is there any chance we pulled the wrong version of dask somewhere? |
Maybe! The errors do look Dask-y
It seems that in that job, we got these:
Same versions I see in the successful wheel-testing job :/ |
Well before that we had a wheel build that was taking a really long time. Honestly am wondering if there is just a bad node in the GHA pool that needs a reboot or perhaps exclusion until it can be repaired |
Rerunning the failed job again The fact that there are two test with slightly different configurations running this test suite with one passing and the other failing is odd. It would be easier to believe Dask was the issue if both jobs were failing (or even more jobs were affected) Given the flakiness we have seen here with CI, suspect there is an issue in the CI pool itself. Flagging this offline for follow up with the appropriate team |
Am noticing this looks a little out-of-date relative to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-approving. Seeing all CI passing again on a re-run, and NumPy 2 being used in all the wheel and conda test jobs, has me feeling confident that this is working as intended.
I also git grep
'd around one more time to check for missed references to NumPy, torch/PyTorch, and NCCL... didn't find any.
From my perspective, this can be merged.
There are several other PRs in Given that and @alexbarghi-nv 's written approval of the main changes here in the threads on this PR, I'm confident we can merge this, so I'm going to merge it. @alexbarghi-nv please |
/merge |
Thanks all! 🙏 |
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4. xref: conda-forge/pytorch-cpu-feedstock#254 xref: conda-forge/pytorch-cpu-feedstock#266 xref: rapidsai/cugraph#4615 xref: rapidsai/cugraph#4703 xref: #59 Authors: - https://github.com/jakirkham Approvers: - Jake Awe (https://github.com/AyodeAwe) - Tingyu Wang (https://github.com/tingyu66) URL: #75
As the issue around PyTorch being built without NumPy was fixed in conda-forge, we can now relax these upper bounds to allow PyTorch 2.4. xref: conda-forge/pytorch-cpu-feedstock#254 xref: conda-forge/pytorch-cpu-feedstock#266 xref: #4615 Authors: - https://github.com/jakirkham - Alex Barghi (https://github.com/alexbarghi-nv) Approvers: - Alex Barghi (https://github.com/alexbarghi-nv) - James Lamb (https://github.com/jameslamb) URL: #4703
This PR removes the NumPy<2 pin which is expected to work for
RAPIDS projects once CuPy 13.3.0 is released (CuPy 13.2.0 had
some issues preventing the use with NumPy 2).