Structure S3-Hosted Wheels as PyPI Repository #7494

Closed
reesehyde opened this issue Jun 29, 2024 · 9 comments · Fixed by #7514

Comments

@reesehyde

🐛 Bug

When trying to construct a dgl.graphbolt.DataLoader in an environment that supports CUDA, the call to torch.ops.graphbolt.set_max_uva_threads() fails with an AttributeError.

To Reproduce

From the environment described below, attempt to create a Graphbolt datapipe per the Node Classification with Minibatch Sampling tutorial. Note that while the environment supports CUDA, the error is produced even when the CPU is used:

from dgl import graphbolt as gb
import torch

device = torch.device("cpu")
dataset = gb.BuiltinDataset("ogbn-arxiv-seeds").load()
datapipe = gb.ItemSampler(dataset.tasks[0].train_set, batch_size=1024, shuffle=True)
datapipe = datapipe.sample_neighbor(dataset.graph, [4, 4])
datapipe = datapipe.copy_to(device)
datapipe = datapipe.fetch_feature(dataset.feature, node_feature_keys=["feat"])
dataloader = gb.DataLoader(datapipe)

This results in:

Traceback (most recent call last):
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/datapipe_bug.py", line 10, in <module>
    dataloader = gb.DataLoader(datapipe)
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/.venv/lib/python3.10/site-packages/dgl/graphbolt/dataloader.py", line 167, in __init__
    torch.ops.graphbolt.set_max_uva_threads(max_uva_threads)
  File "/mnt/host_home/cash-identity-offline-graph-ml/hackweek/.venv/lib/python3.10/site-packages/torch/_ops.py", line 822, in __getattr__
    raise AttributeError(
AttributeError: '_OpNamespace' 'graphbolt' object has no attribute 'set_max_uva_threads'

Expected behavior

DataLoader to be created successfully

Environment

  • DGL Version: 2.1.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.2.1
  • OS: Ubuntu 22.04
  • How you installed DGL: pip (Poetry)
  • Build command you used (if compiling from source): N/A
  • Python version: 3.10
  • CUDA/cuDNN version (if applicable): 11.8
  • GPU models and configuration (e.g. V100): T4
  • Any other relevant information: System arch x86-64

Additional context

I can confirm the graphbolt shared library is present for my PyTorch version:

$ ls .venv/lib/python3.10/site-packages/dgl/graphbolt | grep $(python -c "from torch import __version__ as torchver; print(torchver[:torchver.rfind('+')])")
libgraphbolt_pytorch_2.2.1.so

I'm not sure how to check whether PyTorch is loading it correctly or at all.
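
One way to probe for this, as a minimal sketch: the traceback above shows that an unregistered op surfaces as an AttributeError on the torch.ops.graphbolt namespace, so hasattr can test for it. The helper name missing_ops is hypothetical, and the usage in the comments assumes that importing dgl.graphbolt is what loads the shared library.

```python
def missing_ops(namespace, op_names):
    """Return the names in op_names that the given op namespace does not expose."""
    # hasattr() returns False when attribute lookup raises AttributeError,
    # which is how the traceback above reports an unregistered op.
    return [name for name in op_names if not hasattr(namespace, name)]


# Intended usage (not executed here; requires torch and dgl to be installed):
#   import torch
#   import dgl.graphbolt  # assumed to load libgraphbolt_pytorch_<version>.so
#   print(missing_ops(torch.ops.graphbolt, ["set_max_uva_threads"]))
```

An empty list would mean the op is registered; a non-empty list would suggest the loaded library was built without it.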

Other Versions

Relatedly, my first reaction was to try a different version of DGL and/or PyTorch. But I found that when installing from PyPI on an x86-64 Linux machine, I'm restricted to version 2.1.0 for v2: on PyPI, the 2.0.0 wheel is only available for Linux aarch64, and no Linux wheels are available for 2.2.0 or 2.2.1. Could the CI/CD be updated to build more Linux wheels? I'd love to contribute there if someone could point me in the right direction!

@mfbalin
Collaborator

mfbalin commented Jun 30, 2024

Since you use the CPU as the device, you can pass overlap_feature_fetch=False to the DataLoader as a workaround.

@mfbalin
Collaborator

mfbalin commented Jun 30, 2024

I think the main issue is that you probably installed the CPU version of DGL instead of the CUDA version. Can you tell us which DGL version you have installed? You can check it with pip.
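
The installed version can also be read from Python. This is a rough sketch that assumes CUDA builds carry a local version label such as +cu118 (as in the wheel name dgl-2.3.0+cu118 quoted later in this thread) and that a version without a label is a CPU build; build_variant and installed_variant are hypothetical helper names.

```python
from importlib import metadata


def build_variant(version):
    """Classify a version string by its local label: '2.3.0+cu118' -> 'cu118'."""
    # Build-specific labels follow a '+' in the version string; a version
    # without one is ASSUMED here to be the default (CPU) build.
    _, sep, local = version.partition("+")
    return local if sep else "cpu"


def installed_variant(package="dgl"):
    """Return (version, variant) for an installed package, or None if absent."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return version, build_variant(version)


if __name__ == "__main__":
    print(installed_variant("dgl"))
```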

@Rhett-Ying
Collaborator

@reesehyde please refer to this page for DGL installation; it is the only official installation guide. We host the pip packages ourselves on AWS S3. Only CPU builds were ever uploaded to PyPI, and we stopped uploading there as of DGL 2.2.0, so please always fetch pip packages from AWS S3.

@reesehyde
Author

Ah apologies, the problem was indeed using the CPU version! I just had plain old 2.1.0. Thank you @mfbalin and @Rhett-Ying for the help!

I managed to install this by downloading the correct wheel manually but have to fetch packages through a PyPI proxy. Would the team consider setting up the S3 bucket to be indexable by pip? I don't know exactly what that entails but looking through torch's bucket setup and testing some index urls, just hosting the repo.html file as a file called dgl might be sufficient? Then a fetch for dgl version 2.3.0 with index url https://data.dgl.ai/wheels/torch-2.3/cu118 would look for a version list file (repo.html) at https://data.dgl.ai/wheels/torch-2.3/cu118/dgl/.
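
To make the lookup concrete, here is a small sketch of how an index URL maps to a project page under the simple repository layout (PEP 503), where the project name is normalized and appended as a folder. The function names are hypothetical; this illustrates the convention and is not pip's actual code.

```python
import re


def normalize_project_name(name):
    """Normalize a project name as described in PEP 503."""
    # Runs of '-', '_' and '.' collapse to a single '-', and the name is lowercased.
    return re.sub(r"[-_.]+", "-", name).lower()


def project_page_url(index_url, project):
    """Build the URL of a project's page under a simple-repository index URL."""
    return f"{index_url.rstrip('/')}/{normalize_project_name(project)}/"


if __name__ == "__main__":
    print(project_page_url("https://data.dgl.ai/wheels/torch-2.3/cu118", "dgl"))
```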

@mfbalin
Collaborator

mfbalin commented Jul 2, 2024

Maybe you can update the issue title now that we know what is going wrong.

@reesehyde changed the title from "set_max_uva_threads Pytorch Op Not Found" to "Structure S3-Hosted Wheels as PyPI Repository" on Jul 2, 2024
@reesehyde
Author

reesehyde commented Jul 2, 2024

Thanks @mfbalin, updated the title to reflect the new request. I read up a bit more on hosting a simple PyPI repository and it does look like simply hosting an index file at the /dgl path should do the trick!
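
As an illustration of what such an index file could contain, here is a minimal sketch that renders a project page with one anchor per wheel. It assumes, hypothetically, that the wheels sit one folder above the /dgl/ page (next to repo.html); the function name and that layout are not taken from the DGL publishing setup.

```python
from html import escape
from urllib.parse import quote


def render_project_index(wheel_filenames):
    """Render a minimal simple-repository project page: one anchor per file."""
    lines = ["<!DOCTYPE html>", "<html>", "  <body>"]
    for filename in wheel_filenames:
        # quote() percent-encodes '+' as %2B, so local version labels
        # such as +cu118 survive in the link target.
        href = "../" + quote(filename)  # ASSUMED location of the wheel files
        lines.append(f'    <a href="{href}">{escape(filename)}</a><br/>')
    lines += ["  </body>", "</html>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_project_index(["dgl-2.3.0+cu118-cp310-cp310-manylinux1_x86_64.whl"]))
```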

I'd be happy to create a PR for the update if someone could point me towards the S3-publish logic. I searched around in the repo for "repo.html" and "s3" but only found the CI/CD report and log uploads.

@mfbalin
Collaborator

mfbalin commented Jul 3, 2024

@Rhett-Ying What do you think? I don't know much about PyPI or pip.

@Rhett-Ying
Collaborator

@reesehyde could you describe your use case and what is blocking you? Why does the current install command, pip install dgl -f https://data.dgl.ai/wheels/torch-2.1/repo.html, not work for you? How would you like to install DGL, for example by specifying it in a YAML file?

@reesehyde
Author

reesehyde commented Jul 6, 2024

Thanks @Rhett-Ying, I hadn't tried that command, but you're right that it does the trick in pip. I wasn't aware of pip's -f option (an HTML page of links) as opposed to -i (a Python Package Index). The case I had in mind was essentially using -i rather than -f, which requires a proper PyPI index. This could be established by hosting the /repo.html file at /dgl, and we could then use pip -i https://data.dgl.ai/wheels/torch-2.3/cu118 instead of pip -f https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html.

But I'm using Poetry rather than pip, and it seems my issue is simply due to a bug in Poetry. When specifying the /repo.html page as a source URL, the result is e.g.:

403 Client Error: Forbidden for url: https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html/dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl

Poetry's Single Page Link Source forces /repo.html to a folder and then tries to build the relative link from it as /repo.html/file.whl. It is supposed to support the single HTML index page so I'll just fix the bug there — thank you both for getting me pointed in the right direction!
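
The effect can be reproduced with standard URL resolution. This sketch is not Poetry's code; it assumes the page lists the wheel as a relative link and only shows how a trailing slash on the base URL changes the result.

```python
from urllib.parse import urljoin

PAGE = "https://data.dgl.ai/wheels/torch-2.3/cu118/repo.html"
WHEEL = "dgl-2.3.0%2Bcu118-cp310-cp310-manylinux1_x86_64.whl"

# Resolving against the page itself replaces the last path segment (repo.html).
resolved_from_page = urljoin(PAGE, WHEEL)

# Treating the page as a folder (trailing slash) keeps repo.html in the path,
# which yields the URL from the 403 error above.
resolved_from_folder = urljoin(PAGE + "/", WHEEL)

print(resolved_from_page)
print(resolved_from_folder)
```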
