Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put values and offsets on the same device on gpu #45

Merged
merged 2 commits into from
Nov 10, 2022

Conversation

edknv
Copy link
Contributor

@edknv edknv commented Nov 10, 2022

Hot fix for NVIDIA-Merlin/Transformers4Rec#526.

Implementation

  • With sparse tensors, create the tensor for values on the same device as the tensor for offsets.

Testing

python -m pytest tests/unit/loader/test_torch_dataloader.py
Manually tested the notebook from T4R as described in NVIDIA-Merlin/Transformers4Rec#526.
Ideally we probably want to set up automated tests for multi-gpu settings but it will probably have to be a follow-up work after the current release.

@edknv edknv added the bug Something isn't working label Nov 10, 2022
@edknv edknv self-assigned this Nov 10, 2022
@edknv edknv marked this pull request as ready for review November 10, 2022 02:38
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/dataloader/review/pr-45

Comment on lines +26 to +27
passenv =
NR_USER
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tox blocks all environment variables by default. We need to pass this in order to skip this test that uses horovod. The test was skipped previously because horovod was not installed in the ci runner image, but horovod was recently installed in the ci runner image because of multi-gpu support in Models.

@edknv edknv requested review from jperez999, nzarif and rnyak November 10, 2022 02:47
@edknv edknv merged commit db6a1fe into NVIDIA-Merlin:main Nov 10, 2022
Copy link

@nzarif nzarif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with T4Rec multi-gpu training and the fix seems to work.

edknv added a commit that referenced this pull request Nov 14, 2022
* Put values and offsets on the same device on gpu

* pass environ var in tox
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants