[FEA] Provide a way/example to use cuCIM with Dask for DataLoading(Pytorch's Dataloader-like API) #99

Open
gigony opened this issue Sep 7, 2021 · 4 comments
Labels
feature request New feature or request

Comments

@gigony
Contributor

gigony commented Sep 7, 2021

Is your feature request related to a problem? Please describe.

PyTorch's DataLoader class is used in many deep learning training applications to load and pre-process training data before feeding it to the model.

Since PyTorch's DataLoader runs in multiple worker processes, it is hard to use cuCIM's scikit-image APIs (which use CUDA) in the DataLoader's pre-transforms due to CUDA context issues.

It would be nice to provide a way, or an example, of using cuCIM with deep learning frameworks such as PyTorch.

Describe the solution you'd like

PyTorch's DataLoader works like this. It would be nice to have a DataLoader-like utility class that mimics the behavior of PyTorch's DataLoader but is implemented with Dask (dask-cuda) to parallelize data loading, that is, a generator/iterator that yields batches of processed image data.
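As a rough illustration of the request, here is a minimal sketch of a DataLoader-like iterator. `BatchLoader` and `_process` are hypothetical names, not part of cuCIM, Dask, or PyTorch. The sketch only assumes an executor that offers `submit()` and returns futures with `.result()`. It uses the standard library's `ThreadPoolExecutor` as a stand-in; a `dask.distributed.Client` exposes the same two calls, so it might be usable in its place, but that is untested here.

```python
from concurrent.futures import ThreadPoolExecutor


def _process(samples, transform):
    # Runs on a worker: apply the per-sample transform to one batch.
    return [transform(sample) for sample in samples]


class BatchLoader:
    """Hypothetical DataLoader-like iterator (illustration only)."""

    def __init__(self, dataset, batch_size, transform, executor, prefetch=2):
        self.dataset = dataset        # assumed to support len() and slicing
        self.batch_size = batch_size
        self.transform = transform    # per-sample function run on the executor
        self.executor = executor      # anything with submit() returning futures
        self.prefetch = prefetch      # number of batches kept in flight

    def __iter__(self):
        pending = []
        for start in range(0, len(self.dataset), self.batch_size):
            samples = self.dataset[start:start + self.batch_size]
            pending.append(self.executor.submit(_process, samples, self.transform))
            # Once enough batches are in flight, hand out the oldest one.
            if len(pending) > self.prefetch:
                yield pending.pop(0).result()
        # Drain the batches that are still in flight, in order.
        for future in pending:
            yield future.result()


with ThreadPoolExecutor(max_workers=2) as pool:
    loader = BatchLoader(list(range(10)), 4, lambda x: x * x, pool)
    batches = list(loader)
```

Submitting a module-level function together with the batch data, rather than a bound method, keeps the loader object itself (and the executor it holds) out of whatever gets shipped to workers.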

Describe alternatives you've considered

  • To use cuCIM in the training pipeline, we currently move the GPU-accelerated pre-transforms out of the PyTorch DataLoader's transform chain (built with Compose) and into the main thread. The GPU-based batch pre-transforms run right after the DataLoader returns the CPU-loaded and CPU-pre-transformed training data, and right before that data is fed to the model. This avoids the CUDA context issues.
  • It would be good to also provide an example of that approach.
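The workaround in the first bullet can be sketched without any framework. `with_main_thread_transform` is a hypothetical helper, and both the loader and the batch transform below are stand-ins. In a real pipeline the loader would be a PyTorch DataLoader doing CPU-only work, and the batch transform would move the batch to the GPU and apply the cuCIM operations.

```python
def with_main_thread_transform(loader, batch_transform):
    """Yield batches from `loader`, applying `batch_transform` in the consumer."""
    for batch in loader:
        # This line executes in whichever thread/process iterates the generator
        # (the training loop), not in the loader's workers, so any CUDA context
        # the transform needs is created there.
        yield batch_transform(batch)


# Stand-ins: a list of already CPU-loaded batches and a trivial batch transform.
cpu_batches = [[1, 2], [3]]
scaled = list(with_main_thread_transform(cpu_batches, lambda b: [x * 10 for x in b]))
```

The training loop then iterates over the wrapped generator instead of the loader itself, so the loader's workers never touch the GPU.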

Additional context

Relevant information regarding CuPy+PyTorch.

Using Numba to get the CUDA context.

@gigony gigony added the feature request New feature or request label Sep 7, 2021
@NV-jpt

NV-jpt commented Sep 14, 2021

I would be happy to take a look at this!

@rjzamora
Member

NVTabular's PyTorch dataloader (which is built on Dask) may be a good reference here. I'll be happy to help advise on this work and clarify what NVTabular is (and isn't) doing.

@NV-jpt

NV-jpt commented Sep 22, 2021

Thank you, @rjzamora !

@NV-jpt

NV-jpt commented Dec 16, 2021

Unfortunately, I have come across some issues with the cuCIM codebase that are blocking this Dask-based DataLoading solution.

For example, the array transformation in the cuCIM transform image_rotate_90() should be compatible with a Dask Array input, but there is an explicit type check here that raises a TypeError whenever a cuCIM transform is applied to a Dask Array.

To allow Dask to schedule cuCIM operations, we will likely want to turn this check into more of a duck-type check that tests for the necessary API interface, since Dask Arrays should be able to pass such a check.
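A minimal sketch of what such a duck-type check could look like follows. The attribute list (`shape`, `dtype`, `ndim`) is an assumption about the "necessary API interface", not what cuCIM actually requires. `is_array_like`, `check_array_like`, and `FakeLazyArray` are hypothetical names used only for illustration.

```python
REQUIRED_ARRAY_ATTRS = ("shape", "dtype", "ndim")  # assumed minimal interface


def is_array_like(obj, required=REQUIRED_ARRAY_ATTRS):
    """Return True if `obj` exposes every required attribute, whatever its type."""
    return all(hasattr(obj, name) for name in required)


def check_array_like(obj):
    """Raise TypeError based on the missing interface, not on the concrete class."""
    missing = [name for name in REQUIRED_ARRAY_ATTRS if not hasattr(obj, name)]
    if missing:
        raise TypeError(
            f"expected an array-like object, missing attributes: {missing}"
        )
    return obj


class FakeLazyArray:
    """Stand-in for a lazy array type that is not a NumPy or CuPy array."""
    shape = (4, 4)
    dtype = "uint8"
    ndim = 2


lazy_ok = is_array_like(FakeLazyArray())
list_ok = is_array_like([1, 2, 3])
```

Dask Arrays expose `shape`, `dtype`, and `ndim`, so they would pass a check of this form, whereas a plain Python list would still be rejected with a TypeError.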

@caryr35 caryr35 added this to cucim Nov 15, 2022
4 participants