Torchvision transform hangs when using python multiprocessing and model inference #4529

Open
theahura opened this issue Oct 3, 2021 · 3 comments


theahura commented Oct 3, 2021

🐛 Describe the bug

Torchvision transforms cause code to hang when using python multiprocessing and a model on inference.

In particular, I'm seeing hangs when a CLIP model has been created and entirely unrelated torch code runs in a separate process started with multiprocessing. An issue was filed against the CLIP repository (openai/CLIP#130), but I figured this should also be flagged on torchvision because I don't think the issue is specific to their model. The code hangs on img.permute((2, 0, 1)).contiguous() in transforms/functional.py, which is in turn called by the ToTensor transform via F.to_tensor(pic).
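One way to confirm which call a stuck process is blocked on, rather than relying on print statements, is the standard-library faulthandler module. The following is a minimal sketch, not part of the original report; the helper names `arm_hang_dump` and `disarm_hang_dump` and the log path are hypothetical.

```python
import faulthandler


def arm_hang_dump(timeout_s, log_path):
    """Dump all thread stacks to log_path if still running after timeout_s seconds."""
    # The file object must stay open until the dump fires or is cancelled.
    log_file = open(log_path, "w")
    # exit=False leaves the process running after the dump so it can be inspected.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=log_file, exit=False
    )
    return log_file


def disarm_hang_dump(log_file):
    """Cancel a pending dump and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()
```

Calling `arm_hang_dump(30, "child_hang.log")` as the first statement of the function run in the child process would write the Python stack of every thread to the log if the function is still running 30 seconds later.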

Minimal code sample in which I split the to_tensor transform and remove any unneeded parts:

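import torch
import clip
from PIL import Image
import multiprocessing as mp

# Building the model in the parent process is what triggers the hang,
# even though the child process never uses it.
model = clip.model.CLIP(512, 224, 12, 768, 32, 77, 49408, 512, 8, 12)


def test():
  # Manual expansion of F.to_tensor(pic), with a print before each step.
  print("GETTING IMAGE")
  im = Image.open("CLIP.png")
  print("CONVERTING")
  im = im.convert('RGB')
  print("MAKING TENSOR")
  img = torch.ByteTensor(torch.ByteStorage.from_buffer(im.tobytes()))
  print("VIEW")
  # PIL reports size as (width, height); reshape to (H, W, C).
  img = img.view(im.size[1], im.size[0], len(im.getbands()))
  print("PERMUTING")
  # Reorder to (C, H, W).
  img = img.permute((2, 0, 1))
  print("CONTIGUOUS")
  # The hang occurs here.
  img = img.contiguous()
  print("DIV")
  img = img.float().div(255)
  print("UNSQUEEZE")
  img = img.unsqueeze(0)
  return img


# Run the conversion in a separate process.
p = mp.Process(target=test, daemon=True)
p.start()
p.join()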

This code hangs on the img.contiguous() call, but only if the model is initialized at the top. If the model initialization is commented out, it works as expected. Note also that the function run in the child process does not use the model at all.
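The thread does not establish a root cause. One hypothesis worth testing (an assumption, not something confirmed here) is that the child inherits thread-related state created in the parent when the model is built, since the default multiprocessing start method on Linux with Python 3.8 is fork. The sketch below uses a hypothetical helper, `run_with_timeout`, to run a target under a chosen start method and report whether it finished, instead of blocking forever in join().

```python
import multiprocessing as mp


def run_with_timeout(target, args=(), start_method="spawn", timeout=60.0):
    """Run target(*args) in a child process.

    Returns True if the child exited cleanly within `timeout` seconds,
    False if it had to be terminated or exited with an error.
    """
    # get_context() selects the start method for this process only,
    # without changing the global default.
    ctx = mp.get_context(start_method)
    proc = ctx.Process(target=target, args=args, daemon=True)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # Still running after the timeout: treat it as hung and clean up.
        proc.terminate()
        proc.join()
        return False
    return proc.exitcode == 0
```

To try this against the sample above, call `run_with_timeout(test, start_method="spawn")` and compare the result with `start_method="fork"`. With "spawn" the child re-imports the main module, so the model construction and the call itself would need to sit under an `if __name__ == "__main__":` guard.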

Versions

Collecting environment information...
/home/amol/code/soot/debugging/clip_tests/env/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Jun  2 2021, 10:49:15)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] Could not collect

cc @vfdev-5 @datumbox


rsamf commented Mar 16, 2024

I'm dealing with the same problem 2+ years later


ed-cho commented May 22, 2024

Same here... any updates on this?


Adeniyilowee commented May 22, 2024

I am currently running into the same problem after building a new environment. torch.tensor() seems to be the culprit. Everything works perfectly in my old environment. Any update? Can anyone help?
