Single-Node Multi-GPU Training Stuck #6509
-
I saw the following warning:
So I tested with:
The GPUs are at 100%:
But nothing happens. What could it be?
-
It finally works when I use Horovod with the Gloo interface.
-
Thank you for your effort! I also tried your Horovod code and it worked, but I wanted to use 'ddp'. I finally found #4471 (comment): passing rank_zero_only=True to the self.log() function solved this issue.
-
I'm having a similar issue, even when using the example MNIST code from the home page README but changing to:
-
@andrewssobral I'm having pretty much the exact same issue that you had. Were you running on a SLURM HPC cluster? If so, how did you install Horovod?
-
I recommend moving the code

    if not os.path.exists("MNIST"):
        wget.download("https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.new.tar.gz", "MNIST.tar.gz")
        tar = tarfile.open("MNIST.tar.gz", "r:gz")
        tar.extractall()
        tar.close()

to the prepare_data method. Otherwise you run the risk of a race condition or corrupted files, since multiple workers attempt to download these files at the same time.
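For illustration, here is a minimal sketch of what that move could look like. It is not the original poster's code: the class name `MNISTData`, the `root` argument, and the injectable `fetch` callable are hypothetical, and `urllib.request.urlretrieve` is swapped in for `wget.download` so the snippet only needs the standard library. In a real project the class would presumably subclass `pytorch_lightning.LightningDataModule` (or the method would live on the `LightningModule`), where Lightning by default calls `prepare_data` from a single process per node.

```python
import os
import tarfile
import urllib.request

MNIST_URL = "https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.new.tar.gz"


class MNISTData:
    """Stand-in for a class that would subclass LightningDataModule (assumption)."""

    def __init__(self, root=".", url=MNIST_URL, fetch=urllib.request.urlretrieve):
        self.root = root    # directory that will contain the MNIST folder
        self.url = url
        self.fetch = fetch  # hypothetical hook: any callable(url, filename)

    def prepare_data(self):
        # Download and extract only if the dataset folder is missing.
        if os.path.exists(os.path.join(self.root, "MNIST")):
            return
        archive = os.path.join(self.root, "MNIST.tar.gz")
        self.fetch(self.url, archive)
        # The context manager closes the archive even if extraction fails.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(self.root)
```

With the download inside `prepare_data`, the module-level `if not os.path.exists("MNIST")` block can be deleted, so each spawned DDP worker no longer re-runs it when the script is imported.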
-
Hello everyone!
I am trying to launch a single-node multi-GPU training script. I don't get any warning or error message, but the script is stuck for a long time and nothing happens. Screenshot below:
The script was launched on a multi-GPU node (4 Tesla K80 GPUs), as you can see below:
nvidia-smi info header:
When the script is "running", I see the following behavior in nvtop:
I waited for about 30 minutes and nothing happened; I still have the following output:
Please see below my source code:
I'm using the following setup ($ pip freeze):
Does anyone know what's happening? Is something wrong in the source code? Thanks in advance!