Launching train/train.py directly without Slurm #161
Just use:
export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
python -m torch.distributed.launch --nproc_per_node=2 --use_env dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir |
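As a hedged aside on `--use_env`: with this flag (or with torchrun), the PyTorch launcher exports LOCAL_RANK as an environment variable instead of passing a `--local_rank` argument, so the script side looks roughly like the sketch below. Nothing here is dinov2-specific.

```python
# Sketch of the script side under --use_env / torchrun: the launcher exports
# RANK, WORLD_SIZE and LOCAL_RANK, so each worker picks its own GPU from the env.
import os
import torch

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
```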
@usryokousha Thanks! In fact, feeding the command above works.
By the way, I had to add the following in
Multi-GPU training definitely works, but weirdly it shows the per-GPU batch size. |
It's just a logging issue: it displays the batch size per GPU. Maybe we can use a clearer name for it. |
@patricklabatut and @usryokousha |
Look into the code in dinov2/distributed/__init__.py and simply change the:
then it works. Or, as vladchimescu mentioned, add one argument:
and start your training with:
Hope it helps :) |
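For readers landing here, a minimal sketch of the kind of environment-based initialization being described, assuming the standard variables exported by torchrun or torch.distributed.launch --use_env. This is illustrative, not the verbatim dinov2/distributed/__init__.py code.

```python
import os
import torch
import torch.distributed as dist

def init_distributed_from_env() -> int:
    # RANK / WORLD_SIZE / LOCAL_RANK / MASTER_ADDR / MASTER_PORT are exported
    # by torchrun (or torch.distributed.launch --use_env) for every worker.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
        rank=rank,
        world_size=world_size,
    )
    return local_rank
```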
both work, but |
Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem? |
Could you share a few more details about the multi-GPU training modifications? After trying both approaches, my code still trains on only a single GPU.
|
Maybe you need to set the sampler type from INFINITE to DISTRIBUTED |
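For illustration, a hedged sketch of that change, assuming the data loader in dinov2/train/train.py is built through make_data_loader with a SamplerType argument; the exact call site and keyword signature may differ in your checkout.

```python
# Hedged sketch (not the verbatim dinov2 code): switching the sampler from
# INFINITE to DISTRIBUTED is a one-argument change along these lines.
from dinov2.data import SamplerType, make_data_loader

def build_loader(dataset, batch_size_per_gpu: int, num_workers: int):
    return make_data_loader(
        dataset=dataset,
        batch_size=batch_size_per_gpu,
        num_workers=num_workers,
        shuffle=True,
        seed=0,
        sampler_type=SamplerType.DISTRIBUTED,  # was SamplerType.INFINITE
    )
```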
I had the same issue for multi-node training, but not for multi-GPU within one node. |
It depends on your inter-node connectivity |
I have InfiniBand for the inter-node connection, but I checked the whole training process and InfiniBand is not really used. I was wondering: if I don't have Slurm on the cluster, how can I enable distributed training at the same (or at least comparable) speed? Thank you |
If InfiniBand is not used, maybe there is a problem with the cluster configuration? I think you could copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster. |
It seems strange to me that it would be REALLY slow. It may not be a bad idea to use DISTRIBUTED instead of INFINITE, due to some slowdown per process with INFINITE; I wouldn't expect a major difference though. For a single node, however, you could just launch through SLURM.
|
Thank you! I have solved the issue; the cluster was indeed not properly set up. |
I was able to run train.py directly without SLURM thanks to this thread. However, now I am faced with the challenge of trying to use 2 GPUs to train the model on my dataset. When running the training, my reports show that the 2nd GPU isn't being used at all, or only very little. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks |
Same issue here @adipill04. I'm using torchrun:
It seems to only train on a single GPU. Did you find a solution for this? |
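One hedged way to narrow this down is to log what each worker sees right after distributed initialization: if the world size is 1, or every rank reports CUDA device 0 on a multi-GPU machine, the problem is on the launcher/environment side rather than in the model code. A minimal, dinov2-agnostic check:

```python
# Hedged diagnostic: print per-process rank/device info after init to confirm
# that all GPUs really participate.
import os
import torch
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    print(
        f"rank={dist.get_rank()}/{dist.get_world_size()} "
        f"local_rank={os.environ.get('LOCAL_RANK')} "
        f"cuda_device={torch.cuda.current_device()} "
        f"visible_gpus={torch.cuda.device_count()}"
    )
else:
    print("torch.distributed is not initialized in this process")
```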
Heyy, I am also facing the problem that training with 8 GPUs is slower than, or at best equal to, training with 2 GPUs. I tried both the INFINITE and DISTRIBUTED samplers, but neither helps. Any idea how to solve the problem? |
Are you using a cluster like SLURM? Are you also configuring an environment variable called 'NCCL_P2P_DISABLE'? I ran into similar issues to yours, with multi-GPU training not speeding up the training process, and had to disable P2P communication for it to even start. The underlying issue in my case was a wrong hardware configuration. I'd suggest using Weights & Biases (wandb.ai) to track model training metrics if you don't already; it might be useful for understanding what your bottleneck is. |
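For reference, a hedged sketch of the NCCL_P2P_DISABLE workaround mentioned above. NCCL reads the variable when its communicators are created, so it must be set before the process group is initialized; exporting it in the shell before launching works just as well.

```python
# Workaround sketch: disable NCCL peer-to-peer transport. NCCL reads this when
# its communicators are created, so set it before init_process_group / the first collective.
import os
import torch.distributed as dist

os.environ["NCCL_P2P_DISABLE"] = "1"
dist.init_process_group(backend="nccl", init_method="env://")  # assumes torchrun-style env vars
```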
Heyy, thanks for your reply @adipill04. No, I am not using a Slurm cluster, but rather DDP with torchrun. Within the node, the GPUs communicate with each other via NVLink. I tuned num_workers and found that the speed seems more normal now. |
Great! To add on: for num_workers, I felt that setting it to 4 × the number of GPUs yielded the fastest speeds on the dataloader side of things. |
Yes, that makes sense when you have an efficient data loader pipeline! |
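A hedged sketch of that heuristic on a plain PyTorch DataLoader; 4 workers per GPU is just a starting point, not a dinov2 recommendation. In dinov2 the value would go through the train.num_workers config entry, and the right number depends on CPU cores, storage, and augmentation cost.

```python
# Illustrative heuristic: start with ~4 dataloader workers per GPU and tune from there.
import torch
from torch.utils.data import DataLoader, TensorDataset

gpus_per_node = max(torch.cuda.device_count(), 1)
num_workers = 4 * gpus_per_node

dataset = TensorDataset(torch.zeros(256, 3, 32, 32))  # toy stand-in dataset
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=num_workers,
    pin_memory=True,
    persistent_workers=num_workers > 0,
)
```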
+1 |
I am also having speed issues: multi-node speed is way worse than single-node. It's almost as if the speed degrades linearly as nodes are added. |
Hi,
I am trying to launch the dinov2/train/train.py script directly without the Slurm scheduler. I use the following command to launch the training:
However, I can't seem to get it to work for training on multiple GPUs. I also tried using torchrun but haven't found the right argument combination.
I'm looking for a minimal example of launching train/train.py with FSDP, without the use of run/train.py, and at the same time I'd like to enable multi-GPU training using FSDP.