
Launching train/train.py directly without Slurm #161

Open
vladchimescu opened this issue Aug 14, 2023 · 24 comments
@vladchimescu

Hi,
I am trying to launch the dinov2/train/train.py script directly, without the Slurm scheduler. I use the following command to launch the training:

export CUDA_VISIBLE_DEVICES=0,1 && python dinov2/train/train.py --config_file myconfig.yaml --output-dir my_outputdir

However, I can't seem to get it to work for training on multiple GPUs. I also tried using torchrun but haven't found the right argument combination.

I'm looking for a minimal example of launching train/train.py directly (i.e. without run/train.py) with multi-GPU FSDP training enabled.

@usryokousha

Just use:

export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir --use-env 

@vladchimescu

@usryokousha Thanks!
I managed to get it running without the --use-env flag:

export CUDA_VISIBLE_DEVICES=0,1 && python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir 

In fact, passing --use-env (placed after the script) resulted in an error, as it is not a recognised argument of train.py. I guess one could add it to the argparser.

By the way, I had to add the following in dinov2/train/train.py:

parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.") 

Multi-GPU training definitely works, but weirdly it shows current_batch_size: 128.0000, which is my batch size per GPU. I would have expected it to show 256 (= 128 × 2 GPUs)?

@patricklabatut patricklabatut added the documentation Improvements or additions to documentation label Aug 23, 2023
@patricklabatut patricklabatut self-assigned this Aug 23, 2023
@qasfb

qasfb commented Aug 24, 2023

It's just a logging issue: it displays the batch size per GPU. Maybe we can use a better name.

@BenSpex

BenSpex commented Aug 29, 2023

@patricklabatut and @usryokousha
Any reason to use python -m torch.distributed.launch over torchrun? According to the PyTorch documentation, torchrun offers more fault tolerance, etc.

@GravityZL

Look into the code in dinov2/distributed/__init__.py: simply change

self.local_rank = int(os.environ["LOCAL_RANK"]) in the method _set_from_azure_env(self) to self.local_world_size = torch.cuda.device_count()

and then it works.
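
For illustration, here is a minimal sketch of that fallback idea, assuming the goal is simply to read the launcher-provided variables when present and otherwise infer the local world size from the visible GPUs (the function name is hypothetical, not the actual dinov2 code):

import os
import torch

def resolve_local_env():
    # Prefer the variables exported by torchrun / torch.distributed.launch,
    # and fall back to the visible GPU count when no launcher sets them.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", torch.cuda.device_count()))
    return local_rank, local_world_size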

Or as vladchimescu mentioned, add one argument:

parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")

and start your training with:

export CUDA_VISIBLE_DEVICES=xx,xx

Hope it helps :)

@GravityZL

Any reason to use python -m torch.distributed.launch over torchrun? According to the PyTorch documentation, torchrun offers more fault tolerance, etc.

Both work, but python -m torch.distributed.launch is deprecated in favor of torchrun.
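
For reference, an equivalent single-node torchrun launch would look roughly like this (2 GPUs, placeholder paths and config names):

export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
torchrun --standalone --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir

Note that torchrun passes the local rank through the LOCAL_RANK environment variable rather than a --local-rank command-line argument.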

@Shizhen-ZHAO

Shizhen-ZHAO commented Sep 30, 2023

Hi, I tried to launch train/train.py directly without Slurm using your methods,
but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird.
Did you encounter the same problem?

@GZ-YourZY

Could you share a few more details of the changes needed for multi-GPU training? After trying both approaches, my code still only trains on a single GPU.

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

@GravityZL

Could you share a few more details of the changes needed for multi-GPU training? After trying both approaches, my code still only trains on a single GPU.

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

Maybe you need to set the sampler type from INFINITE to DISTRIBUTED
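
For illustration, a minimal sketch of that change, assuming the sampler is selected in dinov2/train/train.py via the SamplerType enum exposed by dinov2.data (the exact variable name and location may differ in your checkout):

from dinov2.data import SamplerType

# was (roughly): sampler_type = SamplerType.INFINITE
sampler_type = SamplerType.DISTRIBUTED  # use the distributed-aware sampler instead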

@GravityZL

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

I had the same issue for multi-node training, but not for multi-GPU within one node.

@qasfb

qasfb commented Nov 8, 2023

It depends on your inter-node connectivity.

@GravityZL

It depends on your inter-node connectivity.

I have InfiniBand for the inter-node connection, but I checked the whole training process and the InfiniBand is not really used. I was wondering: if I don't have Slurm on the cluster, how can I enable distributed training with the same (or at least comparable) speed? Thank you.

@qasfb

qasfb commented Nov 8, 2023

If InfiniBand is not used, maybe there is a problem with the cluster configuration?
Are you able to run nccl-tests, and does it give the performance that it should? https://github.com/NVIDIA/nccl-tests

I think you could copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.
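
For reference, a typical way to build and run the all-reduce benchmark from nccl-tests on a single node with 2 GPUs (sizes are just examples; the build may need CUDA_HOME set):

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make
# all-reduce bandwidth test: message sizes from 8 B to 128 MB, doubling each step, on 2 GPUs
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2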

@usryokousha

usryokousha commented Nov 9, 2023 via email

@GravityZL

If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the performance that it should? https://github.com/NVIDIA/nccl-tests

I think you could copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.

Thank you! I have solved the issue; it was indeed that the cluster was not properly set up.

@adipill04

I was able to run train.py directly without Slurm thanks to this thread. However, I am now faced with the challenge of using 2 GPUs to train the model on my dataset. When running the training, my reports show that the 2nd GPU is barely used, if at all. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks

@ahmed1996said

I was able to run train.py directly without Slurm thanks to this thread. However, I am now faced with the challenge of using 2 GPUs to train the model on my dataset. When running the training, my reports show that the 2nd GPU is barely used, if at all. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks

Same issue here @adipill04. I'm using torchrun:

torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=<PATH_TO_YAML> --output-dir=<PATH_TO_OUTPUT>

It seems to only train on a single GPU. Did you find a solution for this?

@TumVink

TumVink commented Sep 4, 2024

Hey,

I am also facing the problem that training with 8 GPUs is slower than or equal to training with 2 GPUs. I tried both the INFINITE and DISTRIBUTED samplers, but neither helps.

Any idea to solve the problem?

@adipill04

Hey,

I am also facing the problem that training with 8 GPUs is slower than or equal to training with 2 GPUs. I tried both the INFINITE and DISTRIBUTED samplers, but neither helps.

Any idea to solve the problem?

Are you using a cluster like Slurm? Are you also configuring an environment variable called NCCL_P2P_DISABLE? I ran into some similar issues to yours, with multi-GPU training not speeding up the training process, and had to disable P2P communication for it to even start. The underlying issue in my case was a wrong hardware configuration.
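
For reference, disabling P2P is just an environment variable set before the launch command, e.g. with a torchrun launch like the one above (placeholder paths):

# Ask NCCL to avoid peer-to-peer (NVLink/PCIe) transfers; useful when diagnosing hardware issues
export NCCL_P2P_DISABLE=1
# Optional: verbose NCCL logging to see which transport is actually used
export NCCL_DEBUG=INFO
torchrun --standalone --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir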

I'd suggest using Weights & Biases (wandb.ai) to track model training metrics if you don't already. It might be useful for understanding what your bottleneck is.

@TumVink

TumVink commented Sep 4, 2024

Hey, thanks for your reply @adipill04.

No, I am not using a Slurm cluster, rather DDP with torchrun. Within the node, the GPUs communicate with each other via NVLink. I tuned num_workers and found that the speed seems more normal now.

@adipill04

Hey, thanks for your reply @adipill04.

No, I am not using a Slurm cluster, rather DDP with torchrun. Within the node, the GPUs communicate with each other via NVLink. I tuned num_workers and found that the speed seems more normal now.

Great! To add on: for num_workers, I found that setting it to 4 × the number of GPUs yielded the fastest speeds on the dataloader side of things.
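
For illustration, that heuristic in code (a sketch; where num_workers actually gets applied depends on your config and data loader setup):

import torch

# Heuristic from this thread: roughly 4 dataloader workers per GPU on the node
num_workers = 4 * torch.cuda.device_count()  # e.g. 8 workers for 2 GPUs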

@TumVink

TumVink commented Sep 5, 2024

Yes, that makes sense when you have an efficient data loader pipeline!

@sipie800

sipie800 commented Sep 6, 2024

+1

@zetaSaahil

I am also having speed issues: multi-node speed is way worse than single-node. It's almost as if the speed decreases linearly with the number of nodes.
