Training stuck running on the SLURM cluster with multiple gpus per node #6206
Comments
Thanks for reporting, could you update the issue with the PyTorch Lightning version you used, please?
Removing
Oh, that's interesting. @DerJFK can you confirm that? So maybe this means we need to tweak the logic for determining the world size with and without it. If you print the WORLD_SIZE, is it the expected number you selected (num tasks per node * num nodes)?
@awaelchli When I remove the
How can I print the WORLD_SIZE? Sorry for the slow response. EDIT: I used
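For anyone following along, a simple way to check this is to print the environment variable at the top of the training script; this is just a minimal sketch, not code taken from this thread:

```python
import os

# WORLD_SIZE is set for DDP (by Lightning or the cluster environment); printing
# the SLURM task count alongside it makes a mismatch easy to spot.
print("WORLD_SIZE   =", os.environ.get("WORLD_SIZE", "<not set>"))
print("SLURM_NTASKS =", os.environ.get("SLURM_NTASKS", "<not set>"))
```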
Where is
According to the comments above, I updated my minimal example. The third code block below shows the output, but it does not look correct. If I use one GPU per node, everything looks as expected and every member registers. The problem seems to occur only when using multiple GPUs per node. Below you find the updated minimal working example.

torch_ddp_toy.py:

```python
import pytorch_lightning as pl
import torch
from torch import nn
import socket


class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()
    datasets = [torch.rand([5]) for __ in range(200)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    trainer = pl.Trainer(
        gpus=-1,  # num_nodes=2,
        accelerator='ddp',
        max_epochs=2
    )
    print(socket.gethostname(), 'pre_node:', trainer.node_rank)
    print(socket.gethostname(), 'pre_local:', trainer.local_rank)
    print(socket.gethostname(), 'pre_global:', trainer.global_rank)
    trainer.fit(m, train_loader, val_loader)
    print(socket.gethostname(), 'post_node:', trainer.node_rank)
    print(socket.gethostname(), 'post_local:', trainer.local_rank)
    print(socket.gethostname(), 'post_global:', trainer.global_rank)
```

run_training.sh:

```bash
#!/bin/bash
#SBATCH -o slurm_outfiles/out-%j-%A-%a.out
#SBATCH -N 4
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

source activate kardio
srun python torch_ddp_toy.py
```
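One thing worth double-checking (not confirmed as the cause in this thread, but required by the Lightning multi-node SLURM docs) is that the number of SLURM tasks per node matches the number of GPUs Lightning is asked to use per node; the script above requests --gres=gpu:2 but does not set --ntasks-per-node. A hypothetical adjusted script, with gpus=2 and a matching num_nodes set in the Trainer, could look like this:

```bash
#!/bin/bash
#SBATCH -o slurm_outfiles/out-%j-%A-%a.out
#SBATCH -N 4
#SBATCH --ntasks-per-node=2   # one task per GPU, so srun launches 2 processes per node
#SBATCH -c 20                 # CPUs per task (hypothetical split of the original -c 40)
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

source activate kardio
srun python torch_ddp_toy.py
```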
Hi, we recently fixed a problem with the environment variables in SLURM in #6941. It's in the 1.2.8 release. Please try it and let me know if that resolves your problem.
Thanks for your answer. I used the same code as above and just changed the parameters to use two GPUs and two nodes:

```python
trainer = pl.Trainer(
    gpus=2, num_nodes=2,
    accelerator='ddp',
    max_epochs=2
)
```

The output is a little different, but it still gets stuck at the same stage.
I have the same issue (@DerJFK) with 8 GPUs and 2 nodes on version 1.2.10. Even when removing the
Resolved the issue: the problem occurs because the MASTER_ADDR environment variable differs between the nodes depending on the SLURM version. You can check the naming scheme for your own SLURM nodes, and then create your own SLURMEnvironment with a modification to how the master address is resolved.
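The actual plugin from this comment is not included above, so the following is only a rough sketch of the approach, assuming the PL 1.2-era SLURMEnvironment API and a simplified, hypothetical hostname parser; the parsing would have to be adapted to whatever `scontrol show hostnames $SLURM_NODELIST` reports on your cluster:

```python
import os
import re

from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import SLURMEnvironment


def _first_host(node_list: str) -> str:
    # Expand the first entry of SLURM_NODELIST, e.g. "gpu[013-016]" -> "gpu013",
    # "nodeA,nodeB" -> "nodeA". This is a simplified, hypothetical parser.
    match = re.match(r"([^\[,]+)(\[([^\]]+)\])?", node_list)
    if match is None:
        return node_list
    prefix, _, inside = match.groups()
    if inside is None:
        return prefix
    return prefix + inside.split(",")[0].split("-")[0]


class MySLURMEnvironment(SLURMEnvironment):
    def master_address(self):
        # Derive MASTER_ADDR from the node list instead of relying on the
        # default parsing, which may not match every SLURM naming scheme.
        root = _first_host(os.environ.get("SLURM_NODELIST", "127.0.0.1"))
        os.environ["MASTER_ADDR"] = root
        return root


trainer = Trainer(
    gpus=2,
    num_nodes=2,
    accelerator="ddp",
    plugins=[MySLURMEnvironment()],
    max_epochs=2,
)
```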
@haideraltahan is our parsing for SLURM master address wrong?
@awaelchli yes, it seems that there are variations in the naming scheme between SLURM installations. I do not think there is a perfect solution, but the plugin resolves it. A follow-up problem is that I now get this error:
https://discuss.pytorch.org/t/multiprocessing-failed-with-torch-distributed-launch-module/33056/5 This post recommends removing rank and world size from init_process_group, which the current version of Lightning passes (here): "So please change that to dist.init_process_group(backend=backend, init_method='env://'). Also, you should not set WORLD_SIZE, RANK env variables in your code either since they will be set by launch utility."
I'm not sure I understand. Our Trainer launches the processes. Under SLURM we don't use torch.distributed.launch.
I might also be getting confused. I will need to look into it further.
And I might also be the wrong person to ask here, since I have never trained on a SLURM cluster. I'm just trying to give high-level comments on how it's done in Lightning.
Hi again.
@djberenberg If you are on a SLURM cluster it should be using the SLURMEnvironment, and it should read the env variables available in SLURM. It requires the env variables from SLURM to be detected.
This part of the code determines if SLURM is detected: https://github.com/PyTorchLightning/pytorch-lightning/blob/e126649d19e85c449b007008361f10374878f2f4/pytorch_lightning/trainer/connectors/accelerator_connector.py#L636
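For debugging the detection step, one option is to dump the SLURM-related environment variables from inside the srun-launched process; this is just a debugging sketch, not code from the thread:

```python
import os

# SLURM exports these per task; Lightning's SLURM detection and its
# rank / world-size setup are derived from values like these.
for var in (
    "SLURM_JOB_ID",
    "SLURM_NODELIST",
    "SLURM_NTASKS",
    "SLURM_PROCID",
    "SLURM_LOCALID",
    "SLURM_NODEID",
    "MASTER_ADDR",
    "MASTER_PORT",
    "WORLD_SIZE",
):
    print(var, "=", os.environ.get(var, "<not set>"))
```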
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
This is what resolved the problem for me. These variables are important for it to work, at least on the SLURM version my institution is using. Here is the change in my allocation script that resolved the problem:

Before it was:

Hope this helps 😄
There appears to be another reason for the problem, but I am not able to find out what it is. I am executing a dummy script:
And the code gets stuck:
Just in case I was doing something wrong, I printed the global variables indicated in the code above, and the result was:
As you can see |
🐛 Bug
I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. Therefore, I use the following flags in the trainer:
and submit the job with sbatch run_training.sh. However, I end up with the following output and nothing happens further. Are there any other flags I am missing? Thanks for any help. Below you find the content of the files used above.
run_training.sh
torch_ddp_toy.py
UPDATE: added version of PyTorch Lightning