Training stuck running on the SLURM cluster with multiple gpus per node #6206

Closed
DerJFK opened this issue Feb 25, 2021 · 24 comments
Labels: bug, distributed, environment: slurm, waiting on author, won't fix

Comments

@DerJFK commented Feb 25, 2021

🐛 Bug

I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. I therefore use the following flags in the Trainer:

trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

and submit the job with sbatch run_training.sh. However, I end up with the following output and nothing further happens:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.

run_training.sh

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

srun python torch_ddp_toy.py

torch_ddp_toy.py

import pytorch_lightning as pl
import torch
from torch import nn

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)

    trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)
  • PyTorch version 1.7.1
  • PyTorch Lightning version 1.2.0
  • CentOS Linux release 8.1.1911
  • PyTorch installed via conda
  • PyTorch Lightning via pip
  • slurm 20.02.3

UPDATE: added version of PyTorch Lightning

@DerJFK added the "bug" and "help wanted" labels on Feb 25, 2021
@awaelchli (Contributor)

Thanks for reporting. Could you update the issue with the PyTorch Lightning version you used, please?

@awaelchli added the "distributed" label on Feb 26, 2021
@hkmztrk commented Mar 1, 2021

Removing the num_nodes argument from the Trainer configuration solved the same problem for me.

@awaelchli (Contributor) commented Mar 1, 2021

Oh, that's interesting. @DerJFK, can you confirm that? Maybe this means we need to tweak the logic for determining the world size with and without the num_nodes argument.

If you print WORLD_SIZE, is it the expected number you selected (num tasks per node * num nodes)?
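
For example, something along these lines at the top of the training script should show it on every process (a minimal sketch; WORLD_SIZE may only be populated once Lightning has set up DDP, while SLURM_NTASKS is set by srun):

import os

# Print the values Lightning derives the world size from (illustrative only).
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"))
print("SLURM_NTASKS =", os.environ.get("SLURM_NTASKS"))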

@DerJFK (Author) commented Mar 3, 2021

@awaelchli When I remove the num_nodes flag, I do not get any error, and I get this output:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Set SLURM handle signals.
Set SLURM handle signals.

How can I print the WORLD_SIZE? Sorry for the slow response.

EDIT: I used -N 4 in SLURM this time, so there should be 4 nodes.

@dthiagarajan

Where is trainer.node_rank set? I see that it's used in the method that sets the ranks for all the processes.

@DerJFK (Author) commented Mar 30, 2021

Following the comments above, I updated my minimal example. The third code block below shows the output, but it does not look correct. If I use one GPU per node, it looks as expected, with every member registered.

The problem seems to occur when using multiple GPUs per node.

Below you can find the updated minimal working example.

import pytorch_lightning as pl
import torch
from torch import nn
import socket

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(200)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=16)

    trainer = pl.Trainer(
      gpus=-1, # num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    print(socket.gethostname(),'pre_node:',trainer.node_rank)
    print(socket.gethostname(),'pre_local:',trainer.local_rank)
    print(socket.gethostname(),'pre_global:',trainer.global_rank)
    trainer.fit(m, train_loader, val_loader)
    print(socket.gethostname(),'post_node:',trainer.node_rank)
    print(socket.gethostname(),'post_local:',trainer.local_rank)
    print(socket.gethostname(),'post_global:',trainer.global_rank)
#!/bin/bash
#SBATCH -o slurm_outfiles/out-%j-%A-%a.out
#SBATCH -N 4
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

source activate kardio

srun python torch_ddp_toy.py
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-126.cluster pre_node: 0
cluster-node-126.cluster pre_local: 0
cluster-node-126.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.58it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.58it/s, loss=-3.72, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.13it/s, loss=-4.5, v_num=655642] 
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.66it/s, loss=4.42, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.17it/s, loss=-4.21, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.67it/s, loss=-4.32, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.14it/s, loss=-4.45, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.59it/s, loss=-3.98, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 34.88it/s, loss=-3.98, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=-3.98, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=-3.98, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 530.32it/s, loss=-4.2, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 563.75it/s, loss=-4.38, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 571.66it/s, loss=-4.37, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 575.98it/s, loss=-4.4, v_num=655642] 
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 580.86it/s, loss=-4.39, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 584.93it/s, loss=-4.41, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 588.11it/s, loss=-4.14, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 930.07it/s, loss=-4.14, v_num=655642]

cluster-node-126.cluster pre_node: 0
cluster-node-126.cluster pre_local: 0
cluster-node-126.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-126.cluster post_node: 0
cluster-node-126.cluster post_local: 1
cluster-node-126.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 620.02it/s, loss=-4.14, v_num=655642]
cluster-node-126.cluster post_node: 0
cluster-node-126.cluster post_local: 0
cluster-node-126.cluster post_global: 0
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-046.cluster pre_node: 0
cluster-node-046.cluster pre_local: 0
cluster-node-046.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-023.cluster pre_node: 0
cluster-node-023.cluster pre_local: 0
cluster-node-023.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.61it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.61it/s, loss=-6.21, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.18it/s, loss=-5.73, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.74it/s, loss=-5.81, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.27it/s, loss=-5.46, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.78it/s, loss=-5.35, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.27it/s, loss=-5.29, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.74it/s, loss=-4.65, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 35.19it/s, loss=-4.65, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=-4.65, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=-4.65, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 522.52it/s, loss=-4.8, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 541.38it/s, loss=-4.95, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 547.61it/s, loss=-5.14, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 550.99it/s, loss=-5.23, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 562.21it/s, loss=-5.37, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 565.28it/s, loss=-5.37, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 569.39it/s, loss=-5.06, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 910.14it/s, loss=-5.06, v_num=655642]

cluster-node-046.cluster pre_node: 0
cluster-node-046.cluster pre_local: 0
cluster-node-046.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-046.cluster post_node: 0
cluster-node-046.cluster post_local: 1
cluster-node-046.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 694.77it/s, loss=-5.06, v_num=655642]
cluster-node-046.cluster post_node: 0
cluster-node-046.cluster post_local: 0
cluster-node-046.cluster post_global: 0

                                                              

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.50it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.50it/s, loss=4.61, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  4.97it/s, loss=4.69, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.42it/s, loss=4.71, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:01,  9.86it/s, loss=4.63, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.26it/s, loss=4.8, v_num=655642] 
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 14.65it/s, loss=4.86, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.02it/s, loss=4.35, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 33.77it/s, loss=4.35, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=4.35, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=4.35, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 513.32it/s, loss=4.45, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 531.90it/s, loss=4.37, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 538.49it/s, loss=4.4, v_num=655642] 
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 540.12it/s, loss=4.44, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 543.91it/s, loss=4.43, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 545.98it/s, loss=4.41, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 549.03it/s, loss=4.19, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 882.19it/s, loss=4.19, v_num=655642]

cluster-node-023.cluster pre_node: 0
cluster-node-023.cluster pre_local: 0
cluster-node-023.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-023.cluster post_node: 0
cluster-node-023.cluster post_local: 1
cluster-node-023.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 616.22it/s, loss=4.19, v_num=655642]
cluster-node-023.cluster post_node: 0
cluster-node-023.cluster post_local: 0
cluster-node-023.cluster post_global: 0
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-004.cluster pre_node: 0
cluster-node-004.cluster pre_local: 0
cluster-node-004.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.66it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.66it/s, loss=9.36, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.28it/s, loss=9.56, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.88it/s, loss=9.37, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.46it/s, loss=9.47, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 13.02it/s, loss=9.13, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.55it/s, loss=9.33, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 18.06it/s, loss=8.27, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 35.82it/s, loss=8.27, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=8.27, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=8.27, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 516.09it/s, loss=8.44, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 548.38it/s, loss=8.48, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 555.22it/s, loss=8.37, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 555.68it/s, loss=8.45, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 556.23it/s, loss=8.47, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 560.19it/s, loss=8.49, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 563.16it/s, loss=8.06, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 899.21it/s, loss=8.06, v_num=655642]

cluster-node-004.cluster pre_node: 0
cluster-node-004.cluster pre_local: 0
cluster-node-004.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-004.cluster post_node: 0
cluster-node-004.cluster post_local: 1
cluster-node-004.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 628.00it/s, loss=8.06, v_num=655642]
cluster-node-004.cluster post_node: 0
cluster-node-004.cluster post_local: 0
cluster-node-004.cluster post_global: 0

@awaelchli (Contributor) commented Apr 16, 2021

Hi, we recently fixed a problem with the environment variables in SLURM in #6941. It's in the 1.2.8 release. Please try it and let me know if that resolves your problem.

pip install -U pytorch-lightning

@edenlightning added the "waiting on author" label and removed the "help wanted" label on Apr 19, 2021
@DerJFK (Author) commented Apr 20, 2021

Thanks for your answer. I used the same code as above and just changed the parameters to use two GPUs and two nodes:

trainer = pl.Trainer(
      gpus=2,  num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

The output is a little different, but it still gets stuck at the same stage.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

@haideraltahan commented May 5, 2021

I have the same issue (@DerJFK) with 8 GPUs across 2 nodes on version 1.2.10. Even when removing the num_nodes parameter, the issue continues. When num_nodes is removed, it operates as num_nodes=1, which means the two nodes run the training separately rather than cooperating. The issue seems to originate from the fact that both nodes act as the first node: the local rank overrides the global rank, as the global_rank index repeats per node rather than being incremented across nodes. As a result, both nodes stall waiting for the other node to finish joining; see the sketch after the log below for one way to check this.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
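
As a rough way to confirm the per-node rank duplication described above, each task could print the raw SLURM variables before the Trainer is constructed (a sketch; these are the per-task variables srun sets):

import os
import socket

# Compare these lines across hosts: if the same (SLURM_PROCID, SLURM_NODEID)
# pair appears on two different hosts, the job allocation itself is off;
# if they are unique, the duplication happens in Lightning's rank derivation.
print(
    socket.gethostname(),
    "SLURM_PROCID =", os.environ.get("SLURM_PROCID"),
    "SLURM_LOCALID =", os.environ.get("SLURM_LOCALID"),
    "SLURM_NODEID =", os.environ.get("SLURM_NODEID"),
)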

@haideraltahan commented May 5, 2021

Resolved the issue:

The problem occurs because the MASTER_ADDR environment variable ends up different on the two nodes, depending on the SLURM version. You can check the naming scheme of your own SLURM nodes with echo "NODELIST="${SLURM_NODELIST}.

After that, you can create your own SLURMEnvironment with a modified def master_address(self) -> str method such that it correctly parses the SLURM_NODELIST variable and returns the root node. Lastly, pass plugins=[SLURMEnvironment()] to your Trainer, using your modified SLURMEnvironment class.
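
For illustration, a custom environment along these lines could be used (a sketch only; the import path and method name assume the 1.2/1.3-era API referenced above, and the SLURM_NODELIST parsing must be adapted to your cluster's hostname scheme):

import os
import re

from pytorch_lightning.plugins.environments import SLURMEnvironment


class MySLURMEnvironment(SLURMEnvironment):
    # Only the master-address resolution is overridden; ranks, world size
    # and port handling stay as in the stock SLURMEnvironment.
    def master_address(self) -> str:
        nodelist = os.environ.get("SLURM_NODELIST", "127.0.0.1")
        if "[" in nodelist:
            # e.g. "cluster-node-[004,023-025]" -> "cluster-node-004"
            prefix, indices = nodelist.split("[", 1)
            root = prefix + re.split(r"[,\-\]]", indices, maxsplit=1)[0]
        else:
            # e.g. "node004,node023" or a single hostname
            root = nodelist.split(",")[0]
        os.environ["MASTER_ADDR"] = root
        return root

The modified environment is then passed to the Trainer, e.g. trainer = pl.Trainer(..., plugins=[MySLURMEnvironment()]).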

@awaelchli (Contributor)

@haideraltahan is our parsing of the SLURM master address wrong?

@haideraltahan commented May 5, 2021

@awaelchli yes, it seems there are variations in the node naming scheme across SLURM setups. I do not think there is a perfect solution, but the plugin resolves it.

As a continuation of the problem, I now get this error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8


Traceback (most recent call last):
  File "/lustre04/scratch/haltaha/CLAVR/main.py", line 267, in <module>
    fire.Fire(main)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/lustre04/scratch/haltaha/CLAVR/main.py", line 170, in main
    trainer.fit(net, data)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use


Traceback (most recent call last):
  File "main.py", line 267, in <module>
    fire.Fire(main)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 170, in main
    trainer.fit(net, data)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 201, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer

@haideraltahan

https://discuss.pytorch.org/t/multiprocessing-failed-with-torch-distributed-launch-module/33056/5

This post recommends removing rank and world_size from init_process_group, which the current version of Lightning passes explicitly (here):
"you should not set world_size and rank in torch.distributed.init_process_group, they are automatically set by torch.distributed.launch.

So please change that to dist.init_process_group(backend=backend, init_method=“env://”)

Also, you should not set WORLD_SIZE, RANK env variables in your code either since they will be set by launch utility."
author: teng-li

@awaelchli (Contributor)

I'm not sure I understand. Our Trainer launches the processes; in SLURM we don't use torch.distributed.launch.
The thread you linked is not about SLURM as far as I can tell.

@haideraltahan

I might also be getting confused. I will need to look into it further.

@awaelchli (Contributor)

awaelchli commented May 5, 2021

And I might also be the wrong person to ask here, since I have never trained on a SLURM cluster myself; I'm just trying to give high-level comments on how it's done in Lightning.
I'm currently setting up a SLURM cluster for us, and as I continue to learn I hope I can help resolve this issue one day.

@awaelchli (Contributor)

Hi again.
I had a chat in discussion #7275 with someone who is training successfully on SLURM. Regarding the parsing, they must be using a different SLURM version, but still, maybe you can follow my print statements there and double-check that rank, local rank, and world size are all set correctly before trainer.fit?

@awaelchli (Contributor)

@djberenberg If you are on a SLURM cluster, it should be using the SLURMEnvironment cluster environment, which reads the environment variables SLURM provides.
Make sure to set Trainer(gpus=..., num_nodes=...).

@awaelchli (Contributor) commented May 17, 2021

It requires the following environment variables from SLURM to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * gpus set in the Trainer.
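
As an illustration, a consistency check along these lines could be placed before building the Trainer (a sketch; the hard-coded numbers are assumptions and must match the submission script):

import os

# The number of tasks srun launches has to equal gpus-per-node * num_nodes,
# otherwise the derived ranks and world size do not line up and DDP init hangs.
gpus_per_node = 2   # must match Trainer(gpus=...) and --gres=gpu:2
num_nodes = 2       # must match Trainer(num_nodes=...) and #SBATCH -N 2

ntasks = int(os.environ["SLURM_NTASKS"])
assert ntasks == gpus_per_node * num_nodes, (
    f"SLURM_NTASKS={ntasks}, expected {gpus_per_node * num_nodes}; "
    "check #SBATCH --ntasks-per-node"
)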

@stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" label on Jun 16, 2021
@haideraltahan commented Jun 18, 2021

It requires the following environment variables from SLURM to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * gpus set in the Trainer.

This is what resolved the problem. These variables are important for it to work, at least on the SLURM version my institution is using. Here is the change to my allocation script that resolved it:

#SBATCH --tasks-per-node=4
#SBATCH --mem 185G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Before it was:

#SBATCH --mem 185G
#SBATCH -c 32
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Hope this helps 😄
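
For completeness, the Trainer call that would presumably match such an allocation (hypothetical values: gpus corresponds to --tasks-per-node / the GPUs per node above, and num_nodes to however many nodes the job requests, which is not shown in the snippet):

import pytorch_lightning as pl

# Illustrative only; both numbers must mirror the sbatch allocation.
trainer = pl.Trainer(gpus=4, num_nodes=2, accelerator="ddp", max_epochs=2)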

@stale bot removed the "won't fix" label on Jun 18, 2021
@stale bot commented Aug 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@Naxel100

There appears to be another cause for this problem, but I am not able to find what it is. I am running this dummy code:

'''
This code creates a simple one-hidden-layer neural network trained on a dummy
dataset. The network is trained with the Trainer from PyTorch Lightning.
'''

import os
import torch
import torch.nn as nn
import pytorch_lightning as pl

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.x = torch.arange(-1, 1, 0.02)
        self.y = 3 * self.x + torch.randn(self.x.size()) * 0.33

    def __getitem__(self, index):
        # Add a feature dimension so the batch shape matches nn.Linear(1, 10).
        return self.x[index].unsqueeze(-1), self.y[index].unsqueeze(-1)

    def __len__(self):
        return self.x.size(0)


# Create a simple neural network
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def training_step(self, batch, _):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)



# Create the train dataloader
train_loader = torch.utils.data.DataLoader(Dataset(), batch_size=64)

# Some prints that might be useful
print('SLURM_NTASKS =', os.environ['SLURM_NTASKS'])
print('SLURM_TASKS_PER_NODE =', os.environ['SLURM_TASKS_PER_NODE'])
print('SLURM_GPUS_PER_NODE =', os.environ['SLURM_GPUS_PER_NODE'])
print('SLURM_NNODES =', os.environ['SLURM_NNODES'])

# Create a model
model = Net()

# Create a trainer
trainer = pl.Trainer(max_epochs=10, gpus=2, num_nodes=4, strategy="ddp")

# Train the model
trainer.fit(model, train_loader)

And the code gets stuck:

Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8

Just in case I was doing something wrong, I printed the environment variables indicated in the code above, and the result was:

SLURM_NTASKS = 8
SLURM_TASKS_PER_NODE = 2(x4)
SLURM_GPUS_PER_NODE = volta:2
SLURM_NNODES = 4

As you can see, num_nodes is equal to the number of nodes requested in SLURM, and SLURM_TASKS_PER_NODE is equal to the gpus requested in the Trainer. If I remove the num_nodes argument from the Trainer call, the code runs, but not as I would expect: it only creates two processes, so I understand it is only using one node instead of the 4 requested.
