Training stuck running on the SLURM cluster with multiple gpus per node #6206

Closed
DerJFK opened this issue Feb 25, 2021 · 24 comments
Labels: bug, distributed, environment: slurm, waiting on author, won't fix

Comments

@DerJFK commented Feb 25, 2021

🐛 Bug

I am trying to train a model across multiple nodes on a SLURM cluster, where each node has two GPUs. I therefore use the following flags in the Trainer:

trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

and submit the job with sbatch run_training.sh. However, I end up with the following output and nothing further happens:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

Are there any other flags I am missing? Thanks for any help. Below you can find the contents of the files used above.

run_training.sh

#!/bin/bash
#SBATCH -o slurm_outfiles/autoencoder-%j-%A-%a.out
#SBATCH -N 2
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

srun python torch_ddp_toy.py

torch_ddp_toy.py

import pytorch_lightning as pl
import torch
from torch import nn

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(100)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=8)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=1)

    trainer = pl.Trainer(
      gpus=2, num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    trainer.fit(m, train_loader, val_loader)
  • PyTorch version 1.7.1
  • PyTorch Lightning version 1.2.0
  • CentOS Linux release 8.1.1911
  • PyTorch installed via conda
  • PyTorch Lightning via pip
  • slurm 20.02.3

UPDATE: added version of PyTorch Lightning

@DerJFK added the "bug" and "help wanted" labels on Feb 25, 2021
@awaelchli (Contributor)

Thanks for reporting. Could you update the issue with the PyTorch Lightning version you used, please?

@awaelchli added the "distributed" label on Feb 26, 2021
@hkmztrk commented Mar 1, 2021

Removing the num_nodes argument from the Trainer configuration solved the same problem for me.

@awaelchli (Contributor) commented Mar 1, 2021

Oh, that's interesting. @DerJFK, can you confirm that? Maybe this means we need to tweak the logic for determining the world size with and without the num_nodes argument.

If you print WORLD_SIZE, is it the expected number you selected (num tasks per node * num nodes)?
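
For example, something along these lines at the top of the training script should show it on every process (a minimal sketch; WORLD_SIZE may only be populated once Lightning has set up DDP, while SLURM_NTASKS is set by srun):

import os

# Print the values Lightning derives the world size from (illustrative only).
print("WORLD_SIZE =", os.environ.get("WORLD_SIZE"))
print("SLURM_NTASKS =", os.environ.get("SLURM_NTASKS"))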

@DerJFK (Author) commented Mar 3, 2021

@awaelchli When I remove the num_nodes flag, I do not get any error, and I get this output:

GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Set SLURM handle signals.
Set SLURM handle signals.

How can I print the WORLD_SIZE? Sorry for the slow response.

EDIT: I used -N 4 in SLURM this time, so there should be 4 nodes.

@dthiagarajan

Where is trainer.node_rank set? I see that it's used in the method that sets the ranks for all the processes.

@DerJFK (Author) commented Mar 30, 2021

Following the comments above, I updated my minimal example. The third code block below shows the output, but it does not look correct. If I use one GPU per node, it looks as expected, with every member registered.

The problem seems to occur when using multiple GPUs per node.

Below you can find the updated minimal working example.

import pytorch_lightning as pl
import torch
from torch import nn
import socket

class Module(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(5, 1)

    def configure_optimizers(self):
        return torch.optim.Adam(self.linear.parameters())

    def training_step(self, batch, batch_idx):
        return self.linear(batch).sum()

    def validation_step(self, batch, batch_idx):
        return batch_idx

    def validation_epoch_end(self, outputs):
        print("VALIDATING", len(outputs))


if __name__ == "__main__":
    m = Module()

    datasets = [torch.rand([5]) for __ in range(200)]
    train_loader = torch.utils.data.DataLoader(datasets, batch_size=16)
    val_loader = torch.utils.data.DataLoader(datasets, batch_size=16)

    trainer = pl.Trainer(
      gpus=-1, # num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )
    print(socket.gethostname(),'pre_node:',trainer.node_rank)
    print(socket.gethostname(),'pre_local:',trainer.local_rank)
    print(socket.gethostname(),'pre_global:',trainer.global_rank)
    trainer.fit(m, train_loader, val_loader)
    print(socket.gethostname(),'post_node:',trainer.node_rank)
    print(socket.gethostname(),'post_local:',trainer.local_rank)
    print(socket.gethostname(),'post_global:',trainer.global_rank)
#!/bin/bash
#SBATCH -o slurm_outfiles/out-%j-%A-%a.out
#SBATCH -N 4
#SBATCH -c 40
#SBATCH --gres=gpu:2
#SBATCH -t 24:00:00
#SBATCH --mail-type=ALL
#SBATCH --mem 60G

source activate kardio

srun python torch_ddp_toy.py
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
GPU available: True, used: True
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/2
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-126.cluster pre_node: 0
cluster-node-126.cluster pre_local: 0
cluster-node-126.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.58it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.58it/s, loss=-3.72, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.13it/s, loss=-4.5, v_num=655642] 
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.66it/s, loss=4.42, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.17it/s, loss=-4.21, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.67it/s, loss=-4.32, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.14it/s, loss=-4.45, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.59it/s, loss=-3.98, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 34.88it/s, loss=-3.98, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=-3.98, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=-3.98, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 530.32it/s, loss=-4.2, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 563.75it/s, loss=-4.38, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 571.66it/s, loss=-4.37, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 575.98it/s, loss=-4.4, v_num=655642] 
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 580.86it/s, loss=-4.39, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 584.93it/s, loss=-4.41, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 588.11it/s, loss=-4.14, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 930.07it/s, loss=-4.14, v_num=655642]

cluster-node-126.cluster pre_node: 0
cluster-node-126.cluster pre_local: 0
cluster-node-126.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-126.cluster post_node: 0
cluster-node-126.cluster post_local: 1
cluster-node-126.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 620.02it/s, loss=-4.14, v_num=655642]
cluster-node-126.cluster post_node: 0
cluster-node-126.cluster post_local: 0
cluster-node-126.cluster post_global: 0
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-046.cluster pre_node: 0
cluster-node-046.cluster pre_local: 0
cluster-node-046.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-023.cluster pre_node: 0
cluster-node-023.cluster pre_local: 0
cluster-node-023.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.61it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.61it/s, loss=-6.21, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.18it/s, loss=-5.73, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.74it/s, loss=-5.81, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.27it/s, loss=-5.46, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.78it/s, loss=-5.35, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.27it/s, loss=-5.29, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.74it/s, loss=-4.65, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 35.19it/s, loss=-4.65, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=-4.65, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=-4.65, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 522.52it/s, loss=-4.8, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 541.38it/s, loss=-4.95, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 547.61it/s, loss=-5.14, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 550.99it/s, loss=-5.23, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 562.21it/s, loss=-5.37, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 565.28it/s, loss=-5.37, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 569.39it/s, loss=-5.06, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 910.14it/s, loss=-5.06, v_num=655642]

cluster-node-046.cluster pre_node: 0
cluster-node-046.cluster pre_local: 0
cluster-node-046.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-046.cluster post_node: 0
cluster-node-046.cluster post_local: 1
cluster-node-046.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 694.77it/s, loss=-5.06, v_num=655642]
cluster-node-046.cluster post_node: 0
cluster-node-046.cluster post_local: 0
cluster-node-046.cluster post_global: 0

                                                              

Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.50it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:05,  2.50it/s, loss=4.61, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  4.97it/s, loss=4.69, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.42it/s, loss=4.71, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:01,  9.86it/s, loss=4.63, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 12.26it/s, loss=4.8, v_num=655642] 
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 14.65it/s, loss=4.86, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 17.02it/s, loss=4.35, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 33.77it/s, loss=4.35, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=4.35, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=4.35, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 513.32it/s, loss=4.45, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 531.90it/s, loss=4.37, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 538.49it/s, loss=4.4, v_num=655642] 
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 540.12it/s, loss=4.44, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 543.91it/s, loss=4.43, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 545.98it/s, loss=4.41, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 549.03it/s, loss=4.19, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 882.19it/s, loss=4.19, v_num=655642]

cluster-node-023.cluster pre_node: 0
cluster-node-023.cluster pre_local: 0
cluster-node-023.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-023.cluster post_node: 0
cluster-node-023.cluster post_local: 1
cluster-node-023.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 616.22it/s, loss=4.19, v_num=655642]
cluster-node-023.cluster post_node: 0
cluster-node-023.cluster post_local: 0
cluster-node-023.cluster post_global: 0
Set SLURM handle signals.
Set SLURM handle signals.

  | Name   | Type   | Params
----------------------------------
0 | linear | Linear | 6
----------------------------------
6         Trainable params
0         Non-trainable params
6         Total params
0.000     Total estimated model params size (MB)
cluster-node-004.cluster pre_node: 0
cluster-node-004.cluster pre_local: 0
cluster-node-004.cluster pre_global: 0

Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]VALIDATING 2



Training: 0it [00:00, ?it/s]
Training:   0%|          | 0/14 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s] 
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.66it/s]
Epoch 0:   7%|▋         | 1/14 [00:00<00:04,  2.66it/s, loss=9.36, v_num=655642]
Epoch 0:  14%|█▍        | 2/14 [00:00<00:02,  5.28it/s, loss=9.56, v_num=655642]
Epoch 0:  21%|██▏       | 3/14 [00:00<00:01,  7.88it/s, loss=9.37, v_num=655642]
Epoch 0:  29%|██▊       | 4/14 [00:00<00:00, 10.46it/s, loss=9.47, v_num=655642]
Epoch 0:  36%|███▌      | 5/14 [00:00<00:00, 13.02it/s, loss=9.13, v_num=655642]
Epoch 0:  43%|████▎     | 6/14 [00:00<00:00, 15.55it/s, loss=9.33, v_num=655642]
Epoch 0:  50%|█████     | 7/14 [00:00<00:00, 18.06it/s, loss=8.27, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 0: 100%|██████████| 14/14 [00:00<00:00, 35.82it/s, loss=8.27, v_num=655642]

Epoch 0:   0%|          | 0/14 [00:00<?, ?it/s, loss=8.27, v_num=655642]
Epoch 1:   0%|          | 0/14 [00:00<?, ?it/s, loss=8.27, v_num=655642]
Epoch 1:   7%|▋         | 1/14 [00:00<00:00, 516.09it/s, loss=8.44, v_num=655642]
Epoch 1:  14%|█▍        | 2/14 [00:00<00:00, 548.38it/s, loss=8.48, v_num=655642]
Epoch 1:  21%|██▏       | 3/14 [00:00<00:00, 555.22it/s, loss=8.37, v_num=655642]
Epoch 1:  29%|██▊       | 4/14 [00:00<00:00, 555.68it/s, loss=8.45, v_num=655642]
Epoch 1:  36%|███▌      | 5/14 [00:00<00:00, 556.23it/s, loss=8.47, v_num=655642]
Epoch 1:  43%|████▎     | 6/14 [00:00<00:00, 560.19it/s, loss=8.49, v_num=655642]
Epoch 1:  50%|█████     | 7/14 [00:00<00:00, 563.16it/s, loss=8.06, v_num=655642]

Validating: 0it [00:00, ?it/s]

Validating:   0%|          | 0/7 [00:00<?, ?it/s]VALIDATING 7

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 899.21it/s, loss=8.06, v_num=655642]

cluster-node-004.cluster pre_node: 0
cluster-node-004.cluster pre_local: 0
cluster-node-004.cluster pre_global: 0
VALIDATING 2
VALIDATING 7
VALIDATING 7
cluster-node-004.cluster post_node: 0
cluster-node-004.cluster post_local: 1
cluster-node-004.cluster post_global: 1

Epoch 1: 100%|██████████| 14/14 [00:00<00:00, 628.00it/s, loss=8.06, v_num=655642]
cluster-node-004.cluster post_node: 0
cluster-node-004.cluster post_local: 0
cluster-node-004.cluster post_global: 0

@awaelchli (Contributor) commented Apr 16, 2021

Hi, we recently fixed a problem with the environment variables in SLURM in #6941. It's in the 1.2.8 release. Please try it and let me know if that resolves your problem.

pip install -U pytorch-lightning

@edenlightning added the "waiting on author" label and removed the "help wanted" label on Apr 19, 2021
@DerJFK (Author) commented Apr 20, 2021

Thanks for your answer. I used the same code as above and just changed the parameters to use two GPUs and two nodes:

trainer = pl.Trainer(
      gpus=2,  num_nodes=2,
      accelerator='ddp',
      max_epochs=2
    )

The output is a little different, but it still gets stuck at the same stage.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4

@haideraltahan commented May 5, 2021

I have the same issue (@DerJFK) with 8 GPUs across 2 nodes on version 1.2.10. Even when removing the num_nodes parameter, the issue continues. When num_nodes is removed, it operates as num_nodes=1, which means the two nodes run the training separately rather than cooperating. The issue seems to originate from the fact that both nodes act as the first node: the local rank overrides the global rank, as the global_rank index repeats per node rather than being incremented across nodes. As a result, both nodes stall waiting for the other node to finish joining; see the sketch after the log below for one way to check this.

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/8
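
As a rough way to confirm the per-node rank duplication described above, each task could print the raw SLURM variables before the Trainer is constructed (a sketch; these are the per-task variables srun sets):

import os
import socket

# Compare these lines across hosts: if the same (SLURM_PROCID, SLURM_NODEID)
# pair appears on two different hosts, the job allocation itself is off;
# if they are unique, the duplication happens in Lightning's rank derivation.
print(
    socket.gethostname(),
    "SLURM_PROCID =", os.environ.get("SLURM_PROCID"),
    "SLURM_LOCALID =", os.environ.get("SLURM_LOCALID"),
    "SLURM_NODEID =", os.environ.get("SLURM_NODEID"),
)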

@haideraltahan commented May 5, 2021

Resolved the issue:

The problem occurs because the MASTER_ADDR environment variable ends up different on the two nodes, depending on the SLURM version. You can check the naming scheme of your own SLURM nodes with echo "NODELIST="${SLURM_NODELIST}.

After that, you can create your own SLURMEnvironment with a modified def master_address(self) -> str method such that it correctly parses the SLURM_NODELIST variable and returns the root node. Lastly, pass plugins=[SLURMEnvironment()] to your Trainer, using your modified SLURMEnvironment class.
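
For illustration, a custom environment along these lines could be used (a sketch only; the import path and method name assume the 1.2/1.3-era API referenced above, and the SLURM_NODELIST parsing must be adapted to your cluster's hostname scheme):

import os
import re

from pytorch_lightning.plugins.environments import SLURMEnvironment


class MySLURMEnvironment(SLURMEnvironment):
    # Only the master-address resolution is overridden; ranks, world size
    # and port handling stay as in the stock SLURMEnvironment.
    def master_address(self) -> str:
        nodelist = os.environ.get("SLURM_NODELIST", "127.0.0.1")
        if "[" in nodelist:
            # e.g. "cluster-node-[004,023-025]" -> "cluster-node-004"
            prefix, indices = nodelist.split("[", 1)
            root = prefix + re.split(r"[,\-\]]", indices, maxsplit=1)[0]
        else:
            # e.g. "node004,node023" or a single hostname
            root = nodelist.split(",")[0]
        os.environ["MASTER_ADDR"] = root
        return root

The modified environment is then passed to the Trainer, e.g. trainer = pl.Trainer(..., plugins=[MySLURMEnvironment()]).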

@awaelchli (Contributor)

@haideraltahan is our parsing of the SLURM master address wrong?

@haideraltahan commented May 5, 2021

@awaelchli yes, it seems there are variations in the node naming scheme across SLURM setups. I do not think there is a perfect solution, but the plugin resolves it.

As a continuation of the problem, I now get this error:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
initializing ddp: GLOBAL_RANK: 4, MEMBER: 5/8
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/8


Traceback (most recent call last):
  File "/lustre04/scratch/haltaha/CLAVR/main.py", line 267, in <module>
    fire.Fire(main)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/lustre04/scratch/haltaha/CLAVR/main.py", line 170, in main
    trainer.fit(net, data)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use


Traceback (most recent call last):
  File "main.py", line 267, in <module>
    fire.Fire(main)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 170, in main
    trainer.fit(net, data)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 496, in fit
    self.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 525, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 83, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 243, in pre_dispatch
    self.init_ddp_connection(self.global_rank, self.world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 226, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/haltaha/ENV/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 201, in _store_based_barrier
    worker_count = store.add(store_key, 0)
RuntimeError: Connection reset by peer

@haideraltahan

https://discuss.pytorch.org/t/multiprocessing-failed-with-torch-distributed-launch-module/33056/5

This post recommends removing rank and world_size from init_process_group, which the current version of Lightning passes explicitly (here):
"you should not set world_size and rank in torch.distributed.init_process_group, they are automatically set by torch.distributed.launch.

So please change that to dist.init_process_group(backend=backend, init_method=“env://”)

Also, you should not set WORLD_SIZE, RANK env variables in your code either since they will be set by launch utility."
author: teng-li

@awaelchli (Contributor)

I'm not sure I understand. Our Trainer launches the processes; in SLURM we don't use torch.distributed.launch.
The thread you linked is not about SLURM as far as I can tell.

@haideraltahan

I might also be getting confused. I will need to look into it further.

@awaelchli (Contributor)

awaelchli commented May 5, 2021

And I might also be the wrong person to ask here, since I have never trained on a SLURM cluster myself; I'm just trying to give high-level comments on how it's done in Lightning.
I'm currently setting up a SLURM cluster for us, and as I continue to learn I hope I can help resolve this issue one day.

@awaelchli (Contributor)

Hi again.
I had a chat in discussion #7275 with someone who is training successfully on SLURM. Regarding the parsing, they must be using a different SLURM version, but still, maybe you can follow my print statements there and double-check that rank, local rank, and world size are all set correctly before trainer.fit?

@awaelchli (Contributor)

@djberenberg If you are on a SLURM cluster, it should be using the SLURMEnvironment cluster environment, which reads the environment variables SLURM provides.
Make sure to set Trainer(gpus=..., num_nodes=...).

@awaelchli (Contributor) commented May 17, 2021

It requires the following environment variables from SLURM to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * gpus set in the Trainer.
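
As an illustration, a consistency check along these lines could be placed before building the Trainer (a sketch; the hard-coded numbers are assumptions and must match the submission script):

import os

# The number of tasks srun launches has to equal gpus-per-node * num_nodes,
# otherwise the derived ranks and world size do not line up and DDP init hangs.
gpus_per_node = 2   # must match Trainer(gpus=...) and --gres=gpu:2
num_nodes = 2       # must match Trainer(num_nodes=...) and #SBATCH -N 2

ntasks = int(os.environ["SLURM_NTASKS"])
assert ntasks == gpus_per_node * num_nodes, (
    f"SLURM_NTASKS={ntasks}, expected {gpus_per_node * num_nodes}; "
    "check #SBATCH --ntasks-per-node"
)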

@stale bot commented Jun 16, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale bot added the "won't fix" label on Jun 16, 2021
@haideraltahan commented Jun 18, 2021

It requires the following environment variables from SLURM to be detected:

SLURM_JOB_ID
SLURM_PROCID
SLURM_LOCALID
SLURM_NODEID
SLURM_NTASKS

SLURM_NTASKS must match num_nodes * gpus set in the Trainer.

This is what resolved the problem. These variables are important for it to work, at least on the SLURM version my institution is using. Here is the change to my allocation script that resolved it:

#SBATCH --tasks-per-node=4
#SBATCH --mem 185G
#SBATCH --cpus-per-task=8
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Before it was:

#SBATCH --mem 185G
#SBATCH -c 32
#SBATCH --job-name=train
#SBATCH -o slurm.%x.%j.out
#SBATCH --gres=gpu:v100l:4
#SBATCH --time=44:00:00

Hope this helps 😄
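
For completeness, the Trainer call that would presumably match such an allocation (hypothetical values: gpus corresponds to --tasks-per-node / the GPUs per node above, and num_nodes to however many nodes the job requests, which is not shown in the snippet):

import pytorch_lightning as pl

# Illustrative only; both numbers must mirror the sbatch allocation.
trainer = pl.Trainer(gpus=4, num_nodes=2, accelerator="ddp", max_epochs=2)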

@stale bot removed the "won't fix" label on Jun 18, 2021
@stale bot commented Aug 6, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@Naxel100

There appears to be another cause for this problem, but I am not able to find what it is. I am running this dummy code:

'''
This code creates a simple one-hidden-layer neural network trained on a dummy
dataset. The network is trained with the Trainer from PyTorch Lightning.
'''

import os
import torch
import torch.nn as nn
import pytorch_lightning as pl

class Dataset(torch.utils.data.Dataset):
    def __init__(self):
        self.x = torch.arange(-1, 1, 0.02)
        self.y = 3 * self.x + torch.randn(self.x.size()) * 0.33

    def __getitem__(self, index):
        # Add a feature dimension so the batch shape matches nn.Linear(1, 10).
        return self.x[index].unsqueeze(-1), self.y[index].unsqueeze(-1)

    def __len__(self):
        return self.x.size(0)


# Create a simple neural network
class Net(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 10)
        self.fc2 = nn.Linear(10, 1)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return x

    def training_step(self, batch, _):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.mse_loss(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)



# Create the train dataloader
train_loader = torch.utils.data.DataLoader(Dataset(), batch_size=64)

# Some prints that might be useful
print('SLURM_NTASKS =', os.environ['SLURM_NTASKS'])
print('SLURM_TASKS_PER_NODE =', os.environ['SLURM_TASKS_PER_NODE'])
print('SLURM_GPUS_PER_NODE =', os.environ['SLURM_GPUS_PER_NODE'])
print('SLURM_NNODES =', os.environ['SLURM_NNODES'])

# Create a model
model = Net()

# Create a trainer
trainer = pl.Trainer(max_epochs=10, gpus=2, num_nodes=4, strategy="ddp")

# Train the model
trainer.fit(model, train_loader)

And the code gets stuck:

Multiprocessing is handled by SLURM.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8

Just in case I was doing something wrong, I printed the environment variables indicated in the code above, and the result was:

SLURM_NTASKS = 8
SLURM_TASKS_PER_NODE = 2(x4)
SLURM_GPUS_PER_NODE = volta:2
SLURM_NNODES = 4

As you can see, num_nodes is equal to the number of nodes requested in SLURM, and SLURM_TASKS_PER_NODE is equal to the gpus requested in the Trainer. If I remove the num_nodes argument from the Trainer call, the code runs, but not as I would expect: it only creates two processes, so I understand it is only using one node instead of the 4 requested.
