
Checkpoint fails in single node multi-GPU mode using DDP #1119

Closed
ghost opened this issue Mar 11, 2020 · 5 comments · Fixed by #1125
Labels
bug (Something isn't working) · good first issue (Good for newcomers) · help wanted (Open to be worked on)

Comments

@ghost

ghost commented Mar 11, 2020

🐛 Bug

Checkpoint fails in single node multi-GPU mode using DDP.

To Reproduce

python pl_examples/basic_examples/gpu_template.py --distributed_backend ddp --gpus 2
Epoch 2: : 700it [00:28, 42.69it/s, l
/home/xz/anaconda3/envs/x/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "gpu_template.py", line 79, in <module>
    main(hyperparams)
  File "gpu_template.py", line 40, in main
    trainer.fit(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    self.call_checkpoint_callback()
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
    self._del_model(delpath)
  File "/home/xz/anaconda3/envs/x/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/home/xz/pytorch-lightning/pl_examples/basic_examples/lightning_logs/version_1/checkpoints/epoch=0.ckpt'
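
The traceback ends in _del_model calling os.remove on a checkpoint path that no longer exists. In DDP spawn mode every GPU process runs the ModelCheckpoint callback, so when the callback rotates out an old checkpoint, one rank can try to delete a file that another rank has already removed. A minimal sketch of a defensive guard, purely hypothetical and not the actual change in #1125:

import os

def safe_remove(filepath):
    """Delete a checkpoint, tolerating the case where another DDP rank
    already removed it (hypothetical guard, not the fix from #1125)."""
    try:
        os.remove(filepath)
    except FileNotFoundError:
        # another spawned process won the race and deleted the file first
        pass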
ghost added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 11, 2020
@Borda
Member

Borda commented Mar 11, 2020

Yeah, we should run all the examples in CI too.

Borda added the good first issue (Good for newcomers) label on Mar 11, 2020
@sneiman
Contributor

sneiman commented Mar 11, 2020

I believe this happens with multiple GPUs as well, and it only seems to happen if ModelCheckpoint(save_top_k) is set greater than 1. Still converting models to 0.7.1, but wanted to share this ...
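
For reference, a minimal sketch of a setup that appears to trigger the deletion path, assuming the 0.7.x ModelCheckpoint/Trainer API (the LightningModule is left as a placeholder):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# save_top_k > 1 makes the callback rotate (i.e. delete) older checkpoints,
# which is where the FileNotFoundError shows up under multi-GPU DDP
checkpoint_callback = ModelCheckpoint(
    filepath='lightning_logs/checkpoints',
    save_top_k=2,
)

trainer = Trainer(
    gpus=2,
    distributed_backend='ddp',
    checkpoint_callback=checkpoint_callback,
)
# trainer.fit(model)  # `model` is whatever LightningModule is being trained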

@ghost
Author

ghost commented Mar 12, 2020

@Borda Fixed. The part of the code that caused the bug was removed a few commits back.

@sneiman
Contributor

sneiman commented Mar 12, 2020

Is the fix in master?

@ghost
Author

ghost commented Mar 12, 2020

The fix for the DDP checkpoint is in #1125; it is still waiting to be reviewed and merged.

I believe this happens with multiple GPUs as well, and it only seems to happen if ModelCheckpoint(save_top_k) is set greater than 1. Still converting models to 0.7.1, but wanted to share this ...

As for this issue, it seems to work fine on my side. Can you double-check?
