
No such file or directory '.../epoch=0.ckpt' #1134

Closed
kyoungrok0517 opened this issue Mar 13, 2020 · 2 comments
Labels
help wanted (Open to be worked on) · question (Further information is requested)

Comments

@kyoungrok0517

🐛 Bug

After the first epoch finishes I see the error below. I'm using pytorch-lightning 0.7.1.

To Reproduce

Train using 0.7.1 with DDP on two GPUs and a ModelCheckpoint callback (a minimal sketch of such a setup follows).
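For concreteness, here is a hedged sketch of the kind of setup that hits this. The ToyModel and its random data are hypothetical stand-ins for the actual SparseNet model, the checkpoint callback is assumed to be the default one, and the hook and flag names follow the 0.7-era API as best I can tell; only the two-GPU DDP backend and the checkpointing matter for triggering the error.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    # Hypothetical stand-in for the real model: any module with a validation
    # loop that reports val_loss will exercise the checkpoint callback.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)


if __name__ == '__main__':
    # Two GPUs + DDP matches the traceback (mp.spawn over num_gpus); the
    # ModelCheckpoint callback is what later tries to delete 'epoch=0.ckpt'.
    trainer = pl.Trainer(gpus=2, distributed_backend='ddp', max_epochs=3)
    trainer.fit(ToyModel())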

Traceback (most recent call last):
  File "sparsenet_trainer.py", line 115, in <module>
    main(hparams)
  File "sparsenet_trainer.py", line 67, in main
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    self.call_checkpoint_callback()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
    self._del_model(delpath)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/data/Code/trec-2019-deep-learning/trec2019/model/sparsenet/default/version_5/checkpoints/epoch=0.ckpt'

/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
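The failure pattern looks like a plain filesystem race between the spawned DDP workers: every process runs the checkpoint callback, so whichever rank reaches the delete second finds the file already gone. A tiny, Lightning-free sketch of that race (the cleanup function and temp path are made up for illustration):

import os
import tempfile
import torch.multiprocessing as mp


def cleanup(rank, path):
    # Every spawned rank runs the same cleanup, much like the checkpoint
    # callback running in both DDP processes. Only the first os.remove can
    # succeed; the rank arriving second raises FileNotFoundError.
    os.remove(path)


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'epoch=0.ckpt')
    open(path, 'w').close()
    mp.spawn(cleanup, nprocs=2, args=(path,))

Running this, one worker's os.remove fails and mp.spawn re-raises the error in the parent, which is the same shape as the traceback above.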

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V

Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.7.1                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch
kyoungrok0517 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 13, 2020

ghost commented Mar 13, 2020

If this error occurs when you're running in DDP mode, the fix is in master. Refer to #1119.
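For anyone stuck on 0.7.1 in the meantime, one possible stopgap (this is not the actual patch referenced in #1119, just a sketch of the idea) is to make the checkpoint callback tolerate a file that another rank already removed, via the private _del_model hook visible in the traceback:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class TolerantModelCheckpoint(ModelCheckpoint):
    # Hypothetical workaround, not the upstream fix: ignore checkpoints that
    # another DDP rank already deleted instead of crashing.
    def _del_model(self, filepath):
        try:
            os.remove(filepath)
        except FileNotFoundError:
            # The other spawned process got here first; nothing left to do.
            pass


# Possible usage with 0.7.x-style Trainer flags (argument names assumed from
# that release, double-check against your installed version):
# trainer = Trainer(gpus=2, distributed_backend='ddp',
#                   checkpoint_callback=TolerantModelCheckpoint(filepath='checkpoints/'))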

kyoungrok0517 (author) commented Mar 13, 2020 via email

Borda added the question (Further information is requested) label and removed the bug (Something isn't working) label on Mar 14, 2020