
No such file or directory '.../epoch=0.ckpt' #1134

Closed
kyoungrok0517 opened this issue Mar 13, 2020 · 2 comments
Labels
help wanted (Open to be worked on) · question (Further information is requested)

Comments

@kyoungrok0517

🐛 Bug

After the first epoch finishes I see the error below. I'm using pytorch-lightning 0.7.1.

To Reproduce

Train using 0.7.1 with DDP on two GPUs and a ModelCheckpoint callback (a minimal sketch of such a setup follows).
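For concreteness, here is a hedged sketch of the kind of setup that hits this. The ToyModel and its random data are hypothetical stand-ins for the actual SparseNet model, the checkpoint callback is assumed to be the default one, and the hook and flag names follow the 0.7-era API as best I can tell; only the two-GPU DDP backend and the checkpointing matter for triggering the error.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    # Hypothetical stand-in for the real model: any module with a validation
    # loop that reports val_loss will exercise the checkpoint callback.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return {'loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': torch.nn.functional.mse_loss(self(x), y)}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.stack([o['val_loss'] for o in outputs]).mean()}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

    def val_dataloader(self):
        return DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)


if __name__ == '__main__':
    # Two GPUs + DDP matches the traceback (mp.spawn over num_gpus); the
    # ModelCheckpoint callback is what later tries to delete 'epoch=0.ckpt'.
    trainer = pl.Trainer(gpus=2, distributed_backend='ddp', max_epochs=3)
    trainer.fit(ToyModel())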

Traceback (most recent call last):
  File "sparsenet_trainer.py", line 115, in <module>
    main(hparams)
  File "sparsenet_trainer.py", line 67, in main
    trainer.fit(model)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
    self.train()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
    self.run_training_epoch()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
    self.call_checkpoint_callback()
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
    self.checkpoint_callback.on_validation_end(self, self.get_model())
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
    self._do_check_save(filepath, current, epoch)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
    self._del_model(delpath)
  File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/data/Code/trec-2019-deep-learning/trec2019/model/sparsenet/default/version_5/checkpoints/epoch=0.ckpt'

/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
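The failure pattern looks like a plain filesystem race between the spawned DDP workers: every process runs the checkpoint callback, so whichever rank reaches the delete second finds the file already gone. A tiny, Lightning-free sketch of that race (the cleanup function and temp path are made up for illustration):

import os
import tempfile
import torch.multiprocessing as mp


def cleanup(rank, path):
    # Every spawned rank runs the same cleanup, much like the checkpoint
    # callback running in both DDP processes. Only the first os.remove can
    # succeed; the rank arriving second raises FileNotFoundError.
    os.remove(path)


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'epoch=0.ckpt')
    open(path, 'w').close()
    mp.spawn(cleanup, nprocs=2, args=(path,))

Running this, one worker's os.remove fails and mp.spawn re-raises the error in the parent, which is the same shape as the traceback above.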

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V

Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4

Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas                      1.0                         mkl
[conda] mkl                       2019.4                      243
[conda] mkl-service               2.3.0            py37he904b0f_0
[conda] mkl_fft                   1.0.14           py37ha843d7b_0
[conda] mkl_random                1.1.0            py37hd6b4f25_0
[conda] pytorch                   1.4.0           py3.7_cuda10.1.243_cudnn7.6.3_0    pytorch
[conda] pytorch-lightning         0.7.1                    pypi_0    pypi
[conda] torchtext                 0.5.0                    pypi_0    pypi
[conda] torchvision               0.5.0                py37_cu101    pytorch
kyoungrok0517 added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Mar 13, 2020

ghost commented Mar 13, 2020

If this error occurs when you're running in DDP mode, the fix is in master. Refer to #1119.
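For anyone stuck on 0.7.1 in the meantime, one possible stopgap (this is not the actual patch referenced in #1119, just a sketch of the idea) is to make the checkpoint callback tolerate a file that another rank already removed, via the private _del_model hook visible in the traceback:

import os

from pytorch_lightning.callbacks import ModelCheckpoint


class TolerantModelCheckpoint(ModelCheckpoint):
    # Hypothetical workaround, not the upstream fix: ignore checkpoints that
    # another DDP rank already deleted instead of crashing.
    def _del_model(self, filepath):
        try:
            os.remove(filepath)
        except FileNotFoundError:
            # The other spawned process got here first; nothing left to do.
            pass


# Possible usage with 0.7.x-style Trainer flags (argument names assumed from
# that release, double-check against your installed version):
# trainer = Trainer(gpus=2, distributed_backend='ddp',
#                   checkpoint_callback=TolerantModelCheckpoint(filepath='checkpoints/'))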

kyoungrok0517 (author) commented Mar 13, 2020 via email

Borda added the question (Further information is requested) label and removed the bug (Something isn't working) label on Mar 14, 2020