After the first epoch finishes, training crashes with the error below. I'm using pytorch-lightning 0.7.1.
To Reproduce
Train using pytorch-lightning 0.7.1 (here with DDP, as the traceback shows) and let the first epoch finish.
Traceback (most recent call last):
File "sparsenet_trainer.py", line 115, in <module>
main(hparams)
File "sparsenet_trainer.py", line 67, in main
trainer.fit(model)
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 590, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 342, in ddp_train
self.run_pretrain_routine(model)
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 830, in run_pretrain_routine
self.train()
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 343, in train
self.run_training_epoch()
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 452, in run_training_epoch
self.call_checkpoint_callback()
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 737, in call_checkpoint_callback
self.checkpoint_callback.on_validation_end(self, self.get_model())
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 204, in on_validation_end
self._do_check_save(filepath, current, epoch)
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 221, in _do_check_save
self._del_model(delpath)
File "/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 121, in _del_model
os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '/data/Code/trec-2019-deep-learning/trec2019/model/sparsenet/default/version_5/checkpoints/epoch=0.ckpt'
/home/kyoungrok/anaconda3/envs/trec/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
len(cache))
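One possible reading of the traceback, which is an assumption and not confirmed here: each DDP process runs the checkpoint callback, so more than one process tries to delete the same `epoch=0.ckpt`, and whichever arrives second hits `FileNotFoundError`. Under that assumption, a deletion that tolerates an already-missing file would avoid the crash. The sketch below only illustrates the idea; `remove_if_exists` is a hypothetical helper, not part of pytorch_lightning.

```python
import os
import tempfile


def remove_if_exists(filepath: str) -> bool:
    """Delete filepath. Return True if this call removed it,
    False if the file was already gone (e.g. removed by another process)."""
    try:
        os.remove(filepath)
        return True
    except FileNotFoundError:
        # Another process may have deleted the file first; treat as success.
        return False


if __name__ == "__main__":
    # Simulate two processes deleting the same checkpoint file in turn.
    fd, ckpt_path = tempfile.mkstemp(suffix=".ckpt")
    os.close(fd)
    print(remove_if_exists(ckpt_path))  # first caller deletes the file
    print(remove_if_exists(ckpt_path))  # second caller finds it missing, no exception
```

Catching the exception rather than checking `os.path.exists` first avoids a window where another process deletes the file between the check and the removal.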
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: TITAN V
GPU 1: TITAN V
Nvidia driver version: 440.64
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.4
Versions of relevant libraries:
[pip] numpy==1.17.2
[pip] numpydoc==0.9.1
[pip] pytorch-lightning==0.7.1
[pip] torch==1.4.0
[pip] torchtext==0.5.0
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.14 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-lightning 0.7.1 pypi_0 pypi
[conda] torchtext 0.5.0 pypi_0 pypi
[conda] torchvision 0.5.0 py37_cu101 pytorch