This error does not happen in lightning==2.0.1, which is what I had installed by default from viscy. However, when I tried upgrading to lightning 2.3.0.dev0 to circumvent the caching timeout issue here, I got the following error, which means we will have to make sure our tensors are on the right device according to this. Flagging it in case you also encounter it @ziw-liu.
Traceback (most recent call last):
  File "/hpc/projects/comp.micro/virtual_staining/models/fcmae-3d/fit/pretrain_scratch_path.py", line 141, in <module>
    trainer.fit(model, data)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 142, in run
    return self.on_run_end()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 254, in on_run_end
    self._on_evaluation_epoch_end()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 336, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 195, in on_epoch_end
    metrics = self.metrics
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py", line 234, in metrics
    return self.trainer._results.metrics(on_step)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 483, in metrics
    value = self._get_cache(result_metric, on_step)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 447, in _get_cache
    result_metric.compute()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 289, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py", line 249, in compute
    value = self.meta.sync(self.value.clone())  # `clone` because `sync` is in-place
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/pytorch/strategies/ddp.py", line 342, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 173, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/lightning/fabric/utilities/distributed.py", line 223, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/viscy/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1992, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: No backend type associated with device type cpu
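Per the last frames, the failure comes from `torch.distributed.all_reduce` receiving a CPU tensor while the DDP process group only has a CUDA (NCCL) backend. A minimal sketch of the workaround, with a hypothetical helper (not VisCy code): ensure any value destined for synced logging lives on the module's device before it reaches `self.log(..., sync_dist=True)`.

```python
import torch


def device_safe_metric(value, device):
    """Return `value` as a float tensor on `device`, so that a DDP
    all_reduce over an NCCL process group never sees a CPU tensor."""
    return torch.as_tensor(value, dtype=torch.float32, device=device)


# torch.tensor(0.5) defaults to CPU; inside a LightningModule one would
# pass self.device here before calling self.log(..., sync_dist=True).
metric = device_safe_metric(0.5, torch.device("cpu"))
print(metric.item())
```

Logging a plain Python float instead of a hand-built CPU tensor also sidesteps the problem, since Lightning then places the value itself.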
@ziw-liu and I decided to copy everything to fry1 and access it through /hpc/nodes/fry1/....
This issue is for later, when either the Lustre FS problem is solved or we want to do caching without the timeout error. The timeout error was fixed in 2.3.0.dev0. I installed that version earlier in the week to see if it would work, but ran into some complications, so I am documenting it here as a reminder that this can happen.