🐛 Bug

In my D2Go model training, the evaluation metrics are reduced on rank 0 and logged in the validation_epoch_end hook; the other ranks just log an empty dict.
However, the recent change #6417 always runs _all_gather, and training now gets stuck there until an NCCL timeout error kicks in.

Please reproduce using the BoringModel

To Reproduce

Apply the following code in the validation_epoch_end hook and run with 2 GPUs: https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing
(Note: I'm not sure how to run distributed training in Colab, so I'm not able to reproduce it in the notebook above.)
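In case the notebook is unavailable, here is a minimal BoringModel-style sketch of the pattern described above. It is an illustration, not the exact notebook contents: the metric name is made up, and the Trainer flags (gpus=2, accelerator="ddp") assume the Lightning version current at the time.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_step(self, batch, batch_idx):
        return self(batch).sum()

    def validation_epoch_end(self, outputs):
        # Only rank 0 has the reduced evaluation results; the other ranks
        # have nothing to log, mirroring the D2Go/Detectron2 evaluator.
        if self.trainer.is_global_zero:
            self.log("val_metric", torch.tensor(1.0))
        # other ranks log nothing -> mismatched keys across processes

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    trainer = Trainer(gpus=2, accelerator="ddp", max_epochs=1,
                      limit_train_batches=2, limit_val_batches=2)
    trainer.fit(model,
                DataLoader(RandomDataset(32, 64), batch_size=2),
                DataLoader(RandomDataset(32, 64), batch_size=2))
```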
Expected behavior

Do not try to reduce on missing metric keys.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
PyTorch Version (e.g., 1.0):
OS (e.g., Linux):
How you installed PyTorch (conda, pip, source):
Build command you used (if compiling from source):
Python version:
CUDA/cuDNN version:
GPU models and configuration:
Any other relevant information:
Additional context
Would you mind sharing why you need to log different metrics on the different ranks?
Lightning expects the same keys on every process when a reduction is performed across processes; mismatched keys will cause the processes to hang.
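For reference, a sketch of the symmetric pattern this implies, replacing validation_epoch_end in the BoringModel sketch above (the key name is made up; sync_dist=True asks Lightning to do the cross-process reduction):

```python
def validation_epoch_end(self, outputs):
    # every rank logs the same key, so the reduction at epoch end
    # sees matching metrics on all processes and nothing hangs
    local_metric = torch.stack(outputs).mean()
    self.log("val_metric", local_metric, sync_dist=True)
```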
Thanks for looking into this! In Detectron2, predictions are gathered on rank 0, which then returns the evaluated and reduced results. There is no need for the other ranks to evaluate their local predictions, so they just return an empty dict.
I tried to log results only on rank 0 with sync_dist=False, but the process still hangs at the warning.
Another use case I've run into is that, because of error handling, some ranks don't return the full set of metrics.
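One possible way to keep the rank-0 evaluation while still giving every process the same keys is to broadcast the reduced dict before logging. This is only a sketch, assuming torch.distributed.broadcast_object_list is available (PyTorch >= 1.8); run_rank0_evaluation is a hypothetical stand-in for the D2Go/Detectron2 evaluator:

```python
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # rank 0 runs the Detectron2-style evaluation; other ranks contribute None
    results = [self.run_rank0_evaluation(outputs) if self.trainer.is_global_zero else None]
    if dist.is_available() and dist.is_initialized():
        # broadcast the reduced metrics dict so every process sees the same keys
        dist.broadcast_object_list(results, src=0)
    for name, value in (results[0] or {}).items():
        self.log(name, value)
```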
@kazhang another option is to distribute the evaluation over the different GPUs, as we do in the torchvision detection reference scripts. Not only does this make evaluation faster, it also avoids all but one worker sitting idle.
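A rough sketch of that idea, not the torchvision code itself: each rank scores its own shard and the per-rank dicts are gathered on every process. It assumes torch.distributed.all_gather_object (PyTorch >= 1.8), and evaluate_local_predictions is a hypothetical per-rank evaluator returning the same keys on every rank. Note that naively averaging per-shard scores is a simplification; metrics like COCO mAP generally need the raw predictions merged across processes before scoring, which is what the torchvision reference scripts actually do.

```python
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # each rank evaluates only its own shard of the validation set
    local_metrics = self.evaluate_local_predictions(outputs)
    gathered = [None] * dist.get_world_size()
    # collect every rank's dict so all processes can merge identical results
    dist.all_gather_object(gathered, local_metrics)
    merged = {k: sum(r[k] for r in gathered) / len(gathered) for k in gathered[0]}
    for name, value in merged.items():
        self.log(name, value)  # identical keys and values on every rank
```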