Training is stuck if ranks don't have the same keys in metrics dict at the end of validation epoch #7129

Closed
kazhang opened this issue Apr 20, 2021 · 3 comments · Fixed by #7132
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments

@kazhang
Contributor

kazhang commented Apr 20, 2021

🐛 Bug

In my D2Go model training, the evaluation metrics are reduced on rank 0 and logged in the validation_epoch_end hook; the other ranks just log an empty dict.
However, the recent change #6417 always runs _all_gather, and I noticed that training gets stuck there until an NCCL timeout error kicks in.

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post it here.

Apply the following code in the validation_epoch_end hook and run with 2 GPUs.

# get_rank is not imported in the original snippet; torch.distributed.get_rank works here
from torch.distributed import get_rank

def validation_epoch_end(self, outputs) -> None:
    rank = get_rank()
    res = {}
    if rank == 0:
        # reduce metric across ranks; only rank 0 ends up with this key
        res = {"reduced_metric": 0.1}
    self.log_dict(res)

https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing
(note: I'm not sure how to run distributed training in Colab, so I wasn't able to reproduce this in the notebook above)

Expected behavior

Do not try to reduce on missing metric keys.
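
In the meantime, a caller-side workaround (a minimal sketch, assuming torch.distributed is initialized with a default process group; this is not the Lightning-side fix tracked in #7132) is to broadcast the reduced value from rank 0 so that every rank logs the same keys:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs) -> None:
    # Compute/reduce the metric on rank 0, then broadcast it so that every
    # rank calls log_dict with an identical set of keys.
    value = torch.zeros(1, device=self.device)
    if dist.get_rank() == 0:
        value += 0.1  # placeholder for the metric actually reduced on rank 0
    dist.broadcast(value, src=0)
    self.log_dict({"reduced_metric": value})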

Environment

Note: Bugs with code are solved faster! The Colab notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@kazhang kazhang added bug Something isn't working help wanted Open to be worked on labels Apr 20, 2021
@kaushikb11 kaushikb11 added the priority: 0 High priority task label Apr 20, 2021
@tchaton
Contributor

tchaton commented Apr 21, 2021

Dear @kazhang,

Would you mind sharing why you need to log different metrics on different ranks?
Lightning expects the same keys on every rank when reductions are performed across processes; mismatched keys will cause the processes to hang.

Best,
T.C

@kazhang
Contributor Author

kazhang commented Apr 21, 2021

Dear @tchaton,

Thanks for looking into this! In Detectron2, predictions are gathered on rank 0, which evaluates them and returns the reduced results. There is no need for the other ranks to evaluate their local predictions, so they just return an empty dict. The pattern looks roughly like the sketch below.
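
A minimal sketch of that pattern (comm is detectron2.utils.comm; evaluate_predictions is a hypothetical placeholder for the Detectron2 evaluator):

from detectron2.utils import comm

def validation_epoch_end(self, outputs) -> None:
    # Gather the per-rank prediction lists onto rank 0.
    gathered = comm.gather(outputs, dst=0)
    res = {}
    if comm.is_main_process():
        # Only rank 0 evaluates the merged predictions; the other ranks
        # return an empty dict, which is what triggers the hang described above.
        merged = [p for per_rank in gathered for p in per_rank]
        res = evaluate_predictions(merged)  # hypothetical evaluator returning a metrics dict
    self.log_dict(res)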

I tried to log results only on rank 0 with sync_dist=False, but the process still hangs at the warning.

Another use case I ran into: because of error handling, some ranks don't return the full set of metrics.

Regards,
Kai

@fmassa

fmassa commented Apr 27, 2021

@kazhang another option is to distribute the evaluation across the GPUs, as we do in the torchvision detection reference scripts. Not only does this make evaluation faster, it also avoids leaving all but one worker idle. A rough sketch is below.
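
A minimal sketch of that pattern (the per-output "correct"/"total" counters are hypothetical stand-ins for whatever partial statistics each rank accumulates over its shard):

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs) -> None:
    # Each rank evaluates only its own shard of the validation set; the partial
    # statistics are then summed across ranks so every rank logs the same value.
    correct = torch.tensor([sum(o["correct"] for o in outputs)], dtype=torch.float, device=self.device)
    total = torch.tensor([sum(o["total"] for o in outputs)], dtype=torch.float, device=self.device)
    dist.all_reduce(correct)
    dist.all_reduce(total)
    self.log_dict({"accuracy": correct / total})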
