Training is stuck if ranks don't have the same keys in metrics dict at the end of validation epoch #7129

Closed
kazhang opened this issue Apr 20, 2021 · 3 comments · Fixed by #7132
Labels
bug (Something isn't working) · help wanted (Open to be worked on) · priority: 0 (High priority task)

Comments

@kazhang
Contributor

kazhang commented Apr 20, 2021

🐛 Bug

In my D2Go model training, the evaluation metrics are reduced on rank 0 and logged in the validation_epoch_end hook; the other ranks just log an empty dict.
However, the recent change #6417 always runs _all_gather, and I noticed that training gets stuck there until an NCCL timeout error kicks in.

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post it here.

Apply the following code in the validation_epoch_end hook and run with 2 GPUs.

# get_rank is not imported in the original snippet; torch.distributed.get_rank works here
from torch.distributed import get_rank

def validation_epoch_end(self, outputs) -> None:
    rank = get_rank()
    res = {}
    if rank == 0:
        # reduce metric across ranks; only rank 0 ends up with this key
        res = {"reduced_metric": 0.1}
    self.log_dict(res)

https://colab.research.google.com/drive/1HvWVVTK8j2Nj52qU4Q4YCyzOm0_aLQF3?usp=sharing
(note: I'm not sure how to run distributed training in Colab, so I wasn't able to reproduce this in the notebook above)

Expected behavior

Do not try to reduce on missing metric keys.
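
In the meantime, a caller-side workaround (a minimal sketch, assuming torch.distributed is initialized with a default process group; this is not the Lightning-side fix tracked in #7132) is to broadcast the reduced value from rank 0 so that every rank logs the same keys:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs) -> None:
    # Compute/reduce the metric on rank 0, then broadcast it so that every
    # rank calls log_dict with an identical set of keys.
    value = torch.zeros(1, device=self.device)
    if dist.get_rank() == 0:
        value += 0.1  # placeholder for the metric actually reduced on rank 0
    dist.broadcast(value, src=0)
    self.log_dict({"reduced_metric": value})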

Environment

Note: Bugs with code are solved faster! The Colab notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@kazhang kazhang added bug Something isn't working help wanted Open to be worked on labels Apr 20, 2021
@kaushikb11 kaushikb11 added the priority: 0 High priority task label Apr 20, 2021
@tchaton
Contributor

tchaton commented Apr 21, 2021

Dear @kazhang,

Would you mind sharing why you need to log different metrics on different ranks?
Lightning expects the same keys on every rank when reductions are performed across processes; mismatched keys will cause the processes to hang.

Best,
T.C

@kazhang
Contributor Author

kazhang commented Apr 21, 2021

Dear @tchaton,

Thanks for looking into this! In Detectron2, predictions are gathered on rank 0, which evaluates them and returns the reduced results. There is no need for the other ranks to evaluate their local predictions, so they just return an empty dict. The pattern looks roughly like the sketch below.
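
A minimal sketch of that pattern (comm is detectron2.utils.comm; evaluate_predictions is a hypothetical placeholder for the Detectron2 evaluator):

from detectron2.utils import comm

def validation_epoch_end(self, outputs) -> None:
    # Gather the per-rank prediction lists onto rank 0.
    gathered = comm.gather(outputs, dst=0)
    res = {}
    if comm.is_main_process():
        # Only rank 0 evaluates the merged predictions; the other ranks
        # return an empty dict, which is what triggers the hang described above.
        merged = [p for per_rank in gathered for p in per_rank]
        res = evaluate_predictions(merged)  # hypothetical evaluator returning a metrics dict
    self.log_dict(res)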

I tried to log results only on rank 0 with sync_dist=False, but the process still hangs at the warning.

Another use case I ran into: because of error handling, some ranks don't return the full set of metrics.

Regards,
Kai

@fmassa

fmassa commented Apr 27, 2021

@kazhang another option is to distribute the evaluation across the GPUs, as we do in the torchvision detection reference scripts. Not only does this make evaluation faster, it also avoids leaving all but one worker idle. A rough sketch is below.
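
A minimal sketch of that pattern (the per-output "correct"/"total" counters are hypothetical stand-ins for whatever partial statistics each rank accumulates over its shard):

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs) -> None:
    # Each rank evaluates only its own shard of the validation set; the partial
    # statistics are then summed across ranks so every rank logs the same value.
    correct = torch.tensor([sum(o["correct"] for o in outputs)], dtype=torch.float, device=self.device)
    total = torch.tensor([sum(o["total"] for o in outputs)], dtype=torch.float, device=self.device)
    dist.all_reduce(correct)
    dist.all_reduce(total)
    self.log_dict({"accuracy": correct / total})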
