Commit
Catch cuda driver shutdown error in NCCLWatchdog (pytorch#106503)
NCCLWatchdog has a design flaw: it spawns threads that call the CUDA API, but the CUDA API may already have been deinitialized by the time they run, which creates a race. This is a known issue with widespread impact (pytorch#90848).

I tested this fix with the repro command for pytorch#82632, `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail`. Instead of crashing, the run now logs messages containing the exception string for the CUDA driver shutdown error.

A partial fix already landed, but it applied too narrowly: pytorch@ec071a0. This PR copies that fix to one more case, plugging another hole.

We probably need a more thorough review, and should then either plug all the holes or design this differently.

Pull Request resolved: pytorch#106503
Approved by: https://github.com/kwen2501
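
The pattern described above (catch the exception raised when a watchdog thread touches CUDA after shutdown, log it, and let every other error propagate) can be sketched as follows. This is an illustration in Python, not the code from this PR, which changes PyTorch's C++ sources. The names `guarded_query`, `query_after_shutdown`, and `query_with_other_failure` are hypothetical, and the marker string "driver shutting down" is an assumption about what the error message contains.

```python
import logging

logger = logging.getLogger("watchdog_sketch")

# Assumption: the shutdown error can be recognized by this substring.
DRIVER_SHUTDOWN_MARKER = "driver shutting down"


def guarded_query(query_fn):
    """Call query_fn, swallowing only errors that look like a driver shutdown.

    Returns the query result, or None if the shutdown error was caught.
    Any other exception is re-raised unchanged.
    """
    try:
        return query_fn()
    except RuntimeError as exc:
        if DRIVER_SHUTDOWN_MARKER not in str(exc):
            raise  # unrelated failure: do not hide it
        # Expected during interpreter teardown: log instead of crashing.
        logger.info("Event query failed with exception: %s", exc)
        return None


# Hypothetical stand-ins for a watchdog thread's call into the CUDA API.
def query_after_shutdown():
    raise RuntimeError("CUDA error: driver shutting down")


def query_with_other_failure():
    raise RuntimeError("CUDA error: an illegal memory access was encountered")
```

Matching on the message text is fragile, which fits the commit's own caveat: each call site needs its own guard, so any site that was missed still crashes.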